AMBER 14 NVIDIA GPU
ACCELERATION SUPPORT



News and Updates

  • Apr 2014: AMBER 14 will be released soon. >30% performance improvement, peer to peer parallel support and many new GPU features will be added. Release announcements will be made at http://ambermd.org, and on the Amber mailing list.
     

  • Aug 2013: In collaboration with Exxact Corp we have put together an updated version of the AMBER Certified MD Workstation program. This provides a simple way to purchase machines for running GPU accelerated AMBER. The systems come with AMBER 12 pre-installed (AMBER 12 license required) and fully tested and are ready to run out of the box. A range of AMBER optimized specifications are available for different budget levels and all are fully customizable and include a full warranty. Free test drives of these machines are also available. (More details are available on the Recommended Hardware Page.)


Background

This page provides background on running AMBER v14 (PMEMD) with NVIDIA GPU acceleration. If you are using AMBER v12 please see the archived AMBER version 12 GPU pages.

One of the new features of AMBER 11 was the ability to use NVIDIA GPUs to massively accelerate PMEMD for both explicit solvent PME and implicit solvent GB simulations. This was further extended in AMBER 12 and the performance envelope has been pushed even further with AMBER 14. This work is by Ross Walker at the San Diego Supercomputer Center and Scott Le Grand at Amazon Web Services in collaboration with NVIDIA. While this GPU acceleration is considered to be production ready, and has been heavily used since its release as part of AMBER 11, it is still evolving and has not been tested as extensively as the CPU code has over the years, so users should still exercise caution when using it. The error checking is not as verbose in the GPU code as it is on the CPU. In particular, simulation failures such as atom collisions or other simulation instabilities will manifest themselves as CUDA launch errors or GPU download failures rather than as informative error messages.

If you encounter problems during a simulation on the GPU you should first try to run the identical simulation on the CPU to ensure that it is not your simulation setup which is causing problems. Feedback and questions should be posted to the Amber mailing list.

New in AMBER 14

AMBER 14 includes full GPU support in PMEMD and is a major update over AMBER 12. The focus for AMBER 14 has been on extending the performance envelope, supporting multi-GPU runs efficiently, and further increasing the features available in the GPU accelerated code. Key new features in the AMBER 14 GPU version of PMEMD include:

  1. ~30% performance improvement for single GPU runs.
  2. Addition of peer to peer support for multi-GPU runs providing enhanced multi-GPU scaling.
  3. Hybrid bitwise reproducible fixed point precision model as standard (SPFP)
  4. Support for Extra Points in Multi-GPU runs.
  5. Jarzynski Sampling
  6. GBSA support
  7. Multi-dimensional Replica Exchange (Temperature and Hamiltonian)
  8. Support for CUDA 5.0, 5.5 and 6.0
  9. Support for latest generation GPUs.
  10. Monte Carlo barostat support providing NPT performance equivalent to NVT.
  11. ScaledMD support.
  12. Improved accelerated MD (aMD) support.
  13. Explicit solvent constant pH support.
  14. NMR restraint support on multiple GPUs.
  15. Improved error messages and checking.
  16. Hydrogen mass repartitioning support (4fs time steps).

Support is provided for single GPU and multiple GPU runs. Employing multiple GPUs in a single simulation requires MPI and the pmemd.cuda.MPI executable. If you have multiple simulations to run then the recommended method is to use one GPU per job. The pmemd GPU code has been developed in such a way that for single GPU runs the PCI-E bus is only used for I/O. This sets AMBER apart from other MD packages since it means the CPU specs do not feature in the GPU code performance. As such, low-end, economical CPUs can be used. Additionally, it means that in a system containing 4 GPUs, 4 individual calculations can be run at the same time without interfering with each other's performance.

Authorship & Support

NVIDIA CUDA Implementation:

Scott Le Grand (NVIDIA)
Ross C. Walker (SDSC)*

*Corresponding author.

Further information relating to the specifics of the implementation, methods used to achieve performance while controlling accuracy, and details of validation are available or will be shortly from the following publications:

  • Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y

  • Andreas W. Goetz; Mark J. Williamson; Dong Xu; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j

  • Scott Le Grand; Andreas W. Goetz & Ross C. Walker* "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comp. Phys. Comm., 2013, 184, pp 374-380, DOI: 10.1016/j.cpc.2012.09.022

Funding for this work has been graciously provided by NVIDIA, The University of California (UC Lab 09-LR-06-117792), The National Science Foundation's (NSF) TeraGrid Advanced User Support Program through the San Diego Supercomputer Center and NSF SI2-SSE grants to Ross Walker (NSF1047875 / NSF1148276) and Adrian Roitberg (NSF1047919 / NSF1147910)

Citing the GPU Code

If you make use of any of this GPU support in your work please use the following citations (in addition to the standard AMBER 14 citation):

Explicit Solvent

  • Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y

Implicit Solvent

  • Andreas W. Goetz; Mark J. Williamson; Dong Xu; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j



Supported Features

The GPU accelerated version of PMEMD 14 supports explicit solvent PME or IPS simulations in all three canonical ensembles (NVE, NVT and NPT) as well as implicit solvent Generalized Born simulations. It has been designed to support as many of the standard PMEMD v14 features as possible; however, there are some current limitations that are detailed below. Some of these may be addressed in the near future, and patches released, with the most up to date list posted on the web page. The following options are NOT supported (as of the Apr 2014 AMBER GPU v14.0 release):

1) ibelly /= 0: Simulations using belly style constraints are not supported.
2) icfe /= 0: Support for TI is not currently implemented. It is hoped that an update will be released in a few months that adds support for TI.
3) if (igb /= 0 & cut < systemsize): GPU accelerated implicit solvent GB simulations do not support a cutoff.
4) nmropt > 1: Support is not currently available for nmropt > 1. In addition, for nmropt = 1 only features that do not change the underlying force field parameters are supported. For example, umbrella sampling restraints are supported, as are simulated annealing functions such as variation of Temp0 with simulation step. However, varying the VDW parameters with step is NOT supported.
5) nrespa /= 1: No multiple time stepping is supported.
6) vlimit /= -1: For performance reasons the vlimit function is not implemented on GPUs.
7) es_cutoff /= vdw_cutoff: Independent cutoffs for electrostatics and van der Waals are not supported on GPUs.
8) order > 4: PME interpolation orders of greater than 4 are not supported at present.
9) imin = 1 (in parallel): Minimization is only supported in the serial GPU code.
10) netfrc is ignored: The GPU code does not currently support the netfrc correction in PME calculations and the value of netfrc in the ewald namelist is ignored.
11) emil_do_calc /= 0: Emil is not supported on GPUs.
12) lj264 /= 0: The 12-6-4 potential is not supported on GPUs.
13) isgld > 0: Self guided Langevin is not supported on GPUs.
14) nmropt = 1 .and. for n=1,4; iat(n) < 0: The GPU code does not currently support the use of COM simulations.

Additionally there are some minor differences in the output format. For example the Ewald error estimate is NOT calculated when running on a GPU. It is recommended that you first run a short simulation using the CPU code to check the Ewald error estimate is reasonable and that your system is stable. With the exception of item 10 the above limitations are tested for in the code, however, it is possible that there are additional simulation features that have not been implemented or tested on GPUs.



Supported GPUs

GPU accelerated PMEMD has been implemented using CUDA and thus will only run on NVIDIA GPUs at present. Due to accuracy concerns with pure single precision the code uses a custom designed hybrid single / double / fixed precision model termed SPFP. This places the requirement that the GPU hardware supports both double precision and integer atomics meaning only GPUs with hardware revision 2.0 and later can be used. Support for hardware revision 1.3 was present in previous versions of the code but for code complexity and maintenance reasons has been deprecated in AMBER 14. At the time of Amber's release the following cards are supported (* = untested):

Hardware Version 3.0 / 3.5
  • Tesla K20/K20X/K40
  • Tesla K10
  • GTX-Titan / GTX-Titan-Black
  • GTX770 / 780 / 780Ti
  • GTX670 / 680 / 690
  • Quadro cards supporting SM3.0* or 3.5*
Hardware Version 2.0

Caution:
There are a number of cards termed GTS (instead of GTX) which are cut down models, often OEM versions. A common example is the GTS250. These budget cards will NOT work since they do not support hardware double precision.

GTX-Titan and GTX-780 cards require NVIDIA Driver version >= 319.60.

GTX-Titan-Black Edition and GTX-780TI cards require NVIDIA Driver version >= 337.09

Other cards not listed here may also be supported as long as they correctly implement the Hardware Revision 2.0, 3.0 or 3.5 specifications. Due to the larger graphics memories and extended testing offered by the Tesla series GPUs these are the recommended models although GeForce cards will work fine. Additionally you should ensure that all GPUs on which you plan to run PMEMD are connected to PCI-E 2.0 x 16 lane slots or better. For peer to peer the cards to be used need to be on the same IOH controller. If this is not the case then you will likely see significantly degraded performance in parallel. To simplify the process of selecting supported hardware we have teamed with Exxact Corp to provide AMBER Certified GPU computing solutions. For more information please refer to the Recommended Hardware section.

Note that you should ensure that all GPUs on which you plan to run PMEMD are connected to PCI-E 2.0 x16 slots or better, especially when running in parallel across multiple GPUs. If this is not the case then you will likely see degraded performance, although this effect is lessened in serial if you write to the mdout or mdcrd files infrequently (e.g. every 2000 steps or so). Scaling over multiple GPUs within a single node is possible if all are in x16 or better slots. It is also possible to run over multiple nodes using InfiniBand, but this is not recommended except for loosely coupled runs such as REMD. The main advantage of AMBER's approach to GPU implementation over other implementations such as NAMD and GROMACS is that it is possible to run multiple single GPU runs on a single node with little or no slowdown. For example, a node with 4 GTX-Titan cards could run 4 individual AMBER DHFR NPT calculations at the same time without slowdown, providing an aggregate throughput in excess of 400ns/day.

Optimum Hardware Designs for Running GPU Amber are provided on the Recommended Hardware page.



System Size Limits

In order to obtain the extensive speedups that we see with GPUs it is critical that all of the calculation takes place on the GPU within the GPU memory. This avoids the performance hit of copying to and from the GPU and allows us to achieve extensive speedups for realistic size systems, rather than requiring systems with millions of atoms (for which achievable sampling lengths are unrealistic) to show reasonable speedups. Unfortunately this means that the entire calculation must fit within the GPU memory. Additionally, we make use of a number of scratch arrays to achieve high performance, so the GPU memory usage can actually be higher than that of a typical CPU run. It also means, due to the way we had to initially implement parallel GPU support, that the memory usage per GPU does NOT decrease as you increase the number of GPUs. This is something we hope to fix in the future, but for the moment the atom count limitations imposed on systems by the GPU memory are roughly constant whether you run in serial or in parallel.

Since, unlike with CPUs, it is not possible to add more memory to a GPU (without replacing it entirely), and there is no concept of swap as there is on the CPU, the size of the GPU memory imposes hard limits on the number of atoms supported in a simulation. Early on within the mdout file you will find information on the GPU being used and an estimate of the amount of GPU and CPU memory required:

|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla C2070
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|--------------------------------------------------------

...

| GPU memory information:
| KB of GPU memory in use: 4638979
| KB of CPU memory in use: 790531

The reported GPU memory usage is likely an underestimate and meant for guidance only to give you an idea of how close you are to the GPU's memory limit. Just because it is less than the available Device Global Mem Size does not necessarily mean that it will run. You should also be aware that the GPU's available memory is reduced by 1/9th if you have ECC turned on.
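As a rough way to see how close a run is to the limit, the two figures can be pulled out of the mdout file and compared. The following is a minimal sketch, assuming the mdout lines have exactly the layout shown above; gpu_mem_report is a hypothetical helper name and is not part of AMBER.

```bash
#!/bin/bash
# gpu_mem_report (hypothetical helper, not part of AMBER):
# reads an mdout file and prints the estimated GPU memory in use
# as a share of the device's global memory. It relies on the lines
#   "CUDA Device Global Mem Size: <N> MB"
#   "KB of GPU memory in use: <N>"
# having the layout shown in the sample output above.
gpu_mem_report() {
    awk '
        /CUDA Device Global Mem Size:/ { total_mb = $(NF-1) }
        /KB of GPU memory in use:/     { used_kb  = $NF }
        END {
            if (total_mb == "" || used_kb == "") {
                print "GPU memory lines not found" > "/dev/stderr"
                exit 1
            }
            used_mb = used_kb / 1024
            printf "%.0f MB of %d MB (%.1f%%)\n", used_mb, total_mb, 100 * used_mb / total_mb
        }
    ' "$1"
}

# Usage: gpu_mem_report mdout
```

Because the reported usage is likely an underestimate, a high percentage here should be read as a warning sign rather than a guarantee in either direction.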

Memory usage is affected by the run parameters, in particular the size of the cutoff (larger cutoffs need more memory) and the ensemble being used. Additionally, the physical GPU hardware affects memory usage since the optimizations used are non-identical for different GPU types. Typically, for PME runs, memory usage runs:

NPT > NVT > NVE
NTT=3 > NTT=1 == NTT=2 > NTT=0
Barostat=2 > Barostat=1

Use of restraints etc. will also increase the amount of memory in use, as will the density of your system: the higher the density, the more pairs per atom there are and thus the more GPU memory will be required. The following table provides an approximate UPPER BOUND to the number of atoms supported as a function of GPU model. These numbers were estimated using boxes of TIP3P water (PME) and solvent caps of TIP3P water (GB). These had lower than optimum densities, so for dense solvated proteins you may find you are actually limited to around 20% less than the numbers here. Nevertheless these should provide reasonable estimates to work from.

All numbers are for SPFP precision and are approximate limits. The actual limits will depend on system density, simulation settings etc. These numbers are thus designed to serve as guidelines only.

Explicit Solvent (PME)

8 angstrom cutoff. Cubic box of TIP3P Water, NTT=3, NTB=1, NTP=1,NTF=2,NTC=2,DT=0.002.

GPU type                    Memory     Max atoms (PME, SPFP)
GTX580                      3.0 GB     1,240,000
M2090 (ecc off)             6.0 GB     2,680,000
GTX680                      2.0 GB       920,000
K10/GTX680 (ecc off)        4.0 GB     1,810,000
K20X/GTX-Titan (ecc off)    6.0 GB     2,710,000
K40 (ecc off)              12.0 GB     5,021,000

Implicit Solvent (GB)

No cutoff, Sphere of TIP3P Water (for testing purposes only), NTT=3, NTB=0, NTF=2, NTC=2, DT=0.002, IGB=1

GPU type                    Memory     Max atoms (GB, SPFP)
GTX580                      3.0 GB     1,195,000
M2090 (ecc off)             6.0 GB     1,740,000
GTX680                      2.0 GB       960,000
K10/GTX680 (ecc off)        4.0 GB     1,310,000
K20X/GTX-Titan (ecc off)    6.0 GB     1,750,000
K40 (ecc off)              12.0 GB     3,350,000
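
The roughly 20% reduction suggested earlier for dense solvated proteins can be applied to any of the tabulated limits with simple integer arithmetic. This is only an illustrative sketch; dense_limit is a hypothetical helper name, and the 80% factor is just the approximate figure quoted in the text, not a measured bound.

```bash
#!/bin/bash
# dense_limit (hypothetical helper): scale a tabulated atom limit by 80%,
# i.e. the "around 20% less" rule of thumb for dense solvated proteins.
# The argument is a plain integer without thousands separators.
dense_limit() {
    echo $(( $1 * 80 / 100 ))
}

# Example: K20X/GTX-Titan PME limit of 2,710,000 atoms from the table above.
dense_limit 2710000
```

For the K20X/GTX-Titan PME entry this gives 2168000 atoms as a more conservative planning figure.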



Accuracy Considerations

The nature of current generation GPUs is such that single precision arithmetic is considerably faster (>23x for K10, >2x for K20) than double precision arithmetic. This poses an issue when trying to obtain good performance from GPUs. Traditionally the CPU code in Amber has always used double precision throughout the calculation. While this full double precision approach has been implemented in the GPU code, it gives very poor performance. Thus the AMBER GPU implementation was initially written to use a combination of single and double precision, termed hybrid precision (SPDP), that is discussed in further detail in the references provided at the beginning of this page. This approach used single precision for individual calculations within the simulation but double precision for all accumulations. It also used double precision for SHAKE calculations and for other parts of the code where loss of precision was deemed to be unacceptable. Tests have shown that energy conservation is equivalent to the full double precision code and specific ensemble properties, such as order parameters, match the full double precision CPU code. The user should understand, though, that this approach leads to divergence between GPU and CPU simulations, similar to that observed when running the CPU code across different processor counts in parallel but occurring more rapidly. For this reason the GPU test cases are more sensitive to rounding differences caused by hardware and compiler variations and will likely require manual inspection of the test case diff files in order to verify that the installation is providing correct results. There can also be differences between CPU and GPU runs, and between runs on different GPU models, for simulations that rely on the random number stream (for example NTT=2 and NTT=3), because the random number stream is different between CPUs and GPUs.

With AMBER 14 the single and double precision models (SPSP, SPDP and DPDP) have been deprecated and replaced with a new hybrid model that combines single precision calculation with fixed precision accumulation. This is described in detail in the following manuscript, and tests have shown that it provides accuracy that is as good as or better than the original SPDP model. This new precision model is supported on v2.0 hardware and newer and is especially beneficial on Kepler I (3.0) hardware.

Scott Le Grand; Andreas W. Goetz; & Ross C. Walker* "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations.", Comp. Phys. Comm, 2013, 184, pp374-380, DOI: 10.1016/j.cpc.2012.09.022

While the default precision model is currently the new SPFP model there is also a DPFP model available which provides full double precision calculation with bitwise reproducible fixed precision accumulation to facilitate advanced testing and comparison. The precision models supported in AMBER 14, and determined at compile time as described later, are:

  • SPFP - (Default) Uses a combination of single precision for calculation and fixed (integer) precision for accumulation. This approach provides optimum speedup on Kepler cards, minimizes memory overhead and provides greater net precision than the original SPDP model. It is designed for optimum performance on hardware revision 2.0 or later.
     
  • DPFP - Uses double precision (and double precision equivalent fixed precision) for the entire calculation. This provides for careful regression testing against the CPU code. It makes no additional approximations above and beyond the CPU implementation and would be the model of choice if performance was not a consideration. On v2.0 NVIDIA hardware (e.g. M2090) the performance is approximately half that of the SPFP model while on v3.0 NVIDIA hardware (e.g. K10) the performance is substantially less than the SPFP model.

Recommendation for Minimization

One limitation of the SPFP precision model is that forces can be truncated if they overflow the fixed precision representation. This should never be a problem during MD simulations for any well behaved system. However, for minimization or very early in the heating phase it can present a problem. This is especially true if two atoms are close to each other and thus have large VDW repulsions. It is therefore recommended that for minimization you use the CPU version of the code. Only in situations where you are confident the structure is reasonable, for example if it was a snapshot from dynamics, should you use the GPU code (SPFP) for minimization.
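
The recommendation above amounts to a two-stage workflow: minimize with the CPU executable, then start dynamics on the GPU from the minimized restart file. The sketch below illustrates this under stated assumptions: min_then_md is a hypothetical helper, and all input and output file names (min.in, md.in, min.rst and so on) are placeholders rather than files supplied with AMBER.

```bash
#!/bin/bash
# min_then_md (hypothetical helper): CPU minimization followed by GPU dynamics.
# All file names are placeholders. Set RUN=echo to print the commands
# instead of executing them.
min_then_md() {
    # Stage 1: minimize on the CPU, where large forces are not truncated.
    ${RUN:-} "$AMBERHOME/bin/pmemd" -O -i min.in -o min.out \
        -p prmtop -c inpcrd -r min.rst || return 1

    # Stage 2: run dynamics on the GPU, starting from the minimized structure.
    ${RUN:-} "$AMBERHOME/bin/pmemd.cuda" -O -i md.in -o md.out \
        -p prmtop -c min.rst -r md.rst -x md.crd
}

# Usage: RUN=echo min_then_md    (preview)
#        min_then_md             (run both stages)
```

The second stage only starts if the first one exits successfully, so a failed minimization does not silently feed a bad structure to the GPU run.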



Installation and Testing

The single GPU version of PMEMD is called pmemd.cuda while the multi-GPU version is called pmemd.cuda.MPI. These are built separately from the standard serial and parallel installations. Before attempting to build the GPU versions of PMEMD you should have built and tested at least the serial version of Amber and preferably the parallel version as well. This will help ensure that basic issues relating to standard compilation on your hardware and operating system do not lead to confusion with GPU related compilation and testing problems. You should also be reasonably familiar with Amber's compilation and test procedures. The minimum requirements for building the GPU version of PMEMD are, as of the AMBER 14 release:

  • NVIDIA Toolkit v5.0, 5.5 or 6.0. (v5.0 recommended)
     
  • NVIDIA GPU supporting Hardware Revision 2.0 or 3.0 and later.
     
  • NVIDIA CUDA Driver v319.60 or later (v337.09 or later required for GTX-780Ti and GTX-Titan-Black)
     
  • MPI library for parallel GPU. (MVAPICH2 v1.8 or later / MPICH2 v1.4p1 or later recommended)

It is assumed that you have already correctly installed and tested CUDA support on your GPU. Before attempting to build pmemd.cuda you should make sure you have correctly installed both the NVIDIA Toolkit (nvcc compiler) and a CUDA supporting NVIDIA driver. You may want to try downloading the NVIDIA CUDA SDK (available from http://www.nvidia.com/ or included as 'samples' in the toolkit v5.0 or later) and see if you can build that. Additionally the environment variable CUDA_HOME should be set to point to your NVIDIA Toolkit installation and $CUDA_HOME/bin/ should be in your path. At the time of release CUDA 5.0 or later is required with 5.0, 5.5 and 6.0 having been tested and officially supported. For performance reasons, at the time of writing, CUDA 5.0 is currently the recommended version to use.
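
The environment setup described above might look like the following for bash. This is a sketch only: /usr/local/cuda is an assumed install location used as a fallback when CUDA_HOME is not already set, and should be replaced with wherever the NVIDIA Toolkit lives on your system.

```bash
#!/bin/bash
# Assumed toolkit location (adjust to your system); an existing CUDA_HOME
# setting is left untouched.
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export PATH="$CUDA_HOME/bin:$PATH"

# Report whether the nvcc compiler is now reachable before running configure.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | tail -n 1
else
    echo "nvcc not found; check CUDA_HOME ($CUDA_HOME)" >&2
fi
```

Placing these lines in your shell startup file keeps the settings across sessions.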

Building and Testing the Default SPFP Precision Model

Single GPU (pmemd.cuda)

Assuming you have a working CUDA installation you can build the single GPU version, pmemd.cuda, using the default SPFP precision model as follows:

cd $AMBERHOME
make clean
./configure -cuda gnu    # or: ./configure -cuda intel
make install

Next you can run the tests using the default GPU (the one with the largest memory) with:

make test.cuda

The majority of these tests should pass. However, given the parallel nature of GPUs, meaning the order of operation is not well defined, the limited precision of the SPFP precision model, and variations in the random number generator on different GPU hardware, it is not uncommon for there to be several possible failures. You may also see some tests, particularly the cellulose test, fail on GPUs with limited memory. You should inspect the diff file created in the $AMBERHOME/logs/test_amber_cuda/ directory to manually verify any possible failures. Differences which occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or if you are unsure, should be posted to the Amber mailing list for comment.
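
When several tests report possible failures, it can help to list the diff files with their sizes so that short diffs can be told apart from long ones at a glance. The sketch below assumes the diff files in the log directory end in .diff, which is an assumption about naming and not something stated above; summarize_diffs is a hypothetical helper and not part of the AMBER test suite.

```bash
#!/bin/bash
# summarize_diffs (hypothetical helper): list every *.diff file under a log
# directory together with its line count. The *.diff naming is an assumption.
# Defaults to the test log directory named in the text above.
summarize_diffs() {
    local dir="${1:-$AMBERHOME/logs/test_amber_cuda}"
    find "$dir" -type f -name '*.diff' | sort | while IFS= read -r f; do
        printf '%6d lines  %s\n' "$(( $(wc -l < "$f") ))" "$f"
    done
}

# Usage: summarize_diffs
#        summarize_diffs /path/to/other/logdir
```

A short diff is not automatically harmless, so the files still need to be read; the listing only helps decide where to look first.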

Multiple GPU (pmemd.cuda.MPI)

Once you have built and tested the serial GPU version you can optionally build the parallel version (if you have multiple GPUs of the same model). Unlike the CPU code it is not necessary to build the parallel version of the GPU code in order to access specific simulation options (except REMD). Thus you only need to build the parallel GPU code if you plan to run a single calculation across multiple GPUs. AMBER 12 had poor multi-GPU scaling for GPUs later than the Fermi (C2050/2070/2090) generation. This scaling has been improved significantly in AMBER 14 via the use of peer to peer communication over the PCI-E bus. While the underlying communication between GPUs is peer to peer if supported the code still requires MPI in order to run.

No special software beyond CUDA 5.0 and an MPI library is required to support peer to peer. For details on determining which GPUs can operate in peer to peer mode, please see the multi-GPU discussion in the Running GPU Accelerated Simulations section below. The instructions here assume that you can already successfully build the MPI version of the CPU code. If you cannot, then you should focus on solving this before you move on to attempting to build the parallel GPU version.

The parallel GPU version of the code works using MPI v1 or later.

You can build the multi-GPU code as follows:

cd $AMBERHOME
make clean
./configure -cuda -mpi gnu    # or: ./configure -cuda -mpi intel
make install

Next you can run the tests using GPUs enumerated sequentially within a node (if you have multiple nodes or more complex GPU setups within a node then you should refer to the discussion below on running on multiple GPUs):

export DO_PARALLEL='mpirun -np 2' # for bash/sh
setenv DO_PARALLEL 'mpirun -np 2' # for csh/tcsh
./test_amber_cuda_parallel.sh

Building non-standard Precision Models

You can build different precision models as described below. However, be aware that this is meant largely for debugging and testing and NOT for running production calculations. Please post any questions or comments you may have regarding this to the Amber mailing list. You should also be aware that the variation in test case results due to rounding differences will be markedly higher when testing the SPFP precision model. You select which precision model to compile as follows:

cd $AMBERHOME
make clean
./configure -cuda_DPFP gnu    # use -cuda_SPFP for the SPFP model
make install

This will produce executables named pmemd.cuda_XXXX where XXXX is the precision model selected at configure time (SPFP or DPFP). You can then test this on the GPU with the most memory as follows:

cd $AMBERHOME/test/
./test_amber_cuda.sh DPFP    # to test the DPFP precision model

Testing Alternative GPUs

Should you wish to run the serial GPU tests on a GPU different from the one with the most memory (and lowest GPU ID if more than one identical GPU exists) then you can provide this by setting the CUDA_VISIBLE_DEVICES environment variable. For example, to test the GPU with ID = 2 and the default SPFP precision model you would specify:

cd $AMBERHOME
export CUDA_VISIBLE_DEVICES=2
make test.cuda



Running GPU Accelerated Simulations

Single GPU

In order to run a single GPU accelerated MD simulation the only change required is to use the executable pmemd.cuda in place of pmemd. E.g.

$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd

This will automatically run the calculation on the GPU with the most memory, even if that GPU is already in use (see below for system settings to have the code auto select unused GPUs). If you have only a single CUDA capable GPU in your machine then this is fine. However, if you want to control which GPU is used (for example, you have a Tesla C2050 (3GB) and a Tesla C2070 (6GB) in the same machine and want to use the C2050, which has less memory), or you want to run multiple independent simulations on different GPUs, then you need to specify the GPU ID manually using the CUDA_VISIBLE_DEVICES environment variable. This variable lists which devices are visible as a comma-separated string. For example, if your desktop has two Tesla cards and a Quadro:

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C2070"
Device 2: "Quadro FX 3800"

By setting CUDA_VISIBLE_DEVICES you can make only a subset of them visible to the runtime:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"

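Hence if you wanted to run two pmemd.cuda runs, with the first running on the C2050 and the second on the C2070, you would run as follows (launching each from its own working directory, since both commands use the same output file names):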

$ export CUDA_VISIBLE_DEVICES="0"
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

$ export CUDA_VISIBLE_DEVICES="1"
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

In this way you only ever expose a single GPU to the pmemd.cuda executable and so avoid issues with the running of multiple runs on the same GPU. This approach is the basis of how you can control GPU usage in parallel runs.
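
The pattern above can be wrapped in a small function so that each job sees exactly one GPU and writes to its own output files. This is a sketch under stated assumptions: run_on_gpu and the DRY_RUN switch are hypothetical names introduced here, and the per-job suffix on mdout, restrt and mdcrd is just one way to keep outputs from overwriting each other.

```bash
#!/bin/bash
# run_on_gpu (hypothetical helper): start one pmemd.cuda job restricted to a
# single GPU ID, tagging the output files so that runs do not collide.
#   $1 = GPU ID as enumerated by deviceQuery
#   $2 = tag appended to the output file names
# Set DRY_RUN=1 to print the command instead of launching it.
run_on_gpu() {
    local gpu="$1" tag="$2"
    set -- "$AMBERHOME/bin/pmemd.cuda" -O -i mdin -o "mdout.$tag" -p prmtop \
           -c inpcrd -r "restrt.$tag" -x "mdcrd.$tag"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "CUDA_VISIBLE_DEVICES=$gpu $*"
    else
        # The variable is set for this one command only, so nothing needs
        # to be exported or reset between jobs.
        CUDA_VISIBLE_DEVICES="$gpu" nohup "$@" </dev/null >/dev/null 2>&1 &
    fi
}

# Usage: run_on_gpu 0 c2050
#        run_on_gpu 1 c2070
```

Setting the variable on the command line rather than with export avoids the risk of a later job inheriting a stale CUDA_VISIBLE_DEVICES value.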

An alternative approach, and the one recommended if you have multiple GPUs in a single node, is to set them to Persistence and Compute Exclusive Modes. In this mode a GPU will reject more than one job. pmemd.cuda and pmemd.cuda.MPI are capable of detecting this and will automatically run on a free GPU. For example, suppose you have 4 GPUs in your node and you run 5 calculations:

$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.1 -x mdcrd.1 -inf mdinfo.1 &
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.2 -x mdcrd.2 -inf mdinfo.2 &
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.3 -x mdcrd.3 -inf mdinfo.3 &
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.4 -x mdcrd.4 -inf mdinfo.4 &
$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.5 -x mdcrd.5 -inf mdinfo.5 &

In this situation calculation 1 will run on GPU 0, 2 on GPU 1, 3 on GPU 2 and 4 on GPU 3. The 5th job will quit immediately with an error stating that no free GPUs are available.

cudaMemcpyToSymbol: SetSim copy to cSim failed all CUDA-capable devices are busy or unavailable

This approach is useful since it means you do not have to worry about setting CUDA_VISIBLE_DEVICES or about accidentally running multiple jobs on the same GPU. This approach also works in parallel: the code will automatically look for GPUs that are not in use, use them, and quit if sufficient free GPUs are not available. You set Persistence and Compute Exclusive Modes by running the following as root:

$ nvidia-smi -pm 1
$ nvidia-smi -c 3

The disadvantage of this is that you need root permissions to set it and the setting is lost on a reboot. We recommend that you add these settings to your system's startup scripts. In Red Hat or CentOS this can be accomplished with:

echo "nvidia-smi -pm 1" >> /etc/rc.d/rc.local
echo "nvidia-smi -c 3" >> /etc/rc.d/rc.local

Note that this approach also works on clusters where queuing systems understand GPUs as resources, and thus can keep track of the total GPUs allocated, but do not control which GPU you see on a node.
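
One caveat with appending to a startup script using echo and >> is that running the setup twice adds the lines twice. A possible way around this is sketched below; append_once is a hypothetical helper name, and the rc.local path in the usage comment is the one given above for Red Hat or CentOS.

```bash
#!/bin/bash
# append_once (hypothetical helper): add a line to a file only if that exact
# line is not already present, so repeated setup runs do not duplicate it.
#   $1 = line to add
#   $2 = file to append to
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}

# Usage (as root):
#   append_once "nvidia-smi -pm 1" /etc/rc.d/rc.local
#   append_once "nvidia-smi -c 3"  /etc/rc.d/rc.local
```

The grep options match the whole line as a fixed string, so a commented-out or partial copy of the command does not count as already present.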

If you'd prefer more control or monitoring of GPU jobs then the following scripts provide ways to monitor usage and interface to the CUDA_VISIBLE_DEVICES approach.

Tracking GPU Usage

In order to determine which GPUs are currently running jobs and which are available for new job submissions, one can check device utilization reported by the NVIDIA-SMI tool (installed as part of the nvidia toolkit). While the device IDs assigned by NVIDIA-SMI do not necessarily correspond to those assigned by deviceQuery, the two can be correlated via PCI bus IDs. Below are two simple bash scripts that demonstrate how GPU utilization, as well as other state information from the NVIDIA-SMI can be returned in terms of deviceQuery IDs for use in GPU monitoring or guiding selection of CUDA_VISIBLE_DEVICES. Note that the NVIDIA-SMI does not support newer GTX series cards for return of most state information (including device utilization), simply printing N/A for the given field. As such, these scripts as written may only be useful for Tesla series cards. The scripts assume deviceQuery is in the user's PATH.

Example 1: gpu-info
This script prints a concise report of useful information for all GPUs in a node, including which are currently utilized (running jobs). Unutilized GPUs can be assigned to CUDA_VISIBLE_DEVICES for new job submissions. For a node containing 4 GPUs where 0 and 1 are idle, and 2 and 3 are running jobs:

$ ./gpu-info
===================================================
Device Model Temperature Utilization
===================================================
0 Tesla C2070 83 C 0 %
1 Tesla C2070 86 C 0 %
2 Tesla C2070 88 C 99 %
3 Tesla C2070 87 C 99 %
===================================================

#!/bin/bash

# gpu-info
# Written by Jodi Hadden, University of Georgia, 2011. Updated 2013.
# List GPUs by deviceQuery ID with corresponding temperature and utilization from NVIDIA-SMI 
# Works for NVIDIA-SMI 3.295.41 Driver Version: 295.41 for Tesla series cards

# Check for deviceQuery
if ! type deviceQuery >/dev/null
then
        echo "Please add the location of deviceQuery to your PATH!"
        exit
fi

# Make all GPUs temporarily visible (device IDs run from 0 to num_gpus-1)
num_gpus=`nvidia-smi -a | grep "Attached GPUs" | cut -b35`
export CUDA_VISIBLE_DEVICES=`seq -s , 0 $((num_gpus - 1))`

# Obtain PCI bus IDs that correspond to deviceQuery IDs, converting decimal to hex for use with NVIDIA-SMI
pci_bus_array=( $( ( echo "obase=16" ; deviceQuery -noprompt | grep "Device PCI Bus ID / PCI location ID" | cut -b50-60 | sed "s/\/ 0//" ) | bc ) )
rm -f deviceQuery.txt SdkMasterLog.csv

# Print a neat table
echo "==================================================="
echo -e "Device\tModel\t\tTemperature\tUtilization"
echo "==================================================="

# For each deviceQuery ID, use corresponding PCI bus ID to print temperature and utilization from NVIDIA-SMI 
gpu_id=0
while [ $gpu_id -lt ${#pci_bus_array[*]} ]
do

        # Get and print information from NVIDIA-SMI
        model=`nvidia-smi -a -i 0:${pci_bus_array[$gpu_id]}:0 | grep "Product Name" | cut -b35-45`
        temperature=`nvidia-smi -a -i 0:${pci_bus_array[$gpu_id]}:0 | grep -A1 "Temperature" | grep "Gpu" | cut -b35-45`
        utilization=`nvidia-smi -a -i 0:${pci_bus_array[$gpu_id]}:0 | grep -A1 "Utilization" | grep "Gpu" | cut -b35-45`

        echo -e "$gpu_id\t$model\t$temperature\t\t$utilization"

        let gpu_id=$gpu_id+1

done

echo "==================================================="

Example 2: gpu-free
This script prints a list of deviceQuery IDs corresponding to all currently unutilized GPUs (those not running jobs) in a node. The output can be assigned directly to CUDA_VISIBLE_DEVICES. For a node containing 4 GPUs where all 4 are idle:

$ ./gpu-free
0,1,2,3

For a node containing 4 GPUs where 0 and 1 are idle, and 2 and 3 are running jobs:

$ ./gpu-free
0,1

A generic job submission script could include setting CUDA_VISIBLE_DEVICES to the output of gpu-free as described below in order to facilitate automatic assignment of an available GPU to a given job. Approximately 10 seconds should be allotted between each subsequent job submission (call of the script) to allow the GPU utilizations adequate time to update.

$ export CUDA_VISIBLE_DEVICES=`gpu-free`

Reliable use of this method requires setting GPUs to persistence and exclusive process modes upon each reboot as described above.

#!/bin/bash

# gpu-free
# Written by Jodi Hadden, University of Georgia, 2011. Updated 2013.
# Print free GPU deviceQuery IDs to export as visible with CUDA_VISIBLE_DEVICES
# Works for NVIDIA-SMI 3.295.41 Driver Version: 295.41 for Tesla series cards

# Check for deviceQuery
if ! type deviceQuery >/dev/null
then
        echo "Please add the location of deviceQuery to your PATH!"
        exit
fi

# Make all GPUs temporarily visible (device IDs run from 0 to num_gpus-1)
num_gpus=`nvidia-smi -a | grep "Attached GPUs" | cut -b35`
export CUDA_VISIBLE_DEVICES=`seq -s , 0 $((num_gpus - 1))`

# Obtain PCI bus IDs that correspond to deviceQuery IDs, converting decimal to hex for use with NVIDIA-SMI
pci_bus_array=( $( ( echo "obase=16" ; deviceQuery -noprompt | grep "Device PCI Bus ID / PCI location ID" | cut -b50-60 | sed "s/\/ 0//" ) | bc ) )
rm -f deviceQuery.txt SdkMasterLog.csv

# For each PCI bus ID, check to see if that GPU is being utilized
gpu_id=0
while [ $gpu_id -lt ${#pci_bus_array[*]} ]
do

        # Get utilization from NVIDIA-SMI
        utilization=`nvidia-smi -a -i 0:${pci_bus_array[$gpu_id]}:0 | grep -A1 "Utilization" | grep "Gpu" | cut -b35-36`

        # If a GPU is not being utilized, add its deviceQuery ID to the free GPU array
        # Note: GPUs can show 1% utilization if NVIDIA-SMI is running in the background, so used -le 1 instead of -eq 0 here
        if [ $utilization -le 1 ]
        then
                free_gpu_array[$gpu_id]=$gpu_id
        fi

        let gpu_id=$gpu_id+1

done

# Print free GPUs to export as visible
free_gpus=`echo ${free_gpu_array[*]} | sed "s/ /,/g"`
echo $free_gpus
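
The scripts above depend on fixed column positions in the nvidia-smi -a output for one driver version. As a hedged alternative sketch, not a replacement tested by the authors, the helpers below work on the CSV output of nvidia-smi --query-gpu, which newer drivers provide. The function names are hypothetical. Note that nvidia-smi indices follow PCI bus order, so this assumes the CUDA runtime enumerates devices the same way (for example with CUDA_DEVICE_ORDER=PCI_BUS_ID on CUDA versions that support it).

```shell
#!/bin/sh
# Read "index, utilization" lines (percent, no units) on stdin and print a
# comma-separated list of the indices whose utilization is 0 or 1.
# Non-numeric values such as "N/A" are treated as unknown and skipped.
free_gpus_from_csv() {
    awk -F', *' '
        $2 ~ /^[0-9]+$/ && $2 <= 1 { ids = ids (ids == "" ? "" : ",") $1 }
        END { print ids }'
}

# Print the first ID of a comma-separated list such as "0,1".
# Prints an empty line if the list is empty.
first_gpu() {
    echo "${1%%,*}"
}

# On a live node (not run here):
#   free=$(nvidia-smi --query-gpu=index,utilization.gpu \
#                     --format=csv,noheader,nounits | free_gpus_from_csv)
#   gpu=$(first_gpu "$free")
#   [ -n "$gpu" ] && export CUDA_VISIBLE_DEVICES=$gpu
```

Exposing only the first free ID is one way to pin a single GPU job to one device; passing the whole list works equally well when the GPUs are in exclusive process mode.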

Multi GPU

When running in parallel across multiple GPUs and/or nodes the selection of which GPU to run on becomes more complicated. Ideally you would have a batch scheduling system that sets everything up for you correctly; however, the following instructions provide a way to control this yourself. To understand how to control which GPUs are used it is first necessary to understand the GPU scheduling algorithm used by pmemd.cuda.MPI and where to look in the mdout file to verify which GPUs are actually being used.

When running in parallel, pmemd.cuda.MPI keeps track of the GPU IDs already assigned on each node and will not reuse a GPU unless there are too few available for the number of MPI threads specified. Currently the assignment is 1 GPU per MPI thread, so if you want to run on 4 GPUs you would run with 4 MPI threads. It is your responsibility to ensure, when running across multiple nodes, that threads are handed out sequentially to alternate nodes. For example, if you had 4 dual quad core nodes, each with 1 GPU, it is essential that your MPI nodefile be laid out such that 'mpirun -np 4' gives you 1 thread on each of the 4 nodes and NOT 4 threads on the first node. This is normally accomplished by listing each node once in the nodefile. E.g. if you had 2 GPUs per node and 2 nodes then, to run across 4 GPUs, your nodefile would typically look like this:

node0
node1

You would then run the code with (note the exact commands used will depend on your MPI installation):

mpirun --machinefile=nodefile -np 4 $AMBERHOME/bin/pmemd.cuda.MPI \
-O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

This would allocate GPUs to threads and nodes as follows:
 

ThreadID  NodeID  GPUID
0         0       0
1         1       0
2         0       1
3         1       1
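
To check a planned layout before submitting, the round-robin pattern in this table can be reproduced with a few lines of shell. This is only an illustration of the expected mapping when each node is listed once in the nodefile; it is not the code pmemd.cuda.MPI itself uses, and the function name thread_layout is hypothetical.

```shell
#!/bin/sh
# Print "ThreadID NodeID GPUID" for np MPI threads dealt out round-robin
# over nnodes nodes, with GPUs on each node used in increasing ID order.
# Arguments: $1 = number of MPI threads, $2 = number of nodes.
thread_layout() {
    awk -v np="$1" -v nn="$2" 'BEGIN {
        for (t = 0; t < np; t++)
            printf "%d %d %d\n", t, t % nn, int(t / nn)
    }'
}

# Example: 4 threads over 2 nodes.
thread_layout 4 2
```

If the GPUID column would exceed the number of GPUs physically present in a node, the run would need more nodes or fewer threads.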

You can check which GPUs are in use on which nodes by looking at the GPU Information section at the beginning of the mdout file. For example, here is what the output looks like for a run on 4 GPUs with 2 x M2050 GPUs per node:

|------------------- GPU DEVICE INFO --------------------
|
|                         Task ID:      0
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      1
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      2
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      1
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      3
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      1
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|--------------------------------------------------------

As you can see here, the first MPI thread (Task ID 0) used GPUID 0, as did the second (Task ID 1), but this is because the second thread was running on a different node. GPUID 1 was then used for the third and fourth MPI threads (Task IDs 2 and 3), which again were running on different nodes.
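
For long mdout files it can be quicker to pull out just the task to device mapping. The following sketch assumes the GPU DEVICE INFO section has exactly the layout shown above; the function name gpu_assignments is hypothetical.

```shell
#!/bin/sh
# Print one "task device" pair per MPI task found in the GPU DEVICE INFO
# section of the mdout file given as $1.
gpu_assignments() {
    awk -F': *' '
        /Task ID:/               { task = $2 + 0 }
        /CUDA Device ID in use:/ { print task, $2 + 0 }
    ' "$1"
}

# Hypothetical usage:
#   gpu_assignments mdout
```

The output lists tasks in the order they appear in the file, so it can be compared directly against the layout you intended.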

Hence, if your nodes are homogeneous, with exactly the same GPUs on all nodes, running in parallel is simply a case of running the same number of MPI threads as you want GPUs and ensuring that the threads are correctly distributed to each node. The selection of GPUs to use will be automatic. The same is true if you want to run on a single node. Suppose you had 4 GPUs, all the same, on a single node. Then you could just run with mpirun -np 4 and all 4 GPUs would be used.

What if I have a more complex setup or want more control over which GPUs are used?

In this case you should use the CUDA_VISIBLE_DEVICES environment variable as described above. For example, if you had 2 nodes, each containing a C1060, 2 C2050s and an FX3800, and you wanted to run across the 4 C2050s (assuming you have a decent interconnect such as Gen 2 QDR IB between the nodes), you would proceed as follows:

First obtain the native GPU IDs (I am assuming the device IDs are the same on each node, if they are not then you will just need a more complex script for setting the environment variables):

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Tesla C2050"
Device 3: "Quadro FX 3800"

In this case you would use the following command to limit the visible GPUs to device ID 0 and 2 - which will be re-enumerated as 0 and 1:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C2050"
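
When a node holds many cards, the list of IDs for one model can be built from the deviceQuery output rather than typed by hand. This is a minimal sketch that assumes the 'Device N: "Name"' line format shown above; the function name ids_for_model is hypothetical.

```shell
#!/bin/sh
# Read deviceQuery lines such as:   Device 0: "Tesla C2050"
# on stdin and print a comma-separated list of the IDs whose name is
# exactly equal to the model given as $1.
ids_for_model() {
    awk -v model="$1" -F'"' '
        /^Device [0-9]+:/ && $2 == model {
            split($1, a, /[ :]/)      # a[2] holds the numeric device ID
            ids = ids (ids == "" ? "" : ",") a[2]
        }
        END { print ids }'
}

# On a live node, with CUDA_VISIBLE_DEVICES unset (not run here):
#   export CUDA_VISIBLE_DEVICES=$(./deviceQuery -noprompt | ids_for_model "Tesla C2050")
```

Matching on the exact model string avoids accidentally selecting a different card whose name merely contains the same text.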

If you were running a 2 GPU job on just the one node then at this point you could just run with 'mpirun -np 2'. However, if you want to run across the 2 nodes you need to ensure the environment variables are propagated to both nodes. One simple option is to edit your login scripts, such as .bashrc or .cshrc and set the environment variable there so when your mpirun command connects to each node with SSH the environment variable is automatically set. This, however, can be tedious. The more generic approach if you are NOT using a queuing system which automatically exports environment variables to all nodes is to pass it as part of your MPI run command. The method for doing this will vary depending on your MPI installation and you should check the documentation for your MPI to see how to do this ('mpirun --help' can often provide you with the information needed). The following example is for mvapich2 v1.5:

mpiexec -np 4 -genvall $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout \
-p prmtop -c inpcrd -r restrt -x mdcrd -inf mdinfo

Again you can check the mdout file to make sure the GPUs you expect to be used actually are.

Note when running in parallel the PCI bus (and IB interconnect) tends to be saturated. Hence you can run additional single GPU jobs on the remaining GPUs not being used for a parallel run, as these only touch the PCI bus for I/O but you can't run additional multi-GPU jobs on the same nodes as they will compete for bandwidth. This means that if you have 4 GPUs in a single node you can run either 4 single GPU jobs; 2 single GPU jobs and a 2GPU job; or a 4GPU job. You CANNOT run 2 x 2GPU jobs.
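
The packing rule above can be written down as a small check for use in a submission script. This sketch simply encodes the rule as stated (the jobs must fit in the node, and at most one of them may be a multi-GPU job); the function name layout_ok is hypothetical.

```shell
#!/bin/sh
# Arguments: $1 = number of GPUs in the node, remaining arguments = number
# of GPUs requested by each job planned for that node.
# Succeeds if the jobs fit and at most one of them uses more than one GPU.
layout_ok() {
    ngpus=$1
    shift
    total=0
    multi=0
    for n in "$@"; do
        total=$((total + n))
        if [ "$n" -gt 1 ]; then
            multi=$((multi + 1))
        fi
    done
    [ "$total" -le "$ngpus" ] && [ "$multi" -le 1 ]
}

# Examples for a 4 GPU node:
#   layout_ok 4 1 1 1 1   succeeds (4 single GPU jobs)
#   layout_ok 4 1 1 2     succeeds (2 single GPU jobs and a 2GPU job)
#   layout_ok 4 2 2       fails    (2 x 2GPU jobs)
```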



Considerations for Maximizing GPU Performance

There are a number of considerations above and beyond those typically used on a CPU for maximizing the performance achievable for a GPU accelerated PMEMD simulation. The following provides some tips for ensuring good performance.

  1. Avoid using small values of NTPR, NTWX, NTWV, NTWE and NTWR. Writing to the output, restart and trajectory files too often can hurt performance even on CPU runs; however, this is more acute for GPU accelerated simulations because there is a substantial cost in copying data to and from the GPU. Performance is maximized when CPU to GPU memory synchronizations are minimized. This is achieved by computing as much as possible on the GPU and only copying back to CPU memory when absolutely necessary. There is an additional overhead in that performance is boosted by only calculating the energies when absolutely necessary, hence setting NTPR or NTWE to low values will result in excessive energy calculations. You should not set any of these values to less than 100 (except 0 to disable them) and ideally use values of 500 or more. For NTWR, values above 100000 are ideal.
     
  2. Avoid setting ntave to a nonzero value. Turning on the printing of running averages means the code needs to calculate both energies and forces on every step. This can lead to a performance loss of 8% or more when running on the GPU. This can also affect performance on CPU runs, although the difference is not as marked. Similar arguments apply to setting ene_avg_sampling to small values.
     
  3. Avoid using the NPT ensemble (NTB=2) when it is not required. Performance will generally be NVE~NVT>NPT. For explicit solvent simulations it is always necessary to run at least some NPT in order to allow the density to equilibrate, but once this is done one can typically switch back to NVT for production.
     
  4. Avoid the use of GBSA in implicit solvent GB simulations unless required. The GBSA term is calculated on the CPU and thus requires a synchronization between GPU and CPU memory on every MD step as opposed to every NTPR or NTWX steps when running without this option.
     
  5. Use the Berendsen Thermostat (ntt=1) or Andersen Thermostat (ntt=2) instead of the Langevin Thermostat (ntt=3). Langevin simulations need to generate very large quantities of random numbers, which slows performance slightly.
     
  6. Do not assume that for small systems the GPU will always be faster. Typically for GB simulations of less than 150 atoms and PME simulations of less than 9,000 atoms it is not uncommon for the CPU version of the code to outperform the GPU version on a single node. Typically the performance differential between GPU and CPU runs will increase as atom count increases. Additionally the larger the non-bond cutoff used the better the GPU to CPU performance gain will be.
     
  7. When running in parallel across multiple GPUs you should NOT attempt to share nodes, and thus interconnects, between jobs. In particular, you should avoid running 2 separate multi-node MPI jobs that share the same nodes. For example, if you have 2 nodes, each with a QDR IB card, 1 C2050 and 1 C1060, you will likely get very poor performance if you attempt to run a dual GPU job on the 2 C2050s and a second dual GPU job on the 2 C1060s. It is also not advisable to mix GPU models when running in parallel. In this situation you are advised to physically place both C2050s in one node and both C1060s in the other. You could of course run a dual C2050 job across the two nodes and then a single GPU job on each of the C1060s.
     
  8. When running in parallel, for maximum performance you should use the latest interconnect technology. At the time of writing this is Gen 2 QDR IB. You should also make sure it is set up for GPU Direct support (Mellanox IB cards), which will typically give between a 10 and 25% performance improvement depending on the specifics of the calculation being run. Ideally you should also make sure that ALL GPU cards AND InfiniBand cards are in full x16 slots. This means that they are electrically wired as x16 slots, and not just physically x16 while in reality being split across multiple slots, making them effectively x8 or x4 slots. (Beware of junk™ like the Dell 8 way break out boxes.)
     
  9. Turn off ECC (C2050 and later). ECC can cost you up to 10% in performance and hurts parallel scaling. Before attempting this you should verify that your GPUs are working correctly and, for example, are not giving ECC errors. You can turn ECC off on Fermi based cards and later by running the following command for each GPU ID as root, followed by a reboot:

    nvidia-smi -g 0 --ecc-config=0   (repeat with -g x for each GPU ID)

    Extensive testing of AMBER on a wide range of hardware has established that ECC has little to no benefit on the reliability of AMBER simulations. This is part of the reason it is acceptable (see recommended hardware) to use the GeForce gaming cards for AMBER simulations.
     
  10. Turn on boost clocks if supported. Newer GPUs from NVIDIA, such as the K40, support boost clocks which allow the clock speed to be increased if there is power and temperature headroom. This must be turned on to obtain optimum performance with AMBER. If you have a K40 or newer GPU supporting boost clocks then run the following:

    sudo nvidia-smi -i 0 -ac 3004,875       which puts device 0 into the highest boost state.

    To return to normal do: sudo nvidia-smi -rac

    To enable this setting without being root do: nvidia-smi -acp 0
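
The first two tips in this list can be checked mechanically before a run is submitted. The following is a rough sketch only: it assumes a simple mdin in which settings are written as comma-separated name=value pairs, it does not understand every form a Fortran namelist can take, and the function name check_mdin is hypothetical.

```shell
#!/bin/sh
# Print a warning for each of ntpr, ntwx, ntwv, ntwe and ntwr that is set to
# a value between 1 and 99 in the mdin file given as $1, and for a nonzero
# ntave. Matching is case-insensitive; 0 and negative values are ignored.
check_mdin() {
    tr 'A-Z' 'a-z' < "$1" | tr ',' '\n' |
    awk -F'=' '
        { gsub(/[ \t]/, "") }     # strip blanks; awk re-splits the fields
        $1 ~ /^(ntpr|ntwx|ntwv|ntwe|ntwr)$/ && $2 + 0 > 0 && $2 + 0 < 100 {
            print "warning: " $1 "=" $2 " is below 100"
        }
        $1 == "ntave" && $2 + 0 != 0 {
            print "warning: ntave=" $2 " enables running averages"
        }'
}

# Hypothetical usage:
#   check_mdin mdin
```

A clean run prints nothing, which makes the function easy to call from a job script that should stop or warn when a setting looks expensive.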

 



Recommended Hardware

For examples of hardware that has been optimized for running GPU AMBER simulations, in terms of both price and performance please see the following page.



Benchmarks

Benchmarks are available on the following page.
