AMBER 11 NVIDIA GPU
ACCELERATION SUPPORT



Major Update Released 18 Aug 2011 (v2.2 [Amber 11 Bugfix.17])
Doubles Performance for PME Calculations on GPUs.

NVIDIA and Partners Announce MD SimCluster Test Drive
Program for AMBER


The little AMBER engine that could - NVIDIA CEO highlights AMBER GPU port at GTC 2010

Multiple GPUs now supported over MPI with latest patch (bugfix.9)



Video extract from the NVIDIA GTC 2010 Conference in San Jose, CA. The Walker Molecular Dynamics Lab's work, funded under NSF1047875, on accelerating the AMBER Molecular Dynamics package with GPUs is featured in the keynote speech by the CEO of NVIDIA. (The video, 47 Mbyte, is also available for download.)

 

Background

This page provides background on running AMBER (PMEMD) with NVIDIA GPU acceleration.

One of the new features of PMEMD 11 is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. While this GPU acceleration is considered to be production ready it is still very new and thus has not been tested anywhere near as extensively as the CPU code has over the years. Therefore users should still exercise caution when using this code. The error checking is not as verbose in the GPU code as it is on the CPU. If you encounter problems during a simulation on the GPU you should first try to run the identical simulation on the CPU to ensure that it is not your simulation setup which is causing problems. Feedback and questions should be posted to the Amber mailing list.

Authorship & Support

NVIDIA CUDA Implementation:

Scott Le Grand (NVIDIA)
Duncan Poole (NVIDIA)
Ross C. Walker (SDSC)

Further information relating to the specifics of the implementation, methods used to achieve performance while controlling accuracy and details of validation will be available shortly from the following publications:

  • Andreas W. Goetz; Mark J. Williamson; Dong Xu; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542–1555, DOI: 10.1021/ct200909j

  • Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan Poole; Scott Le Grand & Ross C. Walker* "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald" (in preparation), 2012

Funding for this work has been graciously provided by NVIDIA, The University of California (UC Lab 09-LR-06-117792), The National Science Foundation's (NSF) TeraGrid Advanced User Support Program through the San Diego Supercomputer Center, and an NSF SI2-SSE grant to Ross Walker (NSF1047875) and Adrian Roitberg (NSF1047919).

If you make use of this GPU support in your work please cite both the usual Amber 11 citation and the above manuscripts.



Supported Features

The GPU accelerated version of PMEMD 11 supports both explicit solvent PME simulations in all three canonical ensembles (NVE, NVT and NPT) and implicit solvent Generalized Born simulations. It has been designed to support as many of the standard PMEMD v11 features as possible; however, there are some current limitations that are detailed below. Some of these may be addressed in the near future, and patches released, with the most up to date list posted on the web page. The following options are NOT supported (as of the August 2011 patch [v2.2]):

1) ibelly /= 0 Simulations using belly style constraints are not supported.
2) if (igb/=0 & cut<systemsize) GPU accelerated implicit solvent GB simulations do not support a cutoff.
3) nmropt /= 0 There is currently no support for the various nmropt modes.
4) igb /= 0,1,2,5 Only igb models 1,2, and 5 are supported.
5) nrespa /= 1 No multiple time stepping is supported.
6) vlimit /= -1 For performance reasons the vlimit function is not implemented on GPUs.
7) numextra > 0 (in parallel) Extra points are only supported in the serial GPU code at present.
8) es_cutoff /= vdw_cutoff Independent cutoffs for electrostatics and van der Waals are not supported on GPUs.
9) order > 4 PME interpolation orders of greater than 4 are not supported at present.
10) imin=1 (in parallel) Minimization is only supported in the serial GPU code.

Additionally there are some minor differences in the output format. For example the Ewald error estimate is NOT calculated when running on a GPU. It is recommended that you first run a short simulation using the CPU code to check the Ewald error estimate is reasonable and that your system is stable. The above limitations are tested for in the code, however, it is possible that there are additional simulation features that have not been implemented or tested on GPUs.



Supported GPUs

GPU accelerated PMEMD has been implemented using CUDA and thus will only run on NVIDIA GPUs at present. Due to accuracy concerns with pure single precision the code makes use of double precision in several places. This places the requirement that the GPU hardware supports double precision meaning only GPUs with hardware revision 1.3 or 2.0 and later can be used. At the time of Amber 11's release this comprises the following NVIDIA cards (* = untested):

Hardware Version 2.0
Hardware Version 1.3
  • Tesla C1060/S1070
  • Quadro FX4800*/5800*
  • GTX295/285/280*/275*/260*/250*/240*/220*/210*
Caution: There are a number of cards termed GTS (instead of GTX) which are cut down models, often OEM versions. A common example is the GTS250. These budget cards will NOT work since they do not support hardware double precision.

Note: There are currently known issues with GTX465, GTX470, GTX480, GTX560, GTX570, GTX580 and GTX590 cards running in parallel, where they hang after unspecified periods of time. Issues in serial should be addressed as of the latest bugfix.

Due to the larger graphics memories offered by the Tesla series GPUs these are the recommended models. The Tesla models also support GPU Direct, which can help with scaling between nodes. Additionally you should ensure that all GPUs on which you plan to run PMEMD are connected to PCI-E 2.0 x16 slots or better, especially when running in parallel across multiple GPUs. If this is not the case then you will likely see degraded performance, although this effect is lessened in serial if you write to the mdout or mdcrd files infrequently. As of Oct 2010 (patch level bugfix.9) multiple GPUs can be used. Best performance is seen using QDR InfiniBand and 1 GPU per node, for which, depending on system size and simulation parameters, scaling can be seen to approximately 8 to 16 GPUs. Scaling over multiple GPUs within a single node is also possible if all are in x16 slots. It is also possible to run multiple single GPU runs on a single node, and experience to date shows that even with up to 8 GPUs per node the performance impact is minimal. Details of how to run MPI and multiple single GPU runs per node are given below.



System Size Limits

In order to obtain the extensive speedups that we see with GPUs it is critical that all of the calculation takes place on the GPU, within the GPU memory. This avoids the performance hit of copying to and from the GPU and allows us to achieve extensive speedups for realistic size systems, without needing systems of millions of atoms, for which achievable sampling lengths are unrealistic, to show reasonable speedups. Unfortunately it also means that the entire calculation must fit within the GPU memory. Additionally we make use of a number of scratch arrays to achieve high performance, so the GPU memory usage can actually be higher than that of a typical CPU run. Furthermore, due to the way we had to initially implement parallel GPU support, the memory usage per GPU does NOT decrease as you increase the number of GPUs. This is something we hope to fix in the future, but for the moment the atom count limit imposed by the GPU memory is roughly constant whether you run in serial or in parallel.

Unlike with CPUs, it is not possible to add more memory to a GPU (without replacing it entirely), and there is no concept of swap as there is on the CPU. The size of the GPU memory therefore imposes hard limits on the number of atoms supported in a simulation. Early on within the mdout file you will find information on the GPU being used and an estimate of the amount of GPU and CPU memory required:

|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla C2070
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|--------------------------------------------------------

...

| GPU memory information:
| KB of GPU memory in use: 4638979
| KB of CPU memory in use: 790531

The reported GPU memory usage is likely an underestimate and meant for guidance only to give you an idea of how close you are to the GPU's memory limit. Just because it is less than the available Device Global Mem Size does not necessarily mean that it will run. You should also be aware that the GPU's available memory is reduced by 1/9th if you have ECC turned on.
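
As a rough illustration of how the two numbers above can be compared, the following sketch reads the device size and the memory estimate from an mdout file and reports how close the estimate is to the limit. The function name gpu_mem_check and its optional "ecc" argument are hypothetical and not part of Amber; the sketch simply assumes the mdout line formats shown above and applies the 1/9th ECC reduction mentioned in the text.

```shell
# Sketch only: gpu_mem_check is a hypothetical helper, not an Amber tool.
# Usage: gpu_mem_check <mdout> [ecc]    (pass "ecc" if ECC is turned on)
gpu_mem_check() {
    awk -v ecc="${2:-off}" '
        # "| CUDA Device Global Mem Size: 6143 MB" -> value is the second-to-last field
        /CUDA Device Global Mem Size:/ { total_mb = $(NF-1) }
        # "| KB of GPU memory in use: 4638979" -> value is the last field
        /KB of GPU memory in use:/     { used_kb = $NF }
        END {
            if (total_mb == "" || used_kb == "") {
                print "no GPU info found"
                exit 1
            }
            avail_kb = total_mb * 1024
            if (ecc == "ecc") avail_kb = avail_kb * 8 / 9   # ECC removes 1/9th of the memory
            printf "estimated %d KB of %d KB available (%.0f%%)\n", used_kb, avail_kb, 100 * used_kb / avail_kb
        }' "$1"
}
```

Since the reported usage is likely an underestimate, a result close to 100% should be read as "probably will not fit" rather than "will just fit".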

Memory usage is affected by the run parameters, in particular the size of the cutoff (larger cutoffs need more memory) and the ensemble being used. Additionally the physical GPU hardware affects memory usage since the optimizations used are non-identical for different GPU types. Typically, for PME runs, memory usage runs:

NPT > NVT > NVE
NTT=3 > NTT=1 == NTT=2 > NTT=0

Use of restraints etc. will also increase the amount of memory in use, as will the density of your system: the higher the density, the more pairs per atom there are and thus the more GPU memory is required. The following table provides an UPPER BOUND on the number of atoms supported (as of Amber11.bugfix10) as a function of GPU model. These numbers were estimated using boxes of TIP3P water (PME) and solvent caps of TIP3P water (GB). These had lower than optimum densities and so you may find that for dense solvated proteins you are actually limited to around 20% less than the numbers here. Nevertheless these should provide reasonable estimates to work from:

All numbers are for SPDP precision and are approximate limits. The actual limits will depend on system density, simulations settings etc. These numbers are thus designed to serve as guidelines only.

Explicit Solvent (PME)

8 angstrom cutoff. Cubic box of TIP3P Water, NTT=0/3, NTB=0/1/2, NTP=0/1,NTF=2,NTC=2,DT=0.002.

CARD           GPU MEM   RUN TYPE   MAX ATOMS
GTX-295        895 MB    NVE          220,000
                         NVT          139,000
                         NPT          136,000

Tesla C1060    4.0 GB    NVE        1,070,000
                         NVT        1,035,000
                         NPT        1,020,000

Tesla C2050    3.0 GB    NVE        1,020,000
  (ECC off)              NVT          936,000
                         NPT          918,000

Tesla C2070    6.0 GB    NVE        2,080,000
  (ECC off)              NVT        1,875,000
                         NPT        1,860,000

Implicit Solvent (GB)

No cutoff, Sphere of TIP3P Water (for testing purposes only), NTT=0/3, NTB=0, NTF=2, NTC=2, DT=0.002, IGB=1

CARD           GPU MEM   RUN TYPE   MAX ATOMS
GTX-295        895 MB    NVE           20,500
                         NVT           19,200

Tesla C1060    4.0 GB    NVE           46,350
                         NVT           45,200

Tesla C2050    3.0 GB    NVE           39,250
  (ECC off)              NVT           38,100

Tesla C2070    6.0 GB    NVE           54,000
  (ECC off)              NVT           53,050
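
To turn a table entry into a working estimate for a dense solvated protein, the roughly 20% allowance mentioned above can be applied directly. The helper below is a minimal sketch of that arithmetic; the name dense_atom_limit is hypothetical, and the 20% figure is only the approximate guideline given in the text, not a hard rule.

```shell
# Sketch only: dense_atom_limit is a hypothetical helper.
# Usage: dense_atom_limit <max_atoms_from_table>   (digits only, no commas)
dense_atom_limit() {
    # Reduce the water-box upper bound by about 20% for dense systems.
    echo $(( $1 * 80 / 100 ))
}
```

For example, passing the Tesla C2050 NPT entry (918000) gives a working limit of 734400 atoms.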



Accuracy Considerations

The nature of current generation GPUs is such that single precision arithmetic is considerably faster (>8x for C1060 and >2x for C2050) than double precision arithmetic. This poses an issue when trying to obtain good performance from GPUs. Traditionally the CPU code in Amber has always used double precision throughout the calculation. While this full double precision approach has been implemented in the GPU code it gives very poor performance and so the default precision model used when running on GPUs is a combination of single and double precision, termed hybrid precision (SPDP), that is discussed in further detail in the references given above. This approach uses single precision for individual calculations within the simulation but double precision for all accumulations. It also uses double precision for shake calculations and for other parts of the code where loss of precision was deemed to be unacceptable. Tests have shown that energy conservation is equivalent to the full double precision code and specific ensemble properties, such as order parameters, match the full double precision CPU code. However, the user should be aware that such tests at the time of writing are not exhaustive and more in-depth validation is being conducted, the results of which will be reported in the literature. Previous acceleration approaches, such as the MDGRAPE accelerated sander, have used similar hybrid precision models and thus we believe that this is a reasonable compromise between accuracy and performance. The user should understand though that this approach leads to rapid divergence between GPU and CPU simulations, similar to that observed when running the CPU code across different processor counts in parallel but occurring much more rapidly. 
For this reason the GPU test cases are more sensitive to rounding differences caused by hardware and compiler variations and will likely require manual inspection of the test case diff files in order to verify that the installation is providing correct results.

While the default precision model is currently the hybrid SPDP model three different precision models have been implemented within the GPU code to facilitate advanced testing and comparison. The choice of default precision model may change in the future based on the outcome of detailed validation tests of the three different approaches. The precision models supported, and determined at compile time as described later, are:

  • SPSP - Use single precision for the entire calculation with the exception of SHAKE which is always done in double precision. This provides the highest performance and is probably the most directly comparable model to other GPU MD implementations. However, insufficient testing has been done to know if the use of single precision throughout the simulation (certainly NVE simulations are not possible) is acceptable and so for the moment this precision model should be used for testing and debugging purposes only.
     
  • SPDP - (Default) Use a combination of single precision for calculation and double precision for accumulation. This approach is believed to provide the optimum tradeoff between accuracy and performance and hence at the time of release is the default model invoked when using the executable pmemd.cuda / pmemd.cuda.MPI.
     
  • DPDP - Use double precision for the entire calculation. This provides for careful regression testing against the CPU code. It makes no additional approximations above and beyond the CPU implementation and would be the model of choice if performance was not a consideration. On v1.3 NVIDIA hardware (e.g. C1060) the performance is approximately a fifth that of the SPDP model while on v2.0 NVIDIA hardware (e.g. C2050) the performance is approximately half that of the SPDP model.



Installation and Testing

If you have not yet done so, visit this link to obtain CUDA support in Amber 11.

The single GPU version of PMEMD is called pmemd.cuda while the multi-GPU version is called pmemd.cuda.MPI. Both must be built separately from the standard serial and parallel installations. Before attempting to build the GPU versions of PMEMD you should have built and tested at least the serial version of Amber and preferably the parallel version as well. This will help ensure that basic issues relating to standard compilation on your hardware and operating system do not lead to confusion with GPU related compilation and testing problems. You should also be familiar with Amber's compilation and test procedures. The minimum requirements for building the GPU version of PMEMD are, as of the August 2011 release:

  • NVIDIA Toolkit v3.2 or later.
     
  • NVIDIA GPU supporting Hardware Revision 1.3 or 2.0 and later. (This excludes revision 1.5)
     
  • NVIDIA CUDA Driver v260.19.26 or later.
     
  • AMBERTools v1.5.

It is assumed that you have already correctly installed and tested CUDA support on your GPU. Before attempting to build pmemd.cuda you should first download and compile the NVIDIA CUDA SDK (available from http://www.nvidia.com/). Ensure that you can successfully build and run the deviceQuery program provided with this SDK since the output from this will be required if you are seeking help on the Amber mailing list. Additionally the environment variable CUDA_HOME should be set to point to your NVIDIA Toolkit installation and $CUDA_HOME/bin/ should be in your path.
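
The environment requirement in the last sentence can be checked before configuring. The following is a minimal sketch, assuming a POSIX shell; check_cuda_env is a hypothetical helper name and not part of the Amber or CUDA distributions.

```shell
# Sketch only: check_cuda_env is a hypothetical helper, not an Amber script.
# It verifies that CUDA_HOME is set and that $CUDA_HOME/bin is on the PATH.
check_cuda_env() {
    if [ -z "${CUDA_HOME:-}" ]; then
        echo "CUDA_HOME is not set"
        return 1
    fi
    # Strip one trailing slash so /opt/cuda and /opt/cuda/ are treated alike.
    bindir="${CUDA_HOME%/}/bin"
    case ":$PATH:" in
        *":$bindir:"*|*":$bindir/:"*)
            echo "ok: $bindir is in PATH" ;;
        *)
            echo "$bindir is not in PATH"
            return 1 ;;
    esac
}
```

This only checks the two variables named above; it does not confirm that the toolkit or driver versions meet the minimum requirements.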

Building and Testing the Default SPDP Precision Model

Single GPU (pmemd.cuda)

Assuming you have a working CUDA installation you can build the single GPU version, pmemd.cuda, using the default precision model as follows:

cd $AMBERHOME/AmberTools/src/
make clean
./configure -cuda intel    (or gnu)
cd ../../
./AT15_Amber.py
cd src/
make clean
make cuda

Next you can run the tests using the default GPU (the one with the largest memory) with:

cd $AMBERHOME/test/
./test_amber_cuda.sh

The majority of these tests should pass. However, given the parallel nature of GPUs, meaning the order of operation is not well defined, and the limited precision of the SPDP precision model it is not uncommon for there to be several possible failures. You may also see some tests, particularly the GB nucleosome test, fail on GPUs with limited memory. You should inspect the diff file created in the $AMBERHOME/test/logs/test_amber_cuda/ directory to manually verify any possible failures. Differences which occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or if you are unsure, should be posted to the Amber mailing list for comment.

Multiple GPU (pmemd.cuda.MPI) [bugfix.9 or later!]

Once you have built and tested the serial GPU version you can next build the parallel version. The instructions here assume that you have applied ALL the bugfixes for AMBER and that you can already successfully build the MPI version of the CPU code. If you cannot, then you should focus on solving this before you move on to attempting to build the parallel GPU version.

The parallel GPU version of the code works using MPI and requires support for MPI v2.0 or later. We recommend using MVAPICH2, Intel MPI or MPICH2. OpenMPI tends to give poor performance and may not support all MPI v2.0 features. Additionally if you plan on running over multiple nodes you should enable GPU Direct if your hardware supports it. Your MPI installation must also have been configured to support Fortran 90, C and C++, and all three compilers must be compatible. E.g. mixing of GNU C/C++ with Intel Fortran 90 is NOT supported.

You can build the multi-GPU code as follows:

cd $AMBERHOME/AmberTools/src/
make clean
./configure -cuda -mpi intel    (or gnu)
cd ../../
./AT15_Amber.py
cd src/
make clean
make cuda_parallel

Next you can run the tests using GPUs enumerated sequentially within a node (if you have multiple nodes or more complex GPU setups within a node then you should refer to the discussion below on running on multiple GPUs):

cd $AMBERHOME/test/
./test_amber_cuda_parallel.sh

Segfaults in Parallel: If you find that runs across multiple nodes (i.e. using the infiniband adapter) segfault almost immediately then this is most likely an issue with GPU Direct v2 (CUDA v4.0) not being properly supported by your hardware and driver installations. In most cases setting the following environment variable on all nodes (put it in your .bashrc) will fix the problem:

export CUDA_NIC_INTEROP=1

Building non-standard Precision Models

You can build different precision models as described below. However, be aware that this is meant largely as a debugging and testing feature and NOT for running production calculations. Please post any questions or comments you may have regarding this to the Amber mailing list. You should also be aware that the variation in test case results due to rounding differences will be markedly higher when testing the SPSP precision model. You select which precision model to compile as follows:

cd $AMBERHOME/AmberTools/src/
make clean
./configure -cuda_DPDP gnu (use -cuda_SPSP for the SPSP model)
cd ../../
./AT15_Amber.py
cd src/
make clean
make cuda

This will produce executables named pmemd.cuda_XXXX where XXXX is the precision model selected at configure time (SPSP or DPDP). You can then test this on the GPU with the most memory as follows:

cd $AMBERHOME/test/
./test_amber_cuda.sh -1 DPDP (to test the DPDP precision model)

Testing Alternative GPUs

Should you wish to run the serial GPU tests on a GPU different from the one with the most memory (and lowest GPU ID if more than one identical GPU exists) then you can provide this as the first argument to the test script. For example, to test the GPU with ID = 2 and the default SPDP precision model you would specify:

cd $AMBERHOME/test/
./test_amber_cuda.sh 2 SPDP

This would be the same as using the executable '$AMBERHOME/bin/pmemd.cuda' with the command line flag '-gpu 2' as described below.

Testing on different GPUs in parallel is slightly more complicated and it is recommended that you use the CUDA_VISIBLE_DEVICES environment variable, discussed below, to control which GPU is used for testing. This approach can also be used with single GPU runs.



Running GPU Accelerated Simulations

Single GPU

In order to run a single GPU accelerated MD simulation the only change required is to use the executable pmemd.cuda in place of pmemd. E.g.

$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd

This will automatically run the calculation on the GPU with the most memory, even if that GPU is already in use. If you have only a single CUDA capable GPU in your machine then this is fine. However, you may want to control which GPU is used: for example, you have a Tesla C2050 (3GB) and a Tesla C1060 (4GB) in the same machine and want to use the C2050, which has less memory, or you want to run multiple independent simulations using different GPUs. In that case you need to specify the GPU ID manually, using either the -gpu command line option or the CUDA_VISIBLE_DEVICES environment variable. Both approaches are described below:
 

-gpu Specifies which GPU should be used for running a single GPU accelerated PMEMD calculation. This is based on the hardware ID of the GPU card which can be obtained by running the deviceQuery command from the NVIDIA CUDA SDK. Valid values are from -1 to 32. A value of -1 (default) means that the GPU with the most memory should be used while values of 0 or greater select individual GPUs.

$AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd -gpu 1

In this way it is possible to make use of multiple GPUs in a single node, such as those provided by a Tesla S1070, for multiple simultaneous calculations.

An alternative approach, and the one recommended both for parallel runs and for running within a batch queuing system, is to use the CUDA_VISIBLE_DEVICES environment variable, which lists the devices that are visible as a comma-separated string. For example, if your desktop has two Tesla cards and a Quadro:

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Quadro FX 3800"

By setting the envar, you can make only a subset of them visible to the runtime:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"

Hence if you wanted to run two pmemd.cuda runs, with the first running on the C2050 and the second on the C1060 you would run as follows:

$ export CUDA_VISIBLE_DEVICES="0"
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

$ export CUDA_VISIBLE_DEVICES="1"
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

In this way you only ever expose a single GPU to the pmemd.cuda executable and so avoid issues with the running of multiple runs on the same GPU. This approach is the basis of how you can control GPU usage in parallel runs.
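
The same pattern extends to any number of GPUs. The sketch below only prints one command line per GPU ID so that it can be inspected before use; it does not launch anything. The function name print_gpu_runs and the .run_<id> file suffixes are hypothetical choices, added here so that simultaneous runs started from the same directory would not write to the same output files.

```shell
# Sketch only: print_gpu_runs is a hypothetical helper, not an Amber script.
# Usage: print_gpu_runs <gpu_id> [<gpu_id> ...]
# Prints (does not execute) one pmemd.cuda command per GPU ID, each seeing
# only its own GPU through CUDA_VISIBLE_DEVICES.
print_gpu_runs() {
    for id in "$@"; do
        printf 'CUDA_VISIBLE_DEVICES=%s nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout.run_%s -p prmtop -c inpcrd -r restrt.run_%s -x mdcrd.run_%s </dev/null &\n' \
            "$id" "$id" "$id" "$id"
    done
}
```

Calling print_gpu_runs 0 1 prints two command lines, one restricted to device 0 and one to device 1, which can then be pasted into a shell or a job script.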

Multi GPU

When running in parallel across multiple GPUs and/or nodes the selection of which GPU to run on becomes significantly more complicated. The -gpu command line option will NOT work in parallel; instead we recommend you use the CUDA_VISIBLE_DEVICES environment variable. Ideally you would have a batch scheduling system that sets everything up for you correctly; however, the following instructions can be followed to control this yourself. To understand how to control which GPUs are used it is first necessary to understand the GPU scheduling algorithm used by pmemd.cuda.MPI and where to look in the mdout file to verify which GPUs are actually being used.

When running in parallel pmemd.cuda.MPI keeps track of the GPU IDs already assigned on each node and will not reuse a GPU unless there are insufficient GPUs available for the number of MPI threads specified. Currently the assignment is 1 GPU per MPI thread, so if you want to run on 4 GPUs you would run with 4 MPI threads. It is your responsibility to ensure, when running across multiple nodes, that threads are handed out sequentially to alternate nodes. For example, if you had 4 dual quad core nodes, each with 1 GPU, it is essential that your MPI nodefile be laid out such that 'mpirun -np 4' will give you 1 thread on each of the 4 nodes and NOT 4 threads on the first node. This is normally accomplished by listing each node once in the nodefile. E.g. if you had 2 GPUs per node and 2 nodes then, to run across 4 GPUs, your nodefile would typically look like this:

node0
node1

You would then run the code with (note the exact commands used will depend on your MPI installation):

mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd </dev/null &

This would allocate GPUs to threads and nodes as follows:
 

ThreadID   NodeID   GPUID
0          0        0
1          1        0
2          0        1
3          1        1
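
The allocation can be modelled with a few lines of shell. This is only a sketch of the behaviour described above, not the scheduling code in pmemd.cuda.MPI: it assumes that MPI threads are dealt out round-robin to the nodes in the nodefile and that every node has enough GPUs, so the k-th thread placed on a node receives GPU k. The function name gpu_assignment is hypothetical.

```shell
# Sketch only: gpu_assignment is a hypothetical helper that models the
# assignment under a round-robin nodefile; it is not Amber code.
# Usage: gpu_assignment <number_of_mpi_threads> <number_of_nodes>
gpu_assignment() {
    awk -v n="$1" -v nodes="$2" 'BEGIN {
        print "ThreadID NodeID GPUID"
        for (t = 0; t < n; t++)
            # node: round-robin over nodes; GPU: how many threads this node already holds
            printf "%d %d %d\n", t, t % nodes, int(t / nodes)
    }'
}
```

Running gpu_assignment 4 2 models 4 MPI threads on 2 nodes, with threads alternating between the nodes and the second pair of threads moving on to GPU 1.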

You can check which GPUs are in use on which nodes by looking at the GPU Information section at the beginning of the mdout file. For example, here is what the output looks like for a run on 4 GPUs with 2 x M2050 GPUs per node:

|------------------- GPU DEVICE INFO --------------------
|
|                         Task ID:      0
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      1
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      2
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      1
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|
|                         Task ID:      3
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      1
|                CUDA Device Name: Tesla M2050
|     CUDA Device Global Mem Size:   3071 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|--------------------------------------------------------

As you can see here, the first MPI thread (Task ID 0) used GPUID 0, as did the second (Task ID 1), because the second thread was running on a different node. GPUID 1 was then used for the third and fourth MPI threads (Task IDs 2 and 3), which again were running on different nodes.

Hence if your nodes are homogeneous, with exactly the same GPUs on all nodes, running in parallel is simply a case of running the same number of MPI threads as you want GPUs and ensuring that the threads are correctly distributed to each node. The selection of GPUs to use will be automatic. The same is true if you want to run on a single node. Suppose you had 4 GPUs, all the same, on a single node. Then you could just run with mpirun -np 4 and all 4 GPUs would be used.

What if I have a more complex setup or want more control over which GPUs are used?

In this case you should use the CUDA_VISIBLE_DEVICES environment variable as described above. For example if you had 2 nodes, each with a C1060, 2 C2050's and a FX3800 in and you wanted to run across the 4 C2050's (assuming you have a decent interconnect such as Gen 2 QDR IB between the nodes) then you would proceed as follows:

First obtain the native GPU IDs (I am assuming the device IDs are the same on each node, if they are not then you will just need a more complex script for setting the environment variables):

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Tesla C2050"
Device 3: "Quadro FX 3800"

In this case you would use the following command to limit the visible GPUs to device ID 0 and 2 - which will be re-enumerated as 0 and 1:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C2050"

If you were running a 2 GPU job on just the one node then at this point you could just run with 'mpirun -np 2'. However, if you want to run across the 2 nodes you need to ensure the environment variables are propagated to both nodes. One simple option is to edit your login scripts, such as .bashrc or .cshrc, and set the environment variable there so that when your mpirun command connects to each node with SSH the environment variable is automatically set. This, however, can be tedious. The more generic approach, if you are NOT using a queuing system which automatically exports environment variables to all nodes, is to pass it as part of your MPI run command. The method for doing this will vary depending on your MPI installation and you should check the documentation for your MPI to see how to do this ('mpirun --help' can often provide you with the information needed). The following example is for mvapich2 v1.5:

mpiexec -np 4 -genvall $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout \
-p prmtop -c inpcrd -r restrt -x mdcrd -inf mdinfo

Again you can check the mdout file to make sure the GPUs you expect to be used actually are.

Note: In parallel when running PME calculations (ntb>0) a power of two GPUs (and thus MPI threads) must be used. For GB simulations any number of threads can be used as long as there are 25x more atoms than GPUs.
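
These two rules are easy to check before submitting a job. The following is a minimal sketch; check_gpu_count is a hypothetical helper name, and the GB rule is interpreted here as requiring at least 25 atoms per GPU, which is an assumption about the wording above rather than a documented formula.

```shell
# Sketch only: check_gpu_count is a hypothetical helper, not an Amber tool.
# Usage: check_gpu_count <pme|gb> <number_of_gpus> <number_of_atoms>
check_gpu_count() {
    mode=$1; ngpus=$2; natoms=$3
    if [ "$ngpus" -lt 1 ]; then
        echo "need at least one GPU"
        return 1
    fi
    if [ "$mode" = "pme" ]; then
        # Divide out factors of two; a power of two reduces to exactly 1.
        n=$ngpus
        while [ $((n % 2)) -eq 0 ]; do n=$((n / 2)); done
        if [ "$n" -ne 1 ]; then
            echo "PME needs a power of two GPUs"
            return 1
        fi
    else
        # Assumed reading of the GB rule: at least 25 atoms per GPU.
        if [ "$natoms" -lt $((25 * ngpus)) ]; then
            echo "GB needs at least 25 atoms per GPU"
            return 1
        fi
    fi
    echo "ok"
}
```

For instance, a PME run on 6 GPUs is rejected, while the same run on 4 or 8 GPUs passes the check.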



Considerations for Maximizing GPU Performance

There are a number of considerations above and beyond those typically used on a CPU for maximizing the performance achievable for a GPU accelerated PMEMD simulation. The following provides some tips for ensuring good performance.

  1. Avoid using small values of NTPR, NTWX, NTWV, NTWE and NTWR. Writing to the output, restart and trajectory files too often can hurt performance even on CPU runs, however, this is more acute for GPU accelerated simulations because there is a substantial cost in copying data to and from the GPU. Performance is maximized when CPU to GPU memory synchronizations are minimized. This is achieved by computing as much as possible on the GPU and only copying back to CPU memory when absolutely necessary. There is an additional overhead in that performance is boosted by only calculating the energies when absolutely necessary, hence setting NTPR or NTWE to low values will result in excessive energy calculations. You should not set any of these values to less than 100 (except 0 to disable them) and ideally use values of 500 or more. >10000 for NTWR is ideal.
     
  2. Avoid setting ntave to a nonzero value. Turning on the printing of running averages forces the code to calculate both energies and forces on every step, which can lead to a performance loss of 8% or more when running on the GPU. CPU runs are also affected, although the difference is not as marked. Similar arguments apply to setting ene_avg_sampling to small values.
     
  3. Do not use versions of the NVCC compiler prior to v3.2. The FFT performance has been greatly improved in v3.2.
     
  4. Avoid using the NPT ensemble (NTB=2) when it is not required. Performance will generally be NVE~NVT>NPT. For explicit solvent simulations it is always necessary to run at least some NPT in order to allow the density to equilibrate, but once this is done one can typically switch back to NVT or NVE for production.
     
  5. Use the Berendsen thermostat (ntt=1) or Andersen thermostat (ntt=2) instead of the Langevin thermostat (ntt=3). Langevin simulations require very large numbers of random numbers, which slows performance.
     
  6. Do not assume that the GPU will always be faster for small systems. For GB simulations of fewer than 150 atoms and PME simulations of fewer than 9,000 atoms it is not uncommon for the CPU version of the code to outperform the GPU version on a single node. The performance differential between GPU and CPU runs typically increases with atom count. Additionally, the larger the non-bond cutoff used the better the GPU to CPU performance gain will be.
     
  7. When running in parallel across multiple GPUs you should NOT attempt to share nodes, and thus interconnects, between jobs. In particular you should avoid running 2 separate MPI jobs that each span the same nodes. For example, if you have 2 nodes, each containing a QDR IB card, 1 C2050 and 1 C1060, you will likely get very poor performance if you attempt to run a dual GPU job on the 2 C2050s and a second dual GPU job on the 2 C1060s. It is also not advisable to mix GPU models when running in parallel. In this situation you are advised to physically place both C2050s in one node and both C1060s in the other. You could of course run a dual C2050 job across the two nodes and then 2 single GPU jobs, one on each of the C1060s. The performance impact of this has not been fully characterized, though, and your mileage may vary. The best advice we can offer is to try it and see.
     
  8. When running in parallel, use the latest interconnect technology for maximum performance. At the time of writing this is Gen 2 QDR IB. You should also make sure it is set up for GPU Direct support (Mellanox IB cards), which will typically give a performance improvement of between 10 and 25% depending on the specifics of the calculation being run. Ideally you should also make sure that ALL GPU cards AND InfiniBand cards are in full x16 slots, meaning slots that are electrically wired as x16 rather than slots that are physically x16 but share their lanes with other slots and so effectively run at x8 or x4.
     
  9. Turn off ECC (C2050 and later). ECC can cost you up to 10% in performance and hurts parallel scaling. Before attempting this you should verify that your GPUs are working correctly and, for example, are not reporting ECC errors. On Fermi based cards and later you can turn ECC off by running the following command as root for each GPU ID, followed by a reboot:

nvidia-smi -g 0 --ecc-config=0   (repeat with -g x for each GPU ID)
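
As an illustration of tips 1 and 2 above, the sketch below writes an example mdin with infrequent output and provides a rough check for settings that are likely to hurt GPU performance. Both function names are hypothetical, the simulation parameters in the example are placeholders rather than recommendations, and the check is a simple text scan rather than a real namelist parser.

```shell
# Both helpers below are hypothetical; they are not part of AMBER.

# Write an example mdin with GPU friendly output frequencies to file $1.
# The run parameters (nstlim, dt, cut, temp0, ...) are placeholder values.
write_example_mdin() {
    cat > "$1" <<'EOF'
Production NVT MD with GPU friendly output settings
 &cntrl
  imin=0, irest=1, ntx=5,
  nstlim=500000, dt=0.002,
  ntc=2, ntf=2, cut=8.0,
  ntb=1, ntt=1, tautp=10.0, temp0=300.0,
  ntpr=1000, ntwx=1000, ntwr=100000,
  ntave=0,
 /
EOF
}

# Scan mdin file $1 and warn about output flags set between 1 and 99
# or a nonzero ntave. Returns non-zero if any warning was printed.
check_mdin_output_freq() {
    # Normalise "name = value" to "name=value", put one token per line,
    # and lower-case everything so NTPR and ntpr are treated alike.
    sed 's/[[:space:]]*=[[:space:]]*/=/g' "$1" |
    tr ' \t,' '\n\n\n' |
    tr '[:upper:]' '[:lower:]' |
    awk -F= '
        BEGIN { bad = 0 }
        $1 ~ /^(ntpr|ntwx|ntwv|ntwe|ntwr)$/ && $2 > 0 && $2 < 100 {
            printf "warning: %s=%s is below 100\n", $1, $2
            bad = 1
        }
        $1 == "ntave" && $2 != 0 {
            printf "warning: ntave=%s turns on running averages\n", $2
            bad = 1
        }
        END { exit bad }
    '
}
```

Running 'check_mdin_output_freq mdin' before a long GPU run prints one warning per suspicious setting and prints nothing for a file such as the one produced by write_example_mdin.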



Recommended Hardware

The AMBER GPU implementation has been designed to work on a huge range of hardware. Essentially the only thing you need is an NVIDIA GPU supporting hardware revision 1.3, 2.0 or later, as detailed above. However, there are some considerations when it comes to maximizing performance both in serial and parallel.

Serial

In serial the performance for each independent AMBER GPU job is, assuming NTPR, NTWX etc. are large enough, mostly independent of the underlying motherboard chipset, the PCI-E bandwidth and the number of GPUs per node. At a minimum you need at least one free core per node. If you are building small desktops to run serial calculations then multiple GPUs per node will be the most cost effective. Ideally you should still try to keep the GPUs on x16 PCI-E slots and make sure your power supply is sufficient to power all the GPUs under full load.

Parallel

In parallel the main considerations become the available bandwidth within the node and between nodes. The ideal specification for performance is 1 GPU per node in an x16 slot and 1 QDR IB card per node in an x16 slot. This is obviously expensive, however, so the best price / performance tradeoff is probably to install 2 GPUs per node (both in x16) and 1 QDR IB card per node (also in x16). 2 QDR IB cards per node would be even better, except that a bug in the current GPU Direct implementation causes segfaults when two IB cards are used per node.

Building your own System

If you wish to build your own system for running GPU AMBER from parts then your main considerations are a suitable motherboard, a processor with at least 1 core per GPU and a power supply beefy enough to run everything. For a simple 2 GPU custom system I recommend the following (prices as of Sept 2011):

1 x Corsair Professional Series Gold High Performance 1200-Watt Power Supply CMPSU-1200AX $259.99

1 x Asus 90-MSVDAA-G0AAY00Z P8p67 Ws Revolution Rev 3.0 Lga1155 4d3 4pciex16 3pciex1 $245.99

2 x Seagate Barracuda Green 2TB SATA 6Gb/s 64MB Cache 3.5-Inch Internal Bare Drive ST2000DL003 $79.99 each

1 x Intel Core i7 Processor i7-2600 3.4GHz 8MB LGA1155 CPU BX80623I72600 $299.95

1 x Lian Li PC-P50 $269.98

1 x Corsair Memory Vengeance 16 Quad Channel Kit DDR3 1600 MHz (PC3 12800) 240-Pin DDR3 SDRAM CMZ16GX3M4A1600C9 $94.99

2 x EVGA GeForce GTX 580 3072 MB GDDR5 PCB PCI Express 2.0 2DVI/Mini-HDMI SLI Ready Limited Lifetime Warranty Graphics Card, 03G-P3-1584-AR $589.99 each

Total Price: $2510.86 (as of Sept 2011)

This hardware is what was used to generate the 1 x GTX580 and 2 x GTX580 benchmark numbers on the Main GPU Benchmark page.

Purchasing Professional Workstations and Cluster Nodes - MD SimCluster

If you are purchasing heavy duty workstations or installing clusters where you need high availability, reliability, warranty, service etc then you should consider purchasing the Tesla GPUs (which also have more memory and use less power). You should also consider getting a preconfigured and pre-tested system.

In order to simplify this process we have teamed up with NVIDIA and their Partners to design a series of systems referred to as MD SimClusters. These are professional configurations (suitable for research group, department or centralized computing services use) where the hardware has been chosen to offer the best tradeoff between price, performance, power use and reliability. These machines can be ordered preconfigured with AMBER, cluster management and queuing software. They can also be customized as needed.

GPU Computing Enabled Cluster Designed for GPU Accelerated AMBER
  • 4 Nodes x 8 Tesla M2090 GPUs with CUDA 4.0.
  • Dual 6-core CPUs + Infiniband QDR /node.
  • Preconfigured With AMBER 11 (AMBER license required).
  • No Setup necessary.
  • Buy cluster loaded with Linux OS and cluster management software.

 

At the time of writing these machines are available from Exxact Corporation and Microway.

The performance on these clusters should be approximately the same as that given for the M2090 (2 GPU per node) numbers on the main GPU benchmark page. However we hope to add vendor specific benchmarks shortly.

Free test drives are available on MD SimCluster machines so that you can test the performance of AMBER running your own systems. More information on the MD SimCluster Test Drive Program is available through NVIDIA (http://www.bit.ly/simcluster_amber).



Benchmarks

Benchmarks are available on the following page.
