AMBER 11 NVIDIA GPU
The little AMBER engine that could - NVIDIA CEO highlights AMBER GPU port at GTC 2010
Multiple GPUs now supported over MPI with latest patch (bugfix.9)
Video extract from the NVIDIA GTC 2010 Conference in San Jose, CA. The Walker Molecular Dynamics Lab's work, funded under NSF1047875, on accelerating the AMBER Molecular Dynamics package with GPUs is featured in the keynote speech by the CEO of NVIDIA.
Background

This page provides background on running AMBER (PMEMD) with NVIDIA GPU acceleration. One of the new features of PMEMD 11 is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. While this GPU acceleration is considered to be production ready, it is still very new and has not been tested anywhere near as extensively as the CPU code has over the years. Users should therefore still exercise caution when using this code. The error checking is not as verbose in the GPU code as it is on the CPU. If you encounter problems during a simulation on the GPU, you should first try to run the identical simulation on the CPU to ensure that it is not your simulation setup that is causing problems. Feedback and questions should be posted to the Amber mailing list.

Authorship & Support

NVIDIA CUDA Implementation:
Further information relating to the specifics of the implementation, methods used to achieve performance while controlling accuracy and details of validation will be available shortly from the following publications:
Funding for this work has been graciously provided by NVIDIA, The University of California (UC Lab 09-LR-06-117792), The National Science Foundation's (NSF) TeraGrid Advanced User Support Program through the San Diego Supercomputer Center and an NSF SI2-SSE grant to Ross Walker (NSF1047875) and Adrian Roitberg (NSF1047919).

If you make use of this GPU support in your work, please cite both the usual Amber 11 citation and the above manuscripts.
Supported Features

The GPU accelerated version of PMEMD 11 supports both explicit solvent PME simulations in all three canonical ensembles (NVE, NVT and NPT) and implicit solvent Generalized Born simulations. It has been designed to support as many of the standard PMEMD v11 features as possible; however, there are some current limitations that are detailed below. Some of these may be addressed in the near future, and patches released, with the most up to date list posted on the web page. The following options are NOT supported (as of the August 2011 patch [v2.2]):
Additionally there are some minor differences in the output format. For example, the Ewald error estimate is NOT calculated when running on a GPU. It is recommended that you first run a short simulation using the CPU code to check that the Ewald error estimate is reasonable and that your system is stable. The above limitations are tested for in the code; however, it is possible that there are additional simulation features that have not been implemented or tested on GPUs.
Supported GPUs

GPU accelerated PMEMD has been implemented using CUDA and thus will only run on NVIDIA GPUs at present. Due to accuracy concerns with pure single precision, the code makes use of double precision in several places. This places the requirement that the GPU hardware supports double precision, meaning only GPUs with hardware revision 1.3 or 2.0 and later can be used. At the time of Amber 11's release this comprises the following NVIDIA cards (* = untested):
(a) There are currently known issues with GTX465, GTX470, GTX480, GTX560, GTX570, GTX580 and GTX590 cards running in parallel, where they hang after unspecified periods of time. Issues in serial should be addressed as of the latest bugfix.

Due to the larger graphics memories offered by the Tesla series GPUs, these are the recommended models. The Tesla models also support GPU Direct, which can help with scaling between nodes. Additionally you should ensure that all GPUs on which you plan to run PMEMD are connected to PCI-E 2.0 x16 slots or better, especially when running in parallel across multiple GPUs. If this is not the case then you will likely see degraded performance, although this effect is lessened in serial if you write to the mdout or mdcrd files infrequently.

As of Oct 2010 (patch level bugfix.9) multiple GPUs can be used. Best performance is seen using QDR InfiniBand and 1 GPU per node, for which, depending on system size and simulation parameters, scaling can be seen to approximately 8 to 16 GPUs. Scaling over multiple GPUs within a single node is also possible if all are in x16 slots. It is also possible to run multiple single GPU runs on a single node, and experience to date shows that even with up to 8 GPUs per node the performance impact is minimal. Details of how to run MPI and multiple single GPU runs per node are given below.
System Size Limits

In order to obtain the extensive speedups that we see with GPUs it is critical that all of the calculation takes place on the GPU, within the GPU memory. This avoids the performance hit of copying to and from the GPU and allows us to achieve extensive speedups for realistic size systems. It also avoids the need to create systems with millions of atoms, for which achievable sampling lengths are unrealistic, just to show reasonable speedups. Unfortunately this means that the entire calculation must fit within the GPU memory. Additionally we make use of a number of scratch arrays to achieve high performance, so the GPU memory usage can actually be higher than that of a typical CPU run. It also means, due to the way we had to initially implement parallel GPU support, that the memory usage per GPU does NOT decrease as you increase the number of GPUs. This is something we hope to fix in the future, but for the moment the atom count limitations imposed on systems by the GPU memory are roughly constant whether you run in serial or in parallel.

Since, unlike with CPUs, it is not possible to add more memory to a GPU (without replacing it entirely), and there is no concept of swap as there is on the CPU, the size of the GPU memory imposes hard limits on the number of atoms supported in a simulation. Early on within the mdout file you will find information on the GPU being used and an estimate of the amount of GPU and CPU memory required:
The reported GPU memory usage is likely an underestimate and is meant for guidance only, to give you an idea of how close you are to the GPU's memory limit. Just because it is less than the available Device Global Mem Size does not necessarily mean that the simulation will run. You should also be aware that the GPU's available memory is reduced by 1/9th if you have ECC turned on.

Memory usage is affected by the run parameters, in particular the size of the cutoff (larger cutoffs need more memory) and the ensemble being used. Additionally the physical GPU hardware affects memory usage, since the optimizations used are not identical for different GPU types. Typically, for PME runs, memory usage runs:
Use of restraints etc. will also increase the amount of memory in use, as will the density of your system: the higher the density, the more pairs per atom there are and thus the more GPU memory is required.

The following table provides an UPPER BOUND on the number of atoms supported (as of Amber11.bugfix10) as a function of GPU model. These numbers were estimated using boxes of TIP3P water (PME) and solvent caps of TIP3P water (GB). These had lower than optimum densities, so for dense solvated proteins you may find you are actually limited to around 20% less than the numbers here. Nevertheless these should provide reasonable estimates to work from.

All numbers are for SPDP precision and are approximate limits. The actual limits will depend on system density, simulation settings etc. These numbers are thus designed to serve as guidelines only.

Explicit Solvent (PME) 8 angstrom cutoff. Cubic box of TIP3P Water, NTT=0/3, NTB=0/1/2, NTP=0/1, NTF=2, NTC=2, DT=0.002.
Implicit Solvent (GB) No cutoff, Sphere of TIP3P Water (for testing purposes only), NTT=0/3, NTB=0, NTF=2, NTC=2, DT=0.002, IGB=1
Accuracy Considerations

The nature of current generation GPUs is such that single precision arithmetic is considerably faster (>8x for C1060 and >2x for C2050) than double precision arithmetic. This poses an issue when trying to obtain good performance from GPUs. Traditionally the CPU code in Amber has always used double precision throughout the calculation. While this full double precision approach has been implemented in the GPU code, it gives very poor performance, so the default precision model used when running on GPUs is a combination of single and double precision, termed hybrid precision (SPDP), that is discussed in further detail in the references given above. This approach uses single precision for individual calculations within the simulation but double precision for all accumulations. It also uses double precision for shake calculations and for other parts of the code where loss of precision was deemed to be unacceptable.

Tests have shown that energy conservation is equivalent to the full double precision code and that specific ensemble properties, such as order parameters, match the full double precision CPU code. However, the user should be aware that such tests are, at the time of writing, not exhaustive and that more in-depth validation is being conducted, the results of which will be reported in the literature. Previous acceleration approaches, such as the MDGRAPE accelerated sander, have used similar hybrid precision models, and thus we believe that this is a reasonable compromise between accuracy and performance. The user should understand, though, that this approach leads to rapid divergence between GPU and CPU simulations, similar to that observed when running the CPU code across different processor counts in parallel but occurring much more rapidly.
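The effect of keeping accumulations in double precision can be illustrated with a small, self-contained Python sketch. This is not the PMEMD implementation; it only mimics the idea of single precision terms with either a single or a double precision running total. The helper names (to_single, accumulate) and the choice of 100,000 identical terms are illustrative assumptions.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(terms, single_accumulator: bool) -> float:
    """Sum terms, keeping the running total in single or double precision."""
    total = 0.0
    for t in terms:
        total += t
        if single_accumulator:
            total = to_single(total)  # discard the extra bits after every addition
    return total


# Every individual term is a single precision value, as in a hybrid model.
terms = [to_single(0.1)] * 100_000
reference = math.fsum(terms)  # correctly rounded double precision sum

err_single = abs(accumulate(terms, single_accumulator=True) - reference)
err_double = abs(accumulate(terms, single_accumulator=False) - reference)
print(f"single precision accumulator error: {err_single:.3e}")
print(f"double precision accumulator error: {err_double:.3e}")
```

The single precision accumulator drifts visibly from the reference while the double precision accumulator stays many orders of magnitude closer, even though the terms themselves are identical in both cases. Small rounding differences of this kind are also why results are sensitive to hardware, compiler and the order of operations.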
Because of the hybrid precision model, the GPU test cases are more sensitive to rounding differences caused by hardware and compiler variations, and will likely require manual inspection of the test case diff files in order to verify that the installation is providing correct results. While the default precision model is currently the hybrid SPDP model, three different precision models have been implemented within the GPU code to facilitate advanced testing and comparison. The choice of default precision model may change in the future based on the outcome of detailed validation tests of the three different approaches. The precision models supported, and determined at compile time as described later, are:
Installation and Testing

If you have not yet done so, visit this link to obtain CUDA support in Amber 11.

The single GPU version of PMEMD is called pmemd.cuda while the multi-GPU version is called pmemd.cuda.MPI. Both must be built separately from the standard serial and parallel installations. Before attempting to build the GPU versions of PMEMD you should have built and tested at least the serial version of Amber and preferably the parallel version as well. This will help ensure that basic issues relating to standard compilation on your hardware and operating system do not lead to confusion with GPU related compilation and testing problems. You should also be familiar with Amber's compilation and test procedures.

The minimum requirements for building the GPU version of PMEMD are, as of the August 2011 release:
It is assumed that you have already correctly installed and tested CUDA support on your GPU. Before attempting to build pmemd.cuda you should first download and compile the NVIDIA CUDA SDK (available from http://www.nvidia.com/). Ensure that you can successfully build and run the deviceQuery program provided with this SDK, since the output from this will be required if you are seeking help on the Amber mailing list. Additionally the environment variable CUDA_HOME should be set to point to your NVIDIA Toolkit installation and $CUDA_HOME/bin/ should be in your PATH.

Building and Testing the Default SPDP Precision Model

Single GPU (pmemd.cuda)

Assuming you have a working CUDA installation, you can build the single GPU version, pmemd.cuda, using the default precision model as follows:
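Before starting the build, the environment requirement stated above (CUDA_HOME set, and $CUDA_HOME/bin on the PATH) can be checked with a short script. The following is a minimal sketch; the function name check_cuda_env is an illustrative assumption and the script is not part of Amber.

```python
import os


def check_cuda_env(env) -> list:
    """Return a list of problems with the CUDA-related environment (empty if none)."""
    problems = []
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
        return problems
    cuda_bin = os.path.normpath(os.path.join(cuda_home, "bin"))
    # Normalise each PATH entry so that trailing slashes do not matter.
    path_dirs = [os.path.normpath(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if cuda_bin not in path_dirs:
        problems.append(f"{cuda_bin} is not in PATH")
    return problems


if __name__ == "__main__":
    for problem in check_cuda_env(os.environ):
        print("WARNING:", problem)
```

The check only inspects environment variables; it does not verify that the toolkit itself works, for which deviceQuery remains the appropriate test.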
Next you can run the tests using the default GPU (the one with the largest memory) with:
The majority of these tests should pass. However, given the parallel nature of GPUs, meaning the order of operations is not well defined, and the limited precision of the SPDP precision model, it is not uncommon for there to be several possible failures. You may also see some tests, particularly the GB nucleosome test, fail on GPUs with limited memory. You should inspect the diff file created in the $AMBERHOME/test/logs/test_amber_cuda/ directory to manually verify any possible failures. Differences which occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or any you are unsure about, should be posted to the Amber mailing list for comment.

Multiple GPU (pmemd.cuda.MPI) [bugfix.9 or later!]

Once you have built and tested the serial GPU version you can build the parallel version. The instructions here assume that you have applied ALL the bugfixes for AMBER and that you can already successfully build the MPI version of the CPU code. If you cannot, then you should focus on solving this before you move on to attempting to build the parallel GPU version.

The parallel GPU version of the code works using MPI and requires support for MPI v2.0 or later. We recommend using MVAPICH2, Intel MPI or MPICH2. OpenMPI tends to give poor performance and may not support all MPI v2.0 features. Additionally, if you plan on running over multiple nodes you should enable GPU Direct if your hardware supports it. Your MPI installation must also have been configured to support Fortran 90, C and C++, and all three compilers must be compatible. E.g. mixing GNU C/C++ with Intel Fortran 90 is NOT supported. You can build the multi-GPU code as follows:
Next you can run the tests using GPUs enumerated sequentially within a node (if you have multiple nodes or more complex GPU setups within a node then you should refer to the discussion below on running on multiple GPUs):
Segfaults in Parallel: If you find that runs across multiple nodes (i.e. using the InfiniBand adapter) segfault almost immediately, then this is most likely an issue with GPU Direct v2 (CUDA v4.0) not being properly supported by your hardware and driver installations. In most cases setting the following environment variable on all nodes (put it in your .bashrc) will fix the problem:
Building non-standard Precision Models

You can build different precision models as described below. However, be aware that this is meant largely as a debugging and testing feature and NOT for running production calculations. Please post any questions or comments you may have regarding this to the Amber mailing list. You should also be aware that the variation in test case results due to rounding differences will be markedly higher when testing the SPSP precision model. You select which precision model to compile as follows:
This will produce executables named pmemd.cuda_XXXX where XXXX is the precision model selected at configure time (SPSP or DPDP). You can then test this on the GPU with the most memory as follows:
Testing Alternative GPUs

Should you wish to run the serial GPU tests on a GPU different from the one with the most memory (and lowest GPU ID if more than one identical GPU exists), you can provide the GPU ID as the first argument to the test script. For example, to test the GPU with ID = 2 and the default SPDP precision model you would specify:
This would be the same as using the executable '$AMBERHOME/bin/pmemd.cuda' with the command line flag '-gpu 2' as described below. Testing on different GPUs in parallel is slightly more complicated, and it is recommended that you use the CUDA_VISIBLE_DEVICES environment variable, as discussed below, to control which GPU is used for testing. This approach can also be used with single GPU runs.
Running GPU Accelerated Simulations

Single GPU

In order to run a single GPU accelerated MD simulation the only change required is to use the executable pmemd.cuda in place of pmemd. E.g.
This will automatically run the calculation on the GPU with the most memory, even if that GPU is already in use. If you have only a single CUDA capable GPU in your machine then this is fine. However, if you want to control which GPU is used (for example, you have a Tesla C2050 (3GB) and a Tesla C1060 (4GB) in the same machine and want to use the C2050, which has less memory), or you want to run multiple independent simulations on different GPUs, then you need to specify the GPU ID manually, using either the -gpu command line option or the CUDA_VISIBLE_DEVICES environment variable. Both approaches are described below:
In this way it is possible to make use of multiple GPUs in a single node, such as those provided by a Tesla S1070, for multiple simultaneous calculations.

An alternative approach, and the one recommended both for parallel runs and for running within a batch queuing system, is to use the CUDA_VISIBLE_DEVICES environment variable, which lists the devices that are visible as a comma-separated string. For example, if your desktop has two Tesla cards and a Quadro:
By setting the environment variable, you can make only a subset of them visible to the runtime:
Hence if you wanted to run two pmemd.cuda runs, with the first running on the C2050 and the second on the C1060 you would run as follows:
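When jobs are launched from a script rather than typed at a shell prompt, the same idea can be expressed by giving each run its own environment. The following is a minimal sketch in Python; the helper names gpu_job and launch are illustrative assumptions, as are the device IDs and file names in the usage note. Check the deviceQuery output on your machine to find which ID belongs to which card.

```python
import os
import subprocess


def gpu_job(gpu_ids, pmemd_args, executable="pmemd.cuda"):
    """Build the command line and environment for a run restricted to gpu_ids."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return [executable, *pmemd_args], env


def launch(gpu_ids, pmemd_args):
    """Start the run in the background and return the process handle."""
    cmd, env = gpu_job(gpu_ids, pmemd_args)
    return subprocess.Popen(cmd, env=env)
```

For example, launch([0], ["-O", "-i", "mdin", "-o", "run1.out", "-p", "prmtop", "-c", "inpcrd"]) followed by launch([1], [...]) with a second set of output files would start two independent runs, each seeing exactly one GPU.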
In this way you only ever expose a single GPU to the pmemd.cuda executable and so avoid issues with multiple runs landing on the same GPU. This approach is the basis of how you can control GPU usage in parallel runs.

Multi GPU

When running in parallel across multiple GPUs and/or nodes, the selection of which GPU to run on becomes significantly more complicated. The -gpu command line option will NOT work in parallel; instead we recommend you use the CUDA_VISIBLE_DEVICES environment variable. Ideally you would have a batch scheduling system that sets everything up for you correctly; however, the following instructions can be used to control this yourself. To understand how to control which GPUs are used, it is first necessary to understand the GPU scheduling algorithm used by pmemd.cuda.MPI and where to look in the mdout file to verify which GPUs are actually being used.

When running in parallel, pmemd.cuda.MPI keeps track of the GPU IDs already assigned on each node and will not reuse a GPU unless there are insufficient GPUs available for the number of MPI threads specified. Currently the assignment is 1 GPU per MPI thread, so if you want to run on 4 GPUs you would run with 4 MPI threads. It is your responsibility to ensure, when running across multiple nodes, that threads are handed out sequentially to alternate nodes. For example, if you had 4 dual quad core nodes, each with 1 GPU, it is essential that your MPI nodefile be laid out such that 'mpirun -np 4' will give you 1 thread on each of the 4 nodes and NOT 4 threads on the first node. This is normally accomplished by listing each node once in the nodefile. E.g. if you had 2 GPUs per node and 2 nodes then, to run across 4 GPUs, your nodefile would typically look like this:
You would then run the code with (note the exact commands used will depend on your MPI installation):
This would allocate GPUs to threads and nodes as follows:
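The scheduling rule described above (threads handed to alternate nodes, each thread taking the next unused GPU ID on its node) can be modelled with a few lines of Python. This is a sketch of the described behaviour, not the pmemd.cuda.MPI source. The function name assign_gpus and the node names are illustrative assumptions, and the model assumes the MPI launcher walks the nodefile entries in order and wraps around.

```python
from collections import defaultdict


def assign_gpus(nodefile, n_ranks, gpus_per_node):
    """Return (rank, node, gpu_id) tuples for ranks placed round-robin over nodefile."""
    used = defaultdict(int)  # number of GPUs already handed out on each node
    layout = []
    for rank in range(n_ranks):
        node = nodefile[rank % len(nodefile)]
        gpu_id = used[node] % gpus_per_node  # wraps around only once GPUs run out
        used[node] += 1
        layout.append((rank, node, gpu_id))
    return layout


for rank, node, gpu_id in assign_gpus(["node0", "node1"], n_ranks=4, gpus_per_node=2):
    print(f"MPI rank {rank}: {node}, GPU ID {gpu_id}")
```

With 2 nodes and 2 GPUs per node this places ranks 0 and 1 on GPU ID 0 of node0 and node1 respectively, and ranks 2 and 3 on GPU ID 1 of the same two nodes.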
You can check which GPUs are in use on which nodes by looking at the GPU Information section at the beginning of the mdout file. For example, here is what the output looks like for a run on 4 GPUs with 2 x M2050 GPUs per node:

|------------------- GPU DEVICE INFO --------------------
|
| Task ID: 0
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2050
| CUDA Device Global Mem Size: 3071 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|
| Task ID: 1
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2050
| CUDA Device Global Mem Size: 3071 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|
| Task ID: 2
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 1
| CUDA Device Name: Tesla M2050
| CUDA Device Global Mem Size: 3071 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|
| Task ID: 3
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 1
| CUDA Device Name: Tesla M2050
| CUDA Device Global Mem Size: 3071 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|--------------------------------------------------------

As you can see here, the first MPI thread (Task ID 0) used GPU ID 0, as did the second (Task ID 1), because the second thread was running on a different node. GPU ID 1 was then used for the third and fourth MPI threads (Task IDs 2 and 3), which again run on different nodes. Hence if your nodes are homogeneous, with exactly the same GPUs on all nodes, running in parallel is simply a case of running the same number of MPI threads as you want GPUs and ensuring that the threads are correctly distributed to each node. The selection of GPUs to use will be automatic. The same is true if you want to run on a single node. Suppose you had 4 GPUs, all the same, on a single node. Then you could just run with mpirun -np 4 and all 4 GPUs would be used.
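Checking this section by eye becomes tedious for many runs, so it can help to extract the task-to-device mapping automatically. The following Python sketch assumes only the field labels visible in the output above (Task ID, CUDA Device ID in use, CUDA Device Name). The function name gpus_in_use is an illustrative assumption, and the exact spacing in a real mdout file may differ.

```python
import re

# Assumes the field labels shown in the GPU DEVICE INFO section above.
_GPU_BLOCK = re.compile(
    r"Task ID:\s*(\d+).*?"
    r"CUDA Device ID in use:\s*(\d+).*?"
    r"CUDA Device Name:\s*([^|\n]+)",
    re.DOTALL,
)


def gpus_in_use(mdout_text):
    """Map each Task ID to the (device ID, device name) it reports."""
    return {
        int(task): (int(device), name.strip())
        for task, device, name in _GPU_BLOCK.findall(mdout_text)
    }


sample = """\
| Task ID: 0
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2050
|
| Task ID: 1
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2050
"""
for task, (device, name) in sorted(gpus_in_use(sample).items()):
    print(f"task {task}: device {device} ({name})")
```

In practice the text would come from reading the mdout file, for example with open("mdout").read(), where the file name is whatever was passed to the -o option.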
What if I have a more complex setup or want more control over which GPUs are used?

In this case you should use the CUDA_VISIBLE_DEVICES environment variable as described above. For example, if you had 2 nodes, each with a C1060, 2 C2050s and an FX3800, and you wanted to run across the 4 C2050s (assuming you have a decent interconnect such as Gen 2 QDR IB between the nodes), then you would proceed as follows. First obtain the native GPU IDs (this assumes the device IDs are the same on each node; if they are not, you will need a more complex script for setting the environment variables):
In this case you would use the following command to limit the visible GPUs to device IDs 0 and 2, which will be re-enumerated as 0 and 1:
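The re-enumeration can be confusing when comparing native device IDs with the IDs reported in the mdout file. The following small Python sketch shows the mapping; the function name renumbered_ids is an illustrative assumption. It relies on the CUDA runtime numbering the visible devices from 0 in the order they are listed.

```python
def renumbered_ids(visible_devices):
    """Map native device IDs to the IDs seen by the CUDA runtime."""
    native = [int(x) for x in visible_devices.split(",") if x.strip()]
    return {native_id: new_id for new_id, native_id in enumerate(native)}


print(renumbered_ids("0,2"))  # {0: 0, 2: 1}
```

Here native device 2 appears to the application as device 1, which is the ID that will be shown in the GPU DEVICE INFO section of the mdout file.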
If you were running a 2 GPU job on just the one node then at this point you could just run with 'mpirun -np 2'. However, if you want to run across the 2 nodes you need to ensure that the environment variables are propagated to both nodes. One simple option is to edit your login scripts, such as .bashrc or .cshrc, and set the environment variable there, so that when your mpirun command connects to each node with SSH the environment variable is automatically set. This, however, can be tedious. The more generic approach, if you are NOT using a queuing system which automatically exports environment variables to all nodes, is to pass it as part of your MPI run command. The method for doing this will vary depending on your MPI installation and you should check the documentation for your MPI to see how to do this ('mpirun --help' can often provide you with the information needed). The following example is for mvapich2 v1.5:
Again you can check the mdout file to make sure the GPUs you expect to be used actually are.

Note: When running PME calculations (ntb>0) in parallel, a power of two GPUs (and thus MPI threads) must be used. For GB simulations any number of threads can be used as long as there are 25x more atoms than GPUs.
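These two rules are easy to check before submitting a job. The following Python sketch encodes them; the function name check_gpu_count is an illustrative assumption, and "25x more atoms than GPUs" is interpreted here as requiring at least 25 atoms per GPU.

```python
def check_gpu_count(n_gpus, n_atoms, ntb):
    """Return None if the GPU count is acceptable, else a message explaining why not."""
    if n_gpus < 1:
        return "at least one GPU is required"
    if n_gpus == 1:
        return None  # the rules above are stated for parallel runs only
    if ntb > 0:  # PME calculation
        if (n_gpus & (n_gpus - 1)) != 0:  # bit trick: true unless n_gpus is a power of two
            return "PME runs need a power-of-two number of GPUs"
    elif n_atoms < 25 * n_gpus:  # GB calculation
        return "GB runs need at least 25 atoms per GPU"
    return None
```

For example, check_gpu_count(3, 23000, ntb=1) returns a message because 3 is not a power of two, while check_gpu_count(3, 2500, ntb=0) returns None.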
Considerations for Maximizing GPU Performance

There are a number of considerations above and beyond those typically used on a CPU for maximizing the performance achievable for a GPU accelerated PMEMD simulation. The following provides some tips for ensuring good performance.
Recommended Hardware

The AMBER GPU implementation has been designed to work on a huge range of hardware. Essentially the only thing you need is an NVIDIA GPU supporting hardware revision 1.3 or 2.0 or later, as detailed above. However, there are some considerations when it comes to maximizing performance both in serial and in parallel.

Serial

In serial the performance of each independent AMBER GPU job is, assuming NTPR, NTWX etc. are large enough, mostly independent of the underlying motherboard chipset, the PCI-E bandwidth and the number of GPUs per node. At a minimum you need at least one free core per node. If you are building small desktops to run serial calculations then multiple GPUs per node will be the most cost effective. Ideally you should still try to keep the GPUs in x16 PCI-E slots and make sure your power supply is sufficient to power all the GPUs under full load.

Parallel

In parallel the considerations shift to the available bandwidth within the node and between nodes. The ideal specification for performance is 1 GPU per node in an x16 slot and 1 QDR IB card per node in an x16 slot. This is obviously expensive, however, so the best price / performance tradeoff is probably to install 2 GPUs per node (both in x16) and 1 QDR IB card per node (also in x16). 2 QDR IB cards per node would be even better, except that there is currently a bug in the GPU Direct implementation that causes segfaults when trying to use two IB cards per node.

Building your own System

If you wish to build your own system for running GPU AMBER from parts then your main considerations are a suitable motherboard, a processor with at least 1 core per GPU and a power supply beefy enough to run everything. For a simple 2 GPU custom system I recommend the following (prices as of Sept 2011):
Total Price: $2510.86 (as of Sept 2011)

This hardware is what was used to generate the 1 x GTX580 and 2 x GTX580 benchmark numbers on the Main GPU Benchmark page.

Purchasing Professional Workstations and Cluster Nodes - MD SimCluster

If you are purchasing heavy duty workstations or installing clusters where you need high availability, reliability, warranty, service etc. then you should consider purchasing the Tesla GPUs (which also have more memory and use less power). You should also consider getting a preconfigured and pre-tested system. In order to simplify this process we have teamed up with NVIDIA and their partners to design a series of systems referred to as MD SimClusters. These are professional configurations (suitable for research group, department or centralized computing services use) where the hardware has been chosen to offer the best tradeoff between price, performance, power use and reliability. These machines can be ordered preconfigured with AMBER, cluster management and queuing software. They can also be customized as needed.
At the time of writing these machines are available from Exxact Corporation and Microway. The performance on these clusters should be approximately the same as that given for the M2090 (2 GPU per node) numbers on the main GPU benchmark page. However, we hope to add vendor specific benchmarks shortly.

Free test drives are available on MD SimCluster machines so that you can test the performance of AMBER running your own systems. More information on the MD SimCluster Test Drive Program is available through NVIDIA (http://www.bit.ly/simcluster_amber).
Benchmarks

Benchmarks are available on the following page.