Installation and Testing
The single GPU version of PMEMD is called pmemd.cuda, while the multi-GPU version is called pmemd.cuda.MPI.
We recommend that you build a serial (CPU) version of pmemd first. This will help ensure that basic issues relating to standard compilation on your hardware and operating system are not confused with GPU-related compilation and testing problems. You should also be reasonably familiar with Amber's compilation and test procedures.
It is assumed that you have already correctly installed and tested CUDA support on your GPU. The environment variable CUDA_HOME should be set to point to your NVIDIA Toolkit installation, and ${CUDA_HOME}/bin/ should be in your PATH (an example setup is shown below). CUDA versions 7.5 to 11.x are supported. Note that the CPU compiler has little effect on the performance of the GPU code, so while the Intel compilers are supported for building the GPU code, the recommended approach is to use the GNU compilers.
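For example, a typical environment setup in bash might look like the following (the /usr/local/cuda path here is only an assumption; point it at your actual toolkit location):
export CUDA_HOME=/usr/local/cuda      # root of the NVIDIA Toolkit installation
export PATH=${CUDA_HOME}/bin:${PATH}  # makes nvcc and the other toolkit binaries visible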
Building and Testing the GPU code
Assuming you have a working CUDA installation, you can build both precision models (pmemd.cuda_SPFP and pmemd.cuda_DPFP) by editing your run_cmake script to set "-DCUDA=TRUE". Then re-run ./run_cmake and make install.
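Putting these steps together, a minimal serial GPU build might look like this (a sketch, assuming you work from the directory that contains the run_cmake script):
# After editing run_cmake so that the cmake invocation includes -DCUDA=TRUE:
./run_cmake
make install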
Next, you can run the tests using the default GPU (the one with the largest memory)
with:
cd $AMBERHOME; make test.cuda.serial
The majority of these tests should pass. However, given the parallel nature of GPUs (which means the order of operations is not well defined), the limited precision of the SPFP precision model, and variations in the random number generator on different GPU hardware, it is not uncommon to see a few failures, although substantially fewer than with earlier versions of Amber. You may also see some tests, particularly the cellulose test, fail on GPUs with limited memory. You should inspect the diff file created in the $AMBERHOME/logs/test_amber_cuda/ directory (see the example after this paragraph) to manually verify any possible failures. Differences which occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or cases where you are unsure, should be posted to the Amber mailing list for comment.
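For example, a quick way to review the reported differences might be (the log and diff file names vary from run to run, so treat the names below as placeholders):
cd $AMBERHOME/logs/test_amber_cuda/
ls -lt            # newest log and diff files first
less <diff-file>  # inspect the reported differences line by line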
Testing Alternate GPUs
Should you wish to run the tests on a GPU other than the one with the most memory (and lowest GPU ID if more than one identical GPU exists), you should make use of the CUDA_VISIBLE_DEVICES environment variable as described below. For example, to test the GPU with ID = 2 (GPUs are numbered sequentially from zero) with the default SPFP precision model, you would run the tests as follows:
cd $AMBERHOME
export CUDA_VISIBLE_DEVICES=2
make test.cuda.serial
Multiple GPU (pmemd.cuda.MPI)
Once you have built and tested the serial GPU version you can optionally build the parallel version (if you have multiple GPUs of the same model). Unlike the CPU code, it is not necessary to build the parallel version of the GPU code in order to access specific simulation options (except REMD). Thus you only need to build the parallel GPU code if you plan to run GB simulations or replica exchange across multiple GPUs. Note that, at the time of writing, while a single MD calculation can span multiple GPUs, single-GPU performance is now so good that advances in the interconnect between GPUs have not kept pace; with the exception of large implicit solvent simulations, there is little to be gained from trying to run a calculation across multiple GPUs. If you have 4 GPUs in a single node, the best way to utilize them is to run 4 individual calculations from different starting structures / random seeds.
The instructions here assume that you can already successfully build the MPI version of
the CPU code. If you cannot, then you should focus on solving this before you move on to
attempting to build the parallel GPU version. The parallel GPU version of the code works
using MPI v1 or later. You can build the multi-GPU code by setting
both -DMPI and -DCUDA to TRUE in your run_cmake script.
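For example, the relevant change is simply to pass both flags together before rebuilding (a sketch; leave the rest of your run_cmake options as they were):
# In run_cmake, make sure the cmake invocation includes:
#     -DMPI=TRUE -DCUDA=TRUE
./run_cmake
make install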
Next you can run the tests using GPUs enumerated sequentially within a node (if you have
multiple nodes or more complex GPU setups within a node then you should refer to the
discussion below on running on multiple GPUs):
export DO_PARALLEL='mpirun -np 2' # for bash/sh
setenv DO_PARALLEL 'mpirun -np 2' # for csh/tcsh
make test.cuda.parallel
Running GPU-Accelerated Simulations
Single GPU
In order to run a single GPU accelerated MD simulation, the only change required is to use the executable pmemd.cuda in place of pmemd. For example:
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd \
-r restrt -x mdcrd -inf mdinfo
This will automatically run the calculation on the GPU with the most memory, even if that GPU is already in use (see below for system settings that let the code automatically select unused GPUs). If you have only a single CUDA-capable GPU in your machine then this is fine.
If you want to control which GPU is used, you need to specify the GPU ID via the CUDA_VISIBLE_DEVICES environment variable. For example, suppose you have a Tesla C2050 (3GB) and a Tesla C2070 (6GB) in the same machine and want to use the C2050, which has less memory. Or, you may want to run multiple independent simulations using different GPUs. The environment variable CUDA_VISIBLE_DEVICES lists which devices are visible, as a comma-separated string. For example, if your desktop has two Tesla cards and a Quadro (using the deviceQuery function from the NVIDIA CUDA Samples):
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C2070"
Device 2: "Quadro FX 3800"
By setting CUDA_VISIBLE_DEVICES you can make only a subset of them visible to the
runtime:
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"
Hence, if you wanted to run two pmemd.cuda jobs, with the first running on the C2050 and the second on the C2070, you would run them as follows:
$ export CUDA_VISIBLE_DEVICES="0"
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd -inf mdinfo &
$ export CUDA_VISIBLE_DEVICES="1"
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd -inf mdinfo &
In this way you only ever expose a single GPU to the pmemd.cuda executable, and so avoid issues with multiple runs landing on the same GPU. This approach is the basis of how you control GPU usage in parallel runs. If you want to know which GPU a calculation is running on, you can check the value of CUDA_VISIBLE_DEVICES and other GPU-specific information reported at the beginning of the mdout file. To check whether a GPU is in use or not you can use the nvidia-smi command. For Tesla series GPUs (K20, K40, etc.) and more recent Maxwell, Pascal and Turing cards (GTX-Titan-X, GTX-1080, Titan-XP, RTX-2080Ti, etc.) this provides % load information and process information for each GPU. For older GeForce series cards this information is not available via nvidia-smi, so it is better to just check the GPU memory usage and temperature. You can do this with:
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 337.12 Driver Version: 337.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 N/A | N/A |
| 26% 32C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 N/A | N/A |
| 0% 35C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
In this case both GPUs are idle (temperature is < 50C and memory usage is < 50MiB). If only GPU 0 were in use, it might look something like this:
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 337.12 Driver Version: 337.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 N/A | N/A |
| 67% 80C N/A N/A / N/A | 307MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 N/A | N/A |
| 0% 35C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
Note that running X11 can confuse things a little, showing slightly higher memory usage on one of the GPUs, but it should still be possible to use this approach to determine which GPU is in use.
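If you prefer a compact summary, nvidia-smi also supports a query mode; for example (whether individual fields are populated depends on your driver version and GPU model):
$ nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,utilization.gpu --format=csv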
An alternative approach, and the one recommended if you have multiple GPUs in a single node and only want to run single GPU jobs, is to set the GPUs to Persistence and Compute Exclusive Modes. In this mode a GPU will reject more than one job:
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.1 -x mdcrd.1 -inf mdinfo.1 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.2 -x mdcrd.2 -inf mdinfo.2 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.3 -x mdcrd.3 -inf mdinfo.3 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.4 -x mdcrd.4 -inf mdinfo.4 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.5 -x mdcrd.5 -inf mdinfo.5 &
In this situation calculation 1 will run on GPU 0, calculation 2 on GPU 1, calculation 3 on GPU 2 and calculation 4 on GPU 3. The 5th job will quit immediately with an error stating that no free GPUs are available:
cudaMemcpyToSymbol: SetSim copy to cSim failed all CUDA-capable devices are busy or
unavailable
This approach is useful since you do not have to worry about setting CUDA_VISIBLE_DEVICES, and you do not have to worry about accidentally running multiple jobs on the same GPU. The code will automatically look for GPUs that are not in use, use them, and quit if sufficient free GPUs are not available. You set Persistence and Compute Exclusive Modes by running the following as root:
$ nvidia-smi -pm 1
$ nvidia-smi -c 3
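You can verify that the settings took effect with, for example (the exact output format varies with driver version):
$ nvidia-smi -q -d COMPUTE   # should report Exclusive Process compute mode for each GPU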
The disadvantage of this is that you need root permissions to set it, and the setting is lost on a reboot. We recommend that you add these settings to your system's startup scripts. On Red Hat or CentOS this can be accomplished with:
echo "nvidia-smi -pm 1" >> /etc/rc.d/rc.local
echo "nvidia-smi -c 3" >> /etc/rc.d/rc.local
Note that this approach also works on clusters where the queuing system understands GPUs as resources and thus can keep track of the total GPUs allocated, but does not control which GPU you see on a node. The downside of doing this is that it prevents multi-GPU runs from being run.
Multi GPU
The way in which a single calculation runs across multiple GPUs was changed in AMBER 14
and the new approach has been kept in AMBER 16 & 18. When the AMBER multi-GPU support
was originally designed the PCI-E bus speed was gen 2 x16 and the GPUs were C1060 or C2050s.
Since then GPUs have advanced to the Titan-XP, GTX-1080 and M40, which are roughly 16x faster than the C1060 (more if we account for AMBER 18 being >40% faster than AMBER 14), while the PCI-E bus speed has only increased by 2x (PCI-E Gen 2 x16 to Gen 3 x16) and InfiniBand interconnects by about the same amount. NVLink is also insufficient, as well as being ridiculously expensive. This unfortunate situation is all too common in parallel machines, and yet many machine designers do not seem to appreciate the problems it places on software designers. The situation is now such that for PME calculations using traditional MPI it is not possible for AMBER to scale over multiple GPUs. In AMBER 14 and 16 we chose to address this problem by focusing on peer to peer communication within a node. This mode of operation allows GPUs to communicate directly through the PCI-E bus, without going through the CPU chipset, which adds too much latency. Assuming all your GPUs have PCI-E Gen 3 x16 bandwidth and support non-blocking peer to peer communication, it is possible to scale to 4 GPUs within a single node, although the scaling may be far from ideal and is only achievable for pre-Pascal generation GPUs. As of AMBER 18, with 2018 or later GPUs, you are unlikely to see any benefit beyond a single GPU, although the design of AMBER is such that it still makes sense from a cost perspective to put 4 or 8 GPUs in a single node and run multiple independent simulations on it.
Note that the following section is deprecated, since Pascal and later GPUs do not scale beyond a single GPU; however, we leave it here for completeness.
It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup from using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale. We suggest saving money by skipping the expensive InfiniBand interconnect and instead going for GPU-dense nodes. Note that the standard 4 x GPU systems, such as those featured on the recommended hardware page, are designed to support 2 x 2 GPU runs at once using peer to peer, or 4 x 1 GPU runs.
When running in parallel across multiple GPUs, the selection of which GPU to run on becomes a little more complicated. We have attempted to simplify things for you by writing a short CUDA program that checks which GPUs in your node can support peer to peer communication. You should download and build this as follows. Assuming you have the CUDA toolkit correctly installed, this should build the executable gpuP2PCheck:
- Download check_p2p.tar.bz2 and save it to your home directory.
- Untar with tar xvjf check_p2p.tar.bz2.
- cd check_p2p
- make
On most machines, this will find pairs of GPUs in the order 0+1 and 2+3. In other words, on a 4 GPU machine you can run a total of two 2-GPU jobs, one on GPUs 0 and 1 and one on GPUs 2 and 3. Running a calculation across more than 2 GPUs will result in peer to peer being switched off, which will likely mean the calculation runs slower than if it had been run on a single GPU. To see which GPUs in your system can communicate via peer to peer, you can run the gpuP2PCheck program you built above. This reports which GPUs can talk to each other. For example:
CUDA_VISIBLE_DEVICES is unset.
CUDA-capable device count: 4
GPU0 "GeForce GTX TITAN"
GPU1 "GeForce GTX TITAN"
GPU2 "GeForce GTX TITAN"
GPU3 "GeForce GTX TITAN"
Two way peer access between:
GPU0 and GPU1: YES
GPU0 and GPU2: NO
GPU0 and GPU3: NO
GPU1 and GPU2: NO
GPU1 and GPU3: NO
GPU2 and GPU3: YES
In this case, GPUs 0 and 1 can talk to each other and GPUs 2 and 3 can talk to each
other. To run a 2 GPU job on GPUs 0 and 1, run with:
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p ...
If peer to peer communication is working, you will see the following reported in your mdout file.
|---------------- GPU PEER TO PEER INFO -----------------
|
| Peer to Peer support: ENABLED
|
|--------------------------------------------------------
Since peer to peer communication does not involve the CPU chipset, it is possible, unlike in previous versions of AMBER, to run multiple multi-GPU runs on a single node. In our 4 GPU example above, we saw that GPU pairs (0, 1) and (2, 3) can communicate via peer-to-peer. We can run the following combinations without the jobs interfering with each other (the performance of each job will be unaffected by the others):
# Option 1: All single GPU
export CUDA_VISIBLE_DEVICES=0
cd run1
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=1
cd run2
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=2
cd run3
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=3
cd run4
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
# Option 2: One dual GPU and 2 single GPU runs
export CUDA_VISIBLE_DEVICES=0,1
cd run1
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &
export CUDA_VISIBLE_DEVICES=2
cd run2
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=3
cd run3
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
# Option 3: Two dual GPU runs
export CUDA_VISIBLE_DEVICES=0,1
cd run1
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &
export CUDA_VISIBLE_DEVICES=2,3
cd run2
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &