Installation and Testing
The single GPU version of PMEMD is called pmemd.cuda, while the multi-GPU version is called pmemd.cuda.MPI.
We recommend that you build a serial (CPU) version of pmemd first. This will help ensure that basic issues relating to standard compilation on your hardware and operating system are not confused with GPU-related compilation and testing problems. You should also be reasonably familiar with Amber's compilation and test procedures.
It is assumed that you have already correctly installed and tested CUDA support on your GPU. The environment variable CUDA_HOME should be set to point to your NVIDIA Toolkit installation, and ${CUDA_HOME}/bin/ should be in your PATH (an example setup is shown below). CUDA versions 7.5 to 11.x are supported. Note that the CPU compiler has little effect on the performance of the GPU code, so while the Intel compilers are supported for building the GPU code, the recommended approach is to use the GNU compilers.
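For example, a typical environment setup in bash might look like the following (the /usr/local/cuda path here is only an assumption; point it at your actual toolkit location):
export CUDA_HOME=/usr/local/cuda      # root of the NVIDIA Toolkit installation
export PATH=${CUDA_HOME}/bin:${PATH}  # makes nvcc and the other toolkit binaries visible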
Building and Testing the GPU code
Assuming you have a working CUDA installation, you can build both precision models (pmemd.cuda_SPFP and pmemd.cuda_DPFP) by editing your run_cmake script to set "-DCUDA=TRUE". Then re-run ./run_cmake and make install.
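Putting these steps together, a minimal serial GPU build might look like this (a sketch, assuming you work from the directory that contains the run_cmake script):
# After editing run_cmake so that the cmake invocation includes -DCUDA=TRUE:
./run_cmake
make install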
Next, you can run the tests using the default GPU (the one with the largest memory)
with:
cd $AMBERHOME; make test.cuda.serial
The majority of these tests should pass. However, given the parallel nature of GPUs (which means the order of operations is not well defined), the limited precision of the SPFP precision model, and variations in the random number generator on different GPU hardware, it is not uncommon to see a few failures, although substantially fewer than with earlier versions of Amber. You may also see some tests, particularly the cellulose test, fail on GPUs with limited memory. You should inspect the diff file created in the $AMBERHOME/logs/test_amber_cuda/ directory (see the example after this paragraph) to manually verify any possible failures. Differences which occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or cases where you are unsure, should be posted to the Amber mailing list for comment.
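For example, a quick way to review the reported differences might be (the log and diff file names vary from run to run, so treat the names below as placeholders):
cd $AMBERHOME/logs/test_amber_cuda/
ls -lt            # newest log and diff files first
less <diff-file>  # inspect the reported differences line by line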
Testing Alternate GPUs
Should you wish to run the tests on a GPU other than the one with the most memory (and lowest GPU ID if more than one identical GPU exists), you should make use of the CUDA_VISIBLE_DEVICES environment variable as described below. For example, to test the GPU with ID = 2 (GPUs are numbered sequentially from zero) with the default SPFP precision model, you would run the tests as follows:
cd $AMBERHOME
export CUDA_VISIBLE_DEVICES=2
make test.cuda.serial
Multiple GPU (pmemd.cuda.MPI)
Once you have built and tested the serial GPU version you can optionally build the parallel version (if you have multiple GPUs of the same model). Unlike the CPU code, it is not necessary to build the parallel version of the GPU code in order to access specific simulation options (except REMD). Thus you only need to build the parallel GPU code if you plan to run GB simulations or replica exchange across multiple GPUs. Note that, at the time of writing, while a single MD calculation can span multiple GPUs, single-GPU performance is now so good that advances in the interconnect between GPUs have not kept pace; with the exception of large implicit solvent simulations, there is little to be gained from trying to run a calculation across multiple GPUs. If you have 4 GPUs in a single node, the best way to utilize them is to run 4 individual calculations from different starting structures / random seeds.
The instructions here assume that you can already successfully build the MPI version of
the CPU code. If you cannot, then you should focus on solving this before you move on to
attempting to build the parallel GPU version. The parallel GPU version of the code works
using MPI v1 or later. You can build the multi-GPU code by setting
both -DMPI and -DCUDA to TRUE in your run_cmake script.
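For example, the relevant change is simply to pass both flags together before rebuilding (a sketch; leave the rest of your run_cmake options as they were):
# In run_cmake, make sure the cmake invocation includes:
#     -DMPI=TRUE -DCUDA=TRUE
./run_cmake
make install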
Next you can run the tests using GPUs enumerated sequentially within a node (if you have
multiple nodes or more complex GPU setups within a node then you should refer to the
discussion below on running on multiple GPUs):
export DO_PARALLEL='mpirun -np 2' # for bash/sh
setenv DO_PARALLEL 'mpirun -np 2' # for csh/tcsh
make test.cuda.parallel
Running GPU-Accelerated Simulations
Single GPU
In order to run a single GPU accelerated MD simulation, the only change required is to use the executable pmemd.cuda in place of pmemd. For example:
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd \
-r restrt -x mdcrd -inf mdinfo
This will automatically run the calculation on the GPU with the most memory, even if that GPU is already in use (see below for system settings that let the code automatically select unused GPUs). If you have only a single CUDA-capable GPU in your machine then this is fine.
If you want to control which GPU is used, you need to specify the GPU ID via the CUDA_VISIBLE_DEVICES environment variable. For example, suppose you have a Tesla C2050 (3GB) and a Tesla C2070 (6GB) in the same machine and want to use the C2050, which has less memory. Or, you may want to run multiple independent simulations using different GPUs. The environment variable CUDA_VISIBLE_DEVICES lists which devices are visible, as a comma-separated string. For example, if your desktop has two Tesla cards and a Quadro (using the deviceQuery function from the NVIDIA CUDA Samples):
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C2070"
Device 2: "Quadro FX 3800"
By setting CUDA_VISIBLE_DEVICES you can make only a subset of them visible to the
runtime:
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"
Hence, if you wanted to run two pmemd.cuda jobs, with the first running on the C2050 and the second on the C2070, you would run them as follows:
$ export CUDA_VISIBLE_DEVICES="0"
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd -inf mdinfo &
$ export CUDA_VISIBLE_DEVICES="1"
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
-c inpcrd -r restrt -x mdcrd -inf mdinfo &
In this way you only ever expose a single GPU to the pmemd.cuda executable, and so avoid issues with multiple runs landing on the same GPU. This approach is the basis of how you control GPU usage in parallel runs. If you want to know which GPU a calculation is running on, you can check the value of CUDA_VISIBLE_DEVICES and other GPU-specific information reported at the beginning of the mdout file. To check whether a GPU is in use or not you can use the nvidia-smi command. For Tesla series GPUs (K20, K40, etc.) and more recent Maxwell, Pascal and Turing cards (GTX-Titan-X, GTX-1080, Titan-XP, RTX-2080Ti, etc.) this provides % load information and process information for each GPU. For older GeForce series cards this information is not available via nvidia-smi, so it is better to just check the GPU memory usage and temperature. You can do this with:
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 337.12 Driver Version: 337.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 N/A | N/A |
| 26% 32C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 N/A | N/A |
| 0% 35C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
In this case both GPUs are idle (temperature is < 50C and memory usage is < 50MiB). If only GPU 0 were in use, it might look something like this:
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 337.12 Driver Version: 337.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 N/A | N/A |
| 67% 80C N/A N/A / N/A | 307MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 N/A | N/A |
| 0% 35C N/A N/A / N/A | 15MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
Note that running X11 can confuse things a little, showing slightly higher memory usage on one of the GPUs, but it should still be possible to use this approach to determine which GPU is in use.
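If you prefer a compact summary, nvidia-smi also supports a query mode; for example (whether individual fields are populated depends on your driver version and GPU model):
$ nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,utilization.gpu --format=csv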
An alternative approach, and the one recommended if you have multiple GPUs in a single node and only want to run single GPU jobs, is to set the GPUs to Persistence and Compute Exclusive Modes. In this mode a GPU will reject more than one job:
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.1 -x mdcrd.1 -inf mdinfo.1 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.2 -x mdcrd.2 -inf mdinfo.2 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.3 -x mdcrd.3 -inf mdinfo.3 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.4 -x mdcrd.4 -inf mdinfo.4 &
${AMBERHOME}/bin/pmemd.cuda -O -i mdin -o mdout.5 -x mdcrd.5 -inf mdinfo.5 &
In this situation calculation 1 will run on GPU 0, calculation 2 on GPU 1, calculation 3 on GPU 2 and calculation 4 on GPU 3. The 5th job will quit immediately with an error stating that no free GPUs are available:
cudaMemcpyToSymbol: SetSim copy to cSim failed all CUDA-capable devices are busy or
unavailable
This approach is useful since you do not have to worry about setting CUDA_VISIBLE_DEVICES, and you do not have to worry about accidentally running multiple jobs on the same GPU. The code will automatically look for GPUs that are not in use, use them, and quit if sufficient free GPUs are not available. You set Persistence and Compute Exclusive Modes by running the following as root:
$ nvidia-smi -pm 1
$ nvidia-smi -c 3
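You can verify that the settings took effect with, for example (the exact output format varies with driver version):
$ nvidia-smi -q -d COMPUTE   # should report Exclusive Process compute mode for each GPU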
The disadvantage of this is that you need root permissions to set it, and the setting is lost on a reboot. We recommend that you add these settings to your system's startup scripts. On Red Hat or CentOS this can be accomplished with:
echo "nvidia-smi -pm 1" >> /etc/rc.d/rc.local
echo "nvidia-smi -c 3" >> /etc/rc.d/rc.local
Note that this approach also works on clusters where the queuing system understands GPUs as resources and thus can keep track of the total GPUs allocated, but does not control which GPU you see on a node. The downside of doing this is that it prevents multi-GPU runs from being run.
Multi GPU
The way in which a single calculation runs across multiple GPUs was changed in AMBER 14
and the new approach has been kept in AMBER 16 & 18. When the AMBER multi-GPU support
was originally designed the PCI-E bus speed was gen 2 x16 and the GPUs were C1060 or C2050s.
Since then GPUs have advanced to the Titan-XP, GTX-1080 and M40, which are roughly 16x faster than the C1060 (more if we account for AMBER 18 being >40% faster than AMBER 14), while the PCI-E bus speed has only increased by 2x (PCI-E Gen 2 x16 to Gen 3 x16) and InfiniBand interconnects by about the same amount. NVLink is also insufficient, as well as being ridiculously expensive. This unfortunate situation is all too common in parallel machines, and yet many machine designers do not seem to appreciate the problems it places on software designers. The situation is now such that for PME calculations using traditional MPI it is not possible for AMBER to scale over multiple GPUs. In AMBER 14 and 16 we chose to address this problem by focusing on peer to peer communication within a node. This mode of operation allows GPUs to communicate directly through the PCI-E bus, without going through the CPU chipset, which adds too much latency. Assuming all your GPUs have PCI-E Gen 3 x16 bandwidth and support non-blocking peer to peer communication, it is possible to scale to 4 GPUs within a single node, although the scaling may be far from ideal and is only achievable for pre-Pascal generation GPUs. As of AMBER 18, with 2018 or later GPUs, you are unlikely to see any benefit beyond a single GPU, although the design of AMBER is such that it still makes sense from a cost perspective to put 4 or 8 GPUs in a single node and run multiple independent simulations on it.
Note that the following section is deprecated, since Pascal and later GPUs do not scale beyond a single GPU; however, we leave it here for completeness.
It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup from using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale. We suggest saving money by skipping the expensive InfiniBand interconnect and instead going for GPU-dense nodes. Note that the standard 4 x GPU systems, such as those featured on the recommended hardware page, are designed to support 2 x 2 GPU runs at once using peer to peer, or 4 x 1 GPU runs.
When running in parallel across multiple GPUs, the selection of which GPU to run on becomes a little more complicated. We have attempted to simplify things for you by writing a short CUDA program that checks which GPUs in your node can support peer to peer communication. You should download and build this as follows. Assuming you have the CUDA toolkit correctly installed, this should build the executable gpuP2PCheck:
- Download check_p2p.tar.bz2 and save it to your home directory.
- Untar with tar xvjf check_p2p.tar.bz2.
- cd check_p2p
- make
On most machines, this will find pairs of GPUs in the order 0+1 and 2+3. In other words, on a 4 GPU machine you can run a total of two 2-GPU jobs, one on GPUs 0 and 1 and one on GPUs 2 and 3. Running a calculation across more than 2 GPUs will result in peer to peer being switched off, which will likely mean the calculation runs slower than if it had been run on a single GPU. To see which GPUs in your system can communicate via peer to peer, you can run the gpuP2PCheck program you built above. This reports which GPUs can talk to each other. For example:
CUDA_VISIBLE_DEVICES is unset.
CUDA-capable device count: 4
GPU0 "GeForce GTX TITAN"
GPU1 "GeForce GTX TITAN"
GPU2 "GeForce GTX TITAN"
GPU3 "GeForce GTX TITAN"
Two way peer access between:
GPU0 and GPU1: YES
GPU0 and GPU2: NO
GPU0 and GPU3: NO
GPU1 and GPU2: NO
GPU1 and GPU3: NO
GPU2 and GPU3: YES
In this case, GPUs 0 and 1 can talk to each other and GPUs 2 and 3 can talk to each
other. To run a 2 GPU job on GPUs 0 and 1, run with:
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p ...
If peer to peer communication is working, you will see the following reported in your mdout file.
|---------------- GPU PEER TO PEER INFO -----------------
|
| Peer to Peer support: ENABLED
|
|--------------------------------------------------------
Since peer to peer communication does not involve the CPU chipset, it is possible, unlike in previous versions of AMBER, to run multiple multi-GPU runs on a single node. In our 4 GPU example above, we saw that GPU pairs (0, 1) and (2, 3) can communicate via peer-to-peer. We can run the following combinations without the jobs interfering with each other (the performance of each job will be unaffected by the others):
# Option 1: All single GPU
export CUDA_VISIBLE_DEVICES=0
cd run1
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=1
cd run2
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=2
cd run3
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=3
cd run4
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
# Option 2: One dual GPU and 2 single GPU runs
export CUDA_VISIBLE_DEVICES=0,1
cd run1
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &
export CUDA_VISIBLE_DEVICES=2
cd run2
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
export CUDA_VISIBLE_DEVICES=3
cd run3
nohup ${AMBERHOME}/bin/pmemd.cuda -O -i ... &
# Option 3: Two dual GPU runs
export CUDA_VISIBLE_DEVICES=0,1
cd run1
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &
export CUDA_VISIBLE_DEVICES=2,3
cd run2
nohup mpirun -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i ... &