AMBER 16 NVIDIA GPU
ACCELERATION SUPPORT


This page describes AMBER 16 GPU support.
If you are using AMBER 14, please see the archived AMBER 14 page here.

Benchmarks

Benchmark timings by Ross Walker.

This page provides benchmarks for AMBER v16 (PMEMD) with NVIDIA GPU acceleration. If you are using AMBER v14, please see the archived AMBER version 14 benchmarks. If you are using AMBER v12, please see the archived AMBER version 12 benchmarks.

Download AMBER 16 Benchmark Suite

Please Note: The benchmark timings here are for the current version of AMBER 16 (GPU support revision 16.0.0, April 2016).

Machine Specs

Machine = Exxact AMBER Certified 2U GPU Workstation
CPU = Dual 8-core Intel E5-2650v3 (2.3 GHz), 64 GB DDR4 RAM
(note: the cheaper 6-core E5-2620v3 and v4 CPUs would give the same performance for GPU runs)
Software = MPICH v3.1.4 - GNU v4.8.5 - CentOS 7.2
CUDA Toolkit = NVCC v7.5 (8.0 RC1 for GTX-1080)
NVIDIA Driver = Linux 64-bit, 361.43

Code Base = AMBER 16 Release Version (GPU 16.0.0 Apr 2016)

Precision Model = SPFP (GPU), Double Precision (CPU)

Parallel Notes = All multi-GPU runs are intranode with GPU pairs that support peer-to-peer communication. In the case of the Exxact machine used here these are device IDs 0 & 1 or 2 & 3.
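For example, a minimal sketch of selecting a peer-to-peer capable pair and launching a dual-GPU run (device IDs 0 and 1 and the file names are illustrative; check the topology of your own node first):

# Show the PCI-E topology so P2P-capable GPU pairs can be identified
nvidia-smi topo -m

# Restrict the run to one P2P-capable pair (IDs are an example) and launch
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout -r restrt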

4 x K40 / Titan-X numbers = The 4 x K40 and Titan-X single run numbers are from unique hardware, developed in collaboration with Exxact, employing 2 x 4-way PCI-E gen 3 x16 switches to support non-blocking 4- and 8-way peer-to-peer communication. More details about how to purchase such hardware are provided on the recommended hardware page.

K80 = The K80 cards consist of two physical GPUs per board. Thus, if you have a system with 4 K80 cards in it, nvidia-smi will report 8 GPUs. From an AMBER perspective the GPUs are treated independently, so you could run 8 individual single-GPU simulations, 4 dual-GPU simulations, 2 quad-GPU simulations, or any combination in between. For the benchmarks here we report the performance for a single GPU (half a K80 board), a dual-GPU run (a full K80 board) and a quad-GPU run (two full K80 boards in PCI slots attached to the same CPU and thus capable of P2P communication).

DGX-1 / P100 Cautions = Due to significant supply issues, NVIDIA has been unable to provide us with access to Pascal P100 test hardware, despite its having been publicly announced. As such, P100-based Pascal hardware is untested at this time. The DGX-1 performance numbers provided on this page were supplied by NVIDIA and have not been verified by the AMBER developers.

GTX-1080 Caution = The GTX-1080 requires driver version >= 367.27 to give correct numerical results. CUDA 8.0 is also required for optimum performance. Since CUDA 8.0 has not yet been publicly released, you will need to register as an NVIDIA developer to download the current release candidate.
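One way to confirm which driver is installed (the query field below is standard nvidia-smi syntax):

# Print the installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader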

Benchmarks were run with ECC turned OFF - we have seen no issues with AMBER reliability related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, run the following (for each GPU) as root:

nvidia-smi -g 0 --ecc-config=0    (repeat with -g x for each GPU ID)
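On a multi-GPU node the same change can be scripted, for example (a sketch assuming four GPUs with IDs 0-3; the new ECC setting takes effect after the next reboot or GPU reset):

# Disable ECC on every GPU in the node (run as root)
for id in 0 1 2 3; do
    nvidia-smi -g $id --ecc-config=0
done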

Boost Clocks: Some of the latest NVIDIA GPUs, such as the K40 and K80, support boost clocks, which increase the clock speed when power and temperature headroom is available. Boost should be turned on as follows for optimum performance with AMBER:

K40: sudo nvidia-smi -i 0 -ac 3004,875
K80: sudo nvidia-smi -i 0 -ac 2505,875
These commands put device 0 into the highest boost state.

To return to the default clocks run: sudo nvidia-smi -rac

To allow these clocks to be changed without root privileges, first run (as root): nvidia-smi -acp 0

List of Benchmarks

Explicit Solvent (PME)

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. Factor IX NVE = 90,906 atoms
  6. Factor IX NPT = 90,906 atoms
  7. Cellulose NVE = 408,609 atoms
  8. Cellulose NPT = 408,609 atoms
  9. STMV NPT HMR 4fs = 1,067,095 atoms

Implicit Solvent (GB)

  1. TRPCage = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file containing the input files for all these benchmarks here (84.1 MB).

Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from codes such as Gromacs and NAMD is that it does NOT rely on the CPU to enhance performance while running on a GPU. This allows one to make extensive use of all of the GPUs in a multi-GPU node with maximum efficiency. It also means one can purchase low-cost CPUs, making GPU-accelerated runs with AMBER substantially more cost effective than similar runs with other GPU-accelerated MD codes.

For example, suppose you have a node with 4 GTX-Titan-X GPUs in it. With many other MD codes you can use one to four of those GPUs, plus a number of CPU cores, for a single job; however, the remaining GPUs are then not available for additional jobs without hurting the performance of the first, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used, so you have the choice of running a single MD calculation across multiple GPUs to maximize throughput on one job, or of running four completely independent jobs, one on each GPU. In the latter case each individual run, unlike with many other GPU MD codes, proceeds at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of codes that rely on constant CPU-to-GPU communication. This is illustrated below in the plots showing 'aggregate' throughput.
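As an illustration, a minimal sketch of filling a 4-GPU node with four independent single-GPU AMBER runs (directory and file names are placeholders):

# Launch one independent pmemd.cuda job per GPU; each job sees only its own device
for id in 0 1 2 3; do
    cd run_$id
    CUDA_VISIBLE_DEVICES=$id $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout -r restrt &
    cd ..
done
wait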


 


Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

 Typical Production MD NVE with
 GOOD energy conservation, HMR, 4fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
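The benchmark inputs above are run with pmemd.cuda in the usual way; a single-GPU run of this DHFR case would look roughly like the following (file names are placeholders for those shipped in the benchmark suite):

# Single-GPU run of the DHFR NVE HMR benchmark (names are placeholders)
$AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout -x mdcrd -r restrt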

Single job throughput
(a single run on one or more GPUs within a single node)


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

Aggregate throughput (GTX-Titan-X)
 (individual runs at the same time on the same node)


 

2) DHFR NPT HMR 4fs = 23,558 atoms

Typical Production MD NPT, MC Bar 4fs HMR
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
 good energy conservation, 2fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

4) DHFR NPT 2fs = 23,558 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

5) Factor IX NVE = 90,906 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=15000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

6) Factor IX NPT = 90,906 atoms

Typical Production MD NPT, MC Bar 2fs
&cntrl
 ntx=5, irest=1,
 ntc=2, ntf=2, 
 nstlim=15000, 
 ntpr=1000, ntwx=1000,
 ntwr=10000, 
 dt=0.002, cut=8.,
 ntt=1, tautp=10.0,
 temp0=300.0,
 ntb=2, ntp=1, barostat=2,
 ioutfm=1,
/
 
Single job throughput
(a single run on one or more GPUs within a single node)

* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

7) Cellulose NVE = 408,609 atoms

Typical Production MD NVE with
GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)

* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

8) Cellulose NPT = 408,609 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=10000, 
  ntpr=1000, ntwx=1000,
  ntwr=10000, 
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.

9) STMV NPT HMR 4fs = 1,067,095 atoms

Typical Production MD NPT, HMR, MC Bar 4fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=4000, 
  ntpr=1000, ntwx=1000,
  ntwr=4000, 
  dt=0.004, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.



Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=500000,dt=0.002,
  ntf=2,ntc=2,
  ntt=1, tautp=0.5,
  tempi=325.0, temp0=325.0,
  ntpr=1000, ntwx=1000, ntwr=50000,
  ntb=0, igb=1,
  cut=9999., rgbmax=9999.,
/
Note: The TRPCage test is too small to make effective use of the very latest GPUs; hence the performance advantage of these cards over earlier-generation cards is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.
TRPCage is too small to effectively scale to multiple modern GPUs.
 

2) Myoglobin = 2,492 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=50000,dt=0.002,ntb=0,
  ntf=2,ntc=2,
  ntpr=1000, ntwx=1000, ntwr=10000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=3,gamma_ln=1.0,nscm=0,
  temp0=300.0,ig=-1,
/

Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.
Myoglobin is too small to effectively scale to multiple modern GPUs.
 

3) Nucleosome = 25,095 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=1000,dt=0.002,
  ntf=2,ntc=2,ntb=0,
  igb=5,cut=9999.0,rgbmax=15.0,
  ntpr=200, ntwx=200, ntwr=1000,
  saltcon=0.1,
  temp0=310.0,
  ntt=1,tautp=1.0,
  nscm=0,
/

 


* DGX-1 numbers are unverified and the hardware should be considered experimental at this time. See above.
* GTX-1080 numbers require CUDA 8.0.
