AMBER 14 NVIDIA GPU
ACCELERATION SUPPORT


Benchmarks

Benchmark timings by Ross Walker.

This page provides benchmarks for AMBER v14 (PMEMD) with NVIDIA GPU acceleration. If you are using AMBER v12 please see the archived AMBER version 12 benchmarks.

Download AMBER 14 Benchmark Suite

Please Note: The current benchmark timings here are for AMBER 14 as of release (GPU support revision 14.0.0, April 2014).

Machine Specs

Machine
Exxact AMBER Certified 2U GPU Workstation
CPU = Dual 8-core Intel E5-2660v2 (2.2 GHz), 64 GB DDR3 1600 MHz RAM
(note that the cheaper 6-core E5-2620v2 CPUs would give the same performance for GPU runs)
MPICH2 v1.5 - GNU v4.4.7-3 - Centos 6.5
CUDA Toolkit NVCC v5.0
NVIDIA Driver Linux 64 - 337.12

Code Base = AMBER 14 Release (Apr 2014)

Precision Model = SPFP (GPU), Double Precision (CPU)

Parallel Notes = All multi-GPU runs are intranode with GPU pairs that support peer to peer communication. In the case of the Exxact machine used here this is device IDs 0 & 1 or 2 & 3.
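On such a machine a dual-GPU peer-to-peer run can be pinned to one of those pairs with CUDA_VISIBLE_DEVICES. A minimal sketch, assuming generic input file names (mdin, prmtop, inpcrd are placeholders); the launch line is printed for inspection rather than executed:

```shell
# Sketch: pin a 2-GPU parallel AMBER run to a peer-to-peer capable
# pair (device IDs 0 & 1 on this machine). Input file names are
# placeholders for your own files.
export CUDA_VISIBLE_DEVICES=0,1
launch="mpirun -np 2 pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout"
echo "$launch"   # printed here; run the line itself to start the job
```

Selecting devices 2 & 3 instead (CUDA_VISIBLE_DEVICES=2,3) gives the other peer-to-peer pair on this machine.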

GTX970/980 = Support for GTX970/980 (Maxwell) cards is currently provisional. Performance of a GTX980 is approximately the same as that of a GTX780; a performance update that improves this substantially will be released soon.

4 x K40 numbers = The 4 x K40 single run numbers are from unique hardware, developed by CirraScale, employing a breakout board and 4-way PCI-E gen 3 x16 switches to support nonblocking 4- and 8-way peer to peer. More details about how to purchase such hardware are provided on the recommended hardware page.

K80 and Titan-Z numbers = The K80 and Titan-Z cards consist of two physical GPUs per board, so if you have a system with 4 K80 cards in it nvidia-smi will report 8 GPUs. From an AMBER perspective the GPUs are treated independently: you could run 8 individual single GPU simulations, 4 dual GPU simulations, 2 quad GPU simulations, or any combination in between. For the benchmarks here we report the performance for a single GPU (half a K80 board), dual GPU (a full K80 board) and quad GPU (two full K80 boards in PCI slots attached to the same CPU and thus capable of P2P communication).

Benchmarks were run with ECC turned OFF; we have seen no AMBER reliability issues related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, run the following (for each GPU) as root:

nvidia-smi -g 0 --ecc-config=0    (repeat with -g x for each GPU ID)
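The per-GPU repeat can be scripted. A minimal sketch, assuming a 4-GPU node with device IDs 0-3; the commands are printed for inspection rather than executed, so you can check the device IDs before applying them as root:

```shell
# Sketch: build the ECC-disable command for each GPU on a 4-GPU node
# (device IDs 0-3 assumed). Run the printed lines as root to apply.
cmds=$(for id in 0 1 2 3; do
  echo "nvidia-smi -g $id --ecc-config=0"
done)
echo "$cmds"
```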

Boost Clocks: Some of the latest NVIDIA GPUs, such as the K40 and K80, support boost clocks, which increase the clock speed when power and temperature headroom is available. This should be turned on as follows to obtain optimum performance with AMBER:

K40: sudo nvidia-smi -i 0 -ac 3004,875
K80: sudo nvidia-smi -i 0 -ac 2505,875
which puts device 0 into the highest boost state.

To return to normal do: sudo nvidia-smi -rac

To allow these clock settings to be changed without root privileges, first run (as root): nvidia-smi -acp 0

List of Benchmarks

Explicit Solvent (PME)

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. Factor IX NVE = 90,906 atoms
  6. Factor IX NPT = 90,906 atoms
  7. Cellulose NVE = 408,609 atoms
  8. Cellulose NPT = 408,609 atoms

Implicit Solvent (GB)

  1. TRPCage = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file containing the input files for all these benchmarks here (44.0 MB)

Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from the likes of Gromacs and NAMD is that it does NOT rely on the CPU to enhance performance while running on a GPU. This allows one to make extensive use of all of the GPUs in a multi-GPU node with maximum efficiency. It also means one can purchase low-cost CPUs, making GPU-accelerated runs with AMBER substantially more cost effective than similar runs with other GPU-accelerated MD codes.

For example, suppose you have a node with 4 GTX-Titan-Black GPUs. With many other MD codes you can use one to four of those GPUs, plus a number of CPU cores, for a single job; the remaining GPUs are then not available for additional jobs without hurting the performance of the first, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used, so you can either run a single MD calculation across multiple GPUs to maximize throughput on that calculation, or run four completely independent jobs, one on each GPU. In the latter case each individual run, unlike with many other GPU MD codes, proceeds at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of codes which rely on constant CPU-to-GPU communication. This is illustrated below in the plots showing 'aggregate' throughput.
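The four-independent-jobs case can be scripted by giving each process its own device via CUDA_VISIBLE_DEVICES. A sketch assuming a 4-GPU node and generic input file names (mdin, prmtop, inpcrd are placeholders); the launch lines are printed rather than executed:

```shell
# Sketch: one independent single-GPU pmemd.cuda job per GPU on a
# 4-GPU node. Each job sees only its own device, so all four run at
# full speed. Input/output file names are placeholders.
jobs=$(for gpu in 0 1 2 3; do
  echo "CUDA_VISIBLE_DEVICES=$gpu pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout.gpu$gpu &"
done)
echo "$jobs"   # run these lines (followed by 'wait') to launch all four
```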



Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

 Typical Production MD NVE with
 GOOD energy conservation, HMR, 4fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

Aggregate throughput (GTX-Titan-Black)
 (individual runs at the same time on the same node)


 

2) DHFR NPT HMR 4fs = 23,558 atoms

Typical Production MD NPT, MC Bar 4fs HMR
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
 good energy conservation, 2fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

4) DHFR NPT 2fs = 23,558 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

5) Factor IX NVE = 90,906 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=15000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)


 

6) Factor IX NPT = 90,906 atoms

Typical Production MD NPT, MC Bar 2fs
&cntrl
 ntx=5, irest=1,
 ntc=2, ntf=2, 
 nstlim=15000, 
 ntpr=1000, ntwx=1000,
 ntwr=10000, 
 dt=0.002, cut=8.,
 ntt=1, tautp=10.0,
 temp0=300.0,
 ntb=2, ntp=1, barostat=2,
 ioutfm=1,
/
 
Single job throughput
(a single run on one or more GPUs within a single node)

 

7) Cellulose NVE = 408,609 atoms

Typical Production MD NVE with
GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)

 

8) Cellulose NPT = 408,609 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=10000, 
  ntpr=1000, ntwx=1000,
  ntwr=10000, 
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

 



Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=500000,dt=0.002,
  ntf=2,ntc=2,
  ntt=1, tautp=0.5,
  tempi=325.0, temp0=325.0,
  ntpr=1000, ntwx=1000, ntwr=50000,
  ntb=0, igb=1,
  cut=9999., rgbmax=9999.,
/
Note: The TRPCage test is too small to make effective use of the very latest GK110 (GTX780/Titan/K20/K40) GPUs, hence the performance advantage of these cards over earlier generation cards is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.


*TRPCage is too small to effectively scale to multiple modern GPUs

2) Myoglobin = 2,492 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=50000,dt=0.002,ntb=0,
  ntf=2,ntc=2,
  ntpr=1000, ntwx=1000, ntwr=10000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=3,gamma_ln=1.0,nscm=0,
  temp0=300.0,ig=-1,
/

Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.


*Myoglobin is too small to effectively scale to multiple modern GPUs.

3) Nucleosome = 25,095 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=1000,dt=0.002,
  ntf=2,ntc=2,ntb=0,
  igb=5,cut=9999.0,rgbmax=15.0,
  ntpr=200, ntwx=200, ntwr=1000,
  saltcon=0.1,
  temp0=310.0,
  ntt=1,tautp=1.0,
  nscm=0,
/

