AMBER 16 GPU ACCELERATION SUPPORT


This page describes AMBER 16 GPU support.
If you are using AMBER 14, please see the archived AMBER 14 page here.

Benchmarks

Benchmark timings by Ross Walker.

This page provides benchmarks for AMBER v16 (PMEMD) with GPU acceleration as of update.8 [Jan 2018]. If you are using AMBER v14, please see the archived AMBER version 14 benchmarks. If you are using AMBER v12, please see the archived AMBER version 12 benchmarks.

Download AMBER 16 Benchmark Suite

Machine Specs

Machine = Exxact AMBER Certified 2U GPU Workstation
CPUs = Dual 8-core Intel E5-2640v4 (2.2 GHz), 64 GB DDR4 RAM
(note: the cheaper E5-2620v4 CPUs would give the same performance for GPU runs)
Software = MPICH v3.1.4 - GNU v5.4.0 - CentOS 7.4
CUDA Toolkit = NVCC v9.0
NVIDIA Driver = Linux 64-bit, 384.98

Code Base = AMBER 16 + Updates as of Jan 2018

Precision Model = SPFP (GPU), Double Precision (CPU)
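
The SPFP precision model is the default for the GPU code; in a typical AMBER 16 installation the precision model is selected simply by which binary you run. The listing below is a sketch of the usual binary names (check your own $AMBERHOME/bin, as the exact set can differ between builds):

  # Typical AMBER 16 GPU binaries (assumed standard install layout):
  #   pmemd.cuda        - default single-GPU build, SPFP precision (used for these benchmarks)
  #   pmemd.cuda_DPFP   - full double-precision GPU build, mainly for validation
  #   pmemd.cuda.MPI    - multi-GPU SPFP build, launched through mpirun
  ls $AMBERHOME/bin/pmemd.cuda*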

Parallel Notes = All multi-GPU runs are intranode, using GPU pairs that support peer-to-peer communication. In the case of the Exxact machine used here these are device IDs 0 & 1 or 2 & 3.
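
If you are unsure which device IDs on your own node form peer-to-peer capable pairs, the following is a simple way to check (a sketch; the exact legend printed depends on your driver version):

  # Show the PCI-E topology matrix. Pairs reported as PIX or PXB (same PCI-E
  # root complex) generally support peer-to-peer, while pairs that traverse
  # the CPU interconnect (SYS, or SOC on older drivers) do not.
  nvidia-smi topo -m

  # Restrict a 2-GPU AMBER run to a peer-to-peer capable pair, e.g. IDs 0 and 1:
  export CUDA_VISIBLE_DEVICES=0,1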

Pascal Titan-X naming = NVIDIA named its latest Pascal-based high-end GPU (GP102) 'Titan-X', reusing the name of the previous Maxwell-generation GTX-Titan-X but dropping the 'GTX' prefix. To avoid confusion we refer to the new Pascal-based card as Titan-XP (a naming convention NVIDIA itself later adopted for the updated Titan-X) and to the previous Maxwell-based card as Titan-X.

ECC = Where applicable, benchmarks were run with ECC turned OFF; we have seen no AMBER reliability issues related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, run the following (for each GPU) as root:

nvidia-smi -g 0 --ecc-config=0    (repeat with -g x for each GPU ID)
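
For nodes with several GPUs, a small loop saves typing. The sketch below assumes a reasonably recent nvidia-smi (the -i/-e flags are the current spellings of -g/--ecc-config) and that the machine is rebooted afterwards, since ECC mode changes only take effect after a reboot:

  # Disable ECC on every GPU in the node (run as root; reboot to apply).
  for id in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
      nvidia-smi -i "$id" -e 0
  done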

List of Benchmarks

Explicit Solvent (PME)

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. Factor IX NVE = 90,906 atoms
  6. Factor IX NPT = 90,906 atoms
  7. Cellulose NVE = 408,609 atoms
  8. Cellulose NPT = 408,609 atoms
  9. STMV NPT HMR 4fs = 1,067,095 atoms

Implicit Solvent (GB)

  1. TRPCage = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file containing the input files for all these benchmarks here (84.1 MB).

Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from codes such as GROMACS and NAMD is that it does NOT rely on the CPU to enhance performance while running on a GPU. This makes it possible to use all of the GPUs in a multi-GPU node with maximum efficiency. It also means one can purchase low-cost CPUs, making GPU-accelerated runs with AMBER substantially more cost effective than similar runs with other GPU-accelerated MD codes.

For example, suppose you have a node with 4 GTX-Titan-X GPUs. With many other MD codes you can use one to four of those GPUs, plus a number of CPU cores, for a single job; the remaining GPUs are then not available for additional jobs without hurting the performance of the first job, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used. You therefore have the choice of running a single MD calculation across multiple GPUs, to maximize throughput on that one calculation, or running four completely independent jobs, one on each GPU. In the latter case each individual run, unlike with many other GPU MD codes, still runs at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of codes that rely on constant CPU-to-GPU communication. This is illustrated below in the plots showing 'aggregate' throughput.
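
As an illustration, the snippet below is one way the four-independent-jobs case might be launched; the directory and file names (run0..run3, mdin, prmtop, inpcrd) are placeholders for your own inputs:

  # Launch four independent single-GPU AMBER jobs, one pinned to each GPU.
  for gpu in 0 1 2 3; do
    (
      cd run$gpu
      export CUDA_VISIBLE_DEVICES=$gpu
      nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
            -o mdout -r restrt -x mdcrd > pmemd.log 2>&1 &
    )
  done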


 


Price / Performance

Before looking at the raw throughput of each benchmark on different GPU models, it is useful to consider price/performance, since NVIDIA GPU prices span a very large range, from the cost-effective GeForce cards to the eye-wateringly expensive Tesla V100 cards. The following plot shows the price/performance ratio relative to the GTX-1080 GPU for current GeForce and Tesla GPUs at prices as of Jan 2018. Smaller is better.

 

Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

 Typical Production MD NVE with
 GOOD energy conservation, HMR, 4fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
   dsum_tol=0.000001,
 /
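
For reference, a single-GPU run of this benchmark, and the corresponding 2-GPU run on a peer-to-peer pair, might be launched as follows (a sketch; the input, topology and coordinate file names are placeholders for those shipped in the benchmark suite):

  # Single GPU (device 0):
  export CUDA_VISIBLE_DEVICES=0
  $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout

  # Two GPUs forming a peer-to-peer pair (devices 0 and 1):
  export CUDA_VISIBLE_DEVICES=0,1
  mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout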

Single job throughput
(a single run on one or more GPUs within a single node)


 

Aggregate throughput (GTX-Titan-XP)
 (individual runs at the same time on the same node)


 

2) DHFR NPT HMR 4fs = 23,558 atoms

Typical Production MD NPT, MC Bar 4fs HMR
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
 good energy conservation, 2fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

4) DHFR NPT 2fs = 23,558 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


 

5) Factor IX NVE = 90,906 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=15000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)


 

6) Factor IX NPT = 90,906 atoms

Typical Production MD NPT, MC Bar 2fs
&cntrl
 ntx=5, irest=1,
 ntc=2, ntf=2, 
 nstlim=15000, 
 ntpr=1000, ntwx=1000,
 ntwr=10000, 
 dt=0.002, cut=8.,
 ntt=1, tautp=10.0,
 temp0=300.0,
 ntb=2, ntp=1, barostat=2,
 ioutfm=1,
/
 
Single job throughput
(a single run on one or more GPUs within a single node)

 

7) Cellulose NVE = 408,609 atoms

Typical Production MD NVE with
GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)

 

8) Cellulose NPT = 408,609 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=10000, 
  ntpr=1000, ntwx=1000,
  ntwr=10000, 
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

 

9) STMV NPT HMR 4fs = 1,067,095 atoms

Typical Production MD NPT, HMR, MC Bar 4fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=4000, 
  ntpr=1000, ntwx=1000,
  ntwr=4000, 
  dt=0.004, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

 



Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=500000,dt=0.002,
  ntf=2,ntc=2,
  ntt=1, tautp=0.5,
  tempi=325.0, temp0=325.0,
  ntpr=1000, ntwx=1000, ntwr=50000,
  ntb=0, igb=1,
  cut=9999., rgbmax=9999.,
/
Note: The TRPCage test is too small to make effective use of the very latest GPUs, hence the performance advantage of these cards over earlier generations is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.


TRPCage is too small to effectively scale to modern GPUs
 

2) Myoglobin = 2,492 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=50000,dt=0.002,ntb=0,
  ntf=2,ntc=2,
  ntpr=1000, ntwx=1000, ntwr=10000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=3,gamma_ln=1.0,nscm=0,
  temp0=300.0,ig=-1,
/

Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.


Myoglobin is too small to effectively scale to multiple modern GPUs.
 

3) Nucleosome = 25,095 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=1000,dt=0.002,
  ntf=2,ntc=2,ntb=0,
  igb=5,cut=9999.0,rgbmax=15.0,
  ntpr=200, ntwx=200, ntwr=1000,
  saltcon=0.1,
  temp0=310.0,
  ntt=1,tautp=1.0,
  nscm=0,
/

 


 
