AMBER 16 NVIDIA GPU
ACCELERATION SUPPORT


This page describes AMBER 16 GPU support.
If you are using AMBER 14 please see the archived AMBER 14 page here.

Benchmarks

Benchmark timings by Ross Walker.

This page provides benchmarks for AMBER v16 (PMEMD) with NVIDIA GPU acceleration. If you are using AMBER v14 please see the archived AMBER version 14 benchmarks. If you are using AMBER v12 please see the archived AMBER version 12 benchmarks.

Download AMBER 16 Benchmark Suite

Please Note: The benchmark timings here are for the current version of AMBER 16 (GPU support revision 16.0.0, April 2016).

Machine Specs

Machine
Exxact AMBER Certified 2U GPU Workstation
CPU = Dual x 8 Core Intel E5-2650v3 (2.3GHz), 64 GB DDR4 Ram
(note the cheaper 6 Core E5-2620v3 and v4 CPUs would also give the same performance for GPU runs)
MPICH v3.1.4 - GNU v4.8.5 - Centos 7.2
CUDA Toolkit NVCC v7.5 (8.0RC1 for GTX-1070, GTX-1080 & Titan-XP)
NVIDIA Driver Linux 64 - 361.43

Code Base = AMBER 16 Release Version (GPU 16.0.0 Apr 2016)

Precision Model = SPFP (GPU), Double Precision (CPU)

Parallel Notes = All multi-GPU runs are intranode with GPU pairs that support peer-to-peer communication. In the case of the Exxact machine used here, these are device IDs 0 & 1 or 2 & 3.
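As a sketch of how such a run is pinned to a peer-to-peer capable pair, the device IDs can be selected with CUDA_VISIBLE_DEVICES before launching pmemd.cuda.MPI (the mdin/prmtop/inpcrd file names are placeholders; the launch line is commented out since it needs an AMBER 16 install and two P2P-capable GPUs):

```shell
# Restrict the run to a peer-to-peer capable pair (device IDs 0 & 1 here);
# pmemd.cuda.MPI numbers its GPUs within this visible set.
export CUDA_VISIBLE_DEVICES=0,1

# Launch the 2-GPU run (hypothetical file names):
# mpirun -np 2 pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout

echo "GPUs visible to pmemd: $CUDA_VISIBLE_DEVICES"
```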

Pascal Titan-X naming = In a completely nonsensical and unexplained move, NVIDIA chose to name the latest Pascal-based high-end GPU (GP102) Titan-X, reusing the name GTX-Titan-X from the previous Maxwell generation but dropping the GTX prefix. This is unnecessarily confusing to end users. To avoid this confusion we refer to the new Pascal-based Titan-X GPU as Titan-XP and the previous Maxwell-based Titan-X GPU as Titan-X.

Tesla Performance: These benchmarks mostly show performance numbers for GeForce GPUs, since these tend to provide superior price/performance over Tesla GPUs; GeForce GPUs are therefore currently the recommended GPUs for use with AMBER. To avoid clutter, only a limited selection of Tesla models is included here. For performance on older Tesla hardware (M2090, K10, K20, K40, etc.) please refer to the archived AMBER 14 and AMBER 12 pages.

DGX-1 / P100 Caution = No testing has been done on DGX-1 or P100 / GP100 based systems. At the present time, Tesla-based Pascal GPUs are not supported by AMBER 16.

GTX-1080 Caution = The GTX-1080 requires driver version >= 367.27 to give correct numerical results. CUDA 8.0 is also required for optimum performance. Since CUDA 8.0 is not yet publicly released, you will need to register as an NVIDIA developer in order to download the current release candidate.

ECC = Where applicable, benchmarks were run with ECC turned OFF - we have seen no AMBER reliability issues related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, run the following (for each GPU) as root:

nvidia-smi -g 0 --ecc-config=0    (repeat with -g x for each GPU ID)
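A minimal sketch of that per-GPU loop (assuming four GPUs with device IDs 0-3; the actual nvidia-smi call is commented out since it must be run as root, and the mode change typically takes effect only after a reboot):

```shell
# Disable ECC on each GPU in turn (device IDs 0-3 assumed).
for id in 0 1 2 3; do
  # nvidia-smi -g "$id" --ecc-config=0   # run as root; applies after reboot
  echo "GPU $id: ECC off requested"
done
```

Current ECC state for all GPUs can be inspected beforehand with `nvidia-smi -q -d ECC`.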

List of Benchmarks

Explicit Solvent (PME)

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. FactorIX NVE = 90,906 atoms
  6. FactorIX NPT = 90,906 atoms
  7. Cellulose NVE = 408,609 atoms
  8. Cellulose NPT = 408,609 atoms
  9. STMV NPT HMR 4fs = 1,067,095 atoms

Implicit Solvent (GB)

  1. TRPCage = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file containing the input files for all these benchmarks here (84.1 MB)

Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from the likes of Gromacs and NAMD is that it does NOT rely on the CPU to enhance performance while running on a GPU. This allows one to make extensive use of all of the GPUs in a multi-GPU node with maximum efficiency. It also means one can purchase low cost CPUs making GPU accelerated runs with AMBER substantially more cost effective than similar runs with other GPU accelerated MD codes.

For example, suppose you have a node with four GTX-Titan-X GPUs. With many other MD codes you can use one to four of those GPUs, plus a bunch of CPU cores, for a single job; the remaining GPUs are then not available for additional jobs without hurting the performance of the first, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used, so you have the choice of running a single MD run across multiple GPUs to maximize throughput on one calculation, or running four completely independent jobs, one on each GPU. In the latter case each individual run, unlike with many other GPU MD codes, still runs at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of codes that rely on constant CPU-to-GPU communication. This is illustrated below in the plots showing 'aggregate' throughput.
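Such a four-independent-jobs setup can be sketched as below (per-job directory and file names are hypothetical; the pmemd.cuda line is commented out since it needs an AMBER install and four GPUs):

```shell
# One independent single-GPU pmemd.cuda job per device; each job sees
# only its own GPU via CUDA_VISIBLE_DEVICES.
for gpu in 0 1 2 3; do
  # (cd run$gpu && CUDA_VISIBLE_DEVICES=$gpu pmemd.cuda -O -i mdin \
  #    -p prmtop -c inpcrd -o mdout &)   # hypothetical per-job directories
  echo "launched job on GPU $gpu"
done
wait   # block until all background jobs finish
```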


 


Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

 Typical Production MD NVE with
 GOOD energy conservation, HMR, 4fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
   dsum_tol=0.000001,
 /
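As a point of reference, nstlim=75000 steps at dt=0.004 ps means each run of this benchmark covers 300 ps of simulation, and a run's wall-clock time converts to ns/day throughput as sketched below (the 180 s wall time is a made-up example, not a measured result):

```shell
# ns/day = simulated_ns * 86400 / wall_seconds
awk 'BEGIN {
  sim_ns = 75000 * 0.004 / 1000   # nstlim * dt(ps) -> 0.3 ns per run
  wall_s = 180                    # hypothetical wall-clock time in seconds
  printf "%.1f ns/day\n", sim_ns * 86400 / wall_s
}'
```

For the hypothetical 180 s timing this prints 144.0 ns/day.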

Single job throughput
(a single run on one or more GPUs within a single node)


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

Aggregate throughput (GTX-Titan-X)
 (individual runs at the same time on the same node)


 

2) DHFR NPT HMR 4fs = 23,558 atoms

Typical Production MD NPT, MC Bar 4fs HMR
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.004, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
 good energy conservation, 2fs.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

4) DHFR NPT 2fs = 23,558 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=75000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, barostat=2,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs within a single node)


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

5) Factor IX NVE = 90,906 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=15000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

6) Factor IX NPT = 90,906 atoms

Typical Production MD NPT, MC Bar 2fs
&cntrl
 ntx=5, irest=1,
 ntc=2, ntf=2, 
 nstlim=15000, 
 ntpr=1000, ntwx=1000,
 ntwr=10000, 
 dt=0.002, cut=8.,
 ntt=1, tautp=10.0,
 temp0=300.0,
 ntb=2, ntp=1, barostat=2,
 ioutfm=1,
/
 
Single job throughput
(a single run on one or more GPUs within a single node)

* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

7) Cellulose NVE = 408,609 atoms

Typical Production MD NVE with
GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
 
Single job throughput
(a single run on one or more GPUs within a single node)

* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

8) Cellulose NPT = 408,609 atoms

Typical Production MD NPT, MC Bar 2fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=10000, 
  ntpr=1000, ntwx=1000,
  ntwr=10000, 
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.

9) STMV NPT HMR 4fs = 1,067,095 atoms

Typical Production MD NPT, HMR, MC Bar 4fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=4000, 
  ntpr=1000, ntwx=1000,
  ntwr=4000, 
  dt=0.004, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs within a single node)

* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.



Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=500000,dt=0.002,
  ntf=2,ntc=2,
  ntt=1, tautp=0.5,
  tempi=325.0, temp0=325.0,
  ntpr=1000, ntwx=1000, ntwr=50000,
  ntb=0, igb=1,
  cut=9999., rgbmax=9999.,
/
Note: The TRPCage test is too small to make effective use of the very latest GPUs, hence the performance gain on these cards over earlier-generation cards is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.
TRPCage is too small to scale effectively to modern GPUs.
 

2) Myoglobin = 2,492 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=50000,dt=0.002,ntb=0,
  ntf=2,ntc=2,
  ntpr=1000, ntwx=1000, ntwr=10000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=3,gamma_ln=1.0,nscm=0,
  temp0=300.0,ig=-1,
/

Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.
Myoglobin is too small to effectively scale to multiple modern GPUs.
 

3) Nucleosome = 25,095 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=1000,dt=0.002,
  ntf=2,ntc=2,ntb=0,
  igb=5,cut=9999.0,rgbmax=15.0,
  ntpr=200, ntwx=200, ntwr=1000,
  saltcon=0.1,
  temp0=310.0,
  ntt=1,tautp=1.0,
  nscm=0,
/

 


* Pascal Titan-X [Titan-XP], GTX-1080 and GTX-1070 numbers require CUDA 8.0.
