AMBER 12 NVIDIA GPU
ACCELERATION SUPPORT



THIS IS AN ARCHIVED PAGE FOR AMBER v12 GPU SUPPORT

IT IS NOT ACTIVELY BEING UPDATED

FOR THE LATEST VERSION OF THESE PAGES PLEASE GO HERE



Benchmarks

Benchmark timings by Mike Wu and Ross Walker.

Download AMBER Benchmark Suite

Please Note: the benchmark timings below are for AMBER 12 up to and including Bugfix.19 (GPU support revision 12.3.1, Aug 15th, 2013).

Machine Specs

Machine 1
CPU = 2 x 6-core Intel E5-2640 (2.5 GHz)
MPICH2 v1.5 - GNU v4.4.7-3
GPU = GTX580 (1.5GB) / GTX680 (4.0GB) / GTX770 (4.0GB) / GTX780 (3.0GB) / GTX-Titan (6.0GB)
nvcc v5.0
NVIDIA Driver Linux 64 - 325.15

Machine 2
CPU = 2 x 8-core Intel E5-2687W @ 3.10 GHz
Motherboard = SuperMicro X9DR3-F
GPU = K10 (2x4GB) / K20 (5GB) / K20X (6GB) / K40 (12GB)
ECC = OFF
nvcc v5.0
NVIDIA Driver Linux 64 - 304.51

Machine 3 (SDSC Gordon)
CPU = 2 x 8-core Intel E5-2670 @ 2.60 GHz
MVAPICH2 v1.8a1p1
Intel Compilers v12.1.0
QDR IB Interconnect

K10 Note: The K10 naming is a little confusing because each K10 card contains two GPUs. In these plots we count K10s by the number of GPUs exposed to the operating system, so 2 x K10 is actually a single K10 card and 8 x K10 means 4 K10 cards.

Code Base = AMBER 12 Release + Bugfixes 1 to 19 - GPU code v12.3.1 (Aug 2013)

Precision Model = SPFP (GPU), Double Precision (CPU)

Benchmarks were run with ECC turned OFF on the GTX/M2090/K10/K20/K40 cards; we have seen no AMBER reliability issues related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, run the following (as root, once for each GPU):

nvidia-smi -g 0 --ecc-config=0    (repeat with -g x for each GPU ID)
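
The same thing can be scripted for every GPU in a node. A minimal bash sketch, assuming the stock nvidia-smi query output (note that an ECC mode change only takes effect after the next reboot or GPU reset):

    # extract each GPU ID from "GPU 0: ..." lines and disable ECC on it
    for id in $(nvidia-smi --list-gpus | awk '{print $2}' | tr -d ':'); do
        nvidia-smi -g "$id" --ecc-config=0
    done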

Boost Clocks: Some of the latest NVIDIA GPUs, such as the K40, support boost clocks, which increase the clock speed if power and temperature headroom is available. Boost should be turned on as follows to obtain optimum performance with AMBER:

sudo nvidia-smi -i 0 -ac 3004,875       (puts device 0 into its highest boost state; these clock values are for the K40)

To return to normal do: sudo nvidia-smi -rac

To allow application clocks to be changed later without root privileges, first run as root: nvidia-smi -acp 0
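
The valid clock pairs differ from one GPU model to another. If you are unsure which values to pass to -ac, the supported combinations can be queried first:

    nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS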

Segfaults in Parallel: If you find that runs across multiple nodes (i.e. using the InfiniBand adapter) segfault almost immediately, this is most likely an issue with GPUDirect v2 (CUDA v4.2/5.0) not being properly supported by your hardware and driver installation. In most cases setting the following environment variable on all nodes (put it in your .bashrc) will fix the problem:

export CUDA_NIC_INTEROP=1
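
For example, to make the setting persistent for bash users on every node (a sketch; adjust for your shell and cluster setup):

    echo 'export CUDA_NIC_INTEROP=1' >> ~/.bashrc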

List of Benchmarks

Explicit Solvent (PME)

  1. DHFR NVE = 23,558 atoms
  2. DHFR NPT = 23,558 atoms
  3. FactorIX NVE = 90,906 atoms
  4. FactorIX NPT = 90,906 atoms
  5. Cellulose NVE = 408,609 atoms
  6. Cellulose NPT = 408,609 atoms

Implicit Solvent (GB)

  1. TRPCage = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file containing the input files for all these benchmarks here (50.3 MB).

Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from the likes of Gromacs and NAMD is that it does NOT rely on the CPU to enhance performance while running on a GPU. This allows one to make extensive use of all of the GPUs in a multi-GPU node with maximum efficiency. It also means one can purchase low-cost CPUs, making GPU-accelerated runs with AMBER substantially more cost effective than similar runs with other GPU-accelerated MD codes.

For example, suppose you have a node with 4 GTX-Titan GPUs in it. With a lot of other MD codes you can use one to four of those GPUs, plus a bunch of CPU cores, for a single job; the remaining GPUs are then not available for additional jobs without hurting the performance of the first job, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used. Thus you have the choice of running a single MD run across multiple GPUs, to maximize the speed of an individual calculation, or alternatively running four completely independent jobs, one on each GPU (see the sketch below). In the latter case each individual run, unlike with a lot of other GPU MD codes, proceeds at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of codes which rely on constant CPU-to-GPU communication. This is illustrated in the plots below showing 'aggregate' throughput.
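
One common way to launch such independent runs is to pin each pmemd.cuda process to its own device with the CUDA_VISIBLE_DEVICES environment variable. A minimal bash sketch (the input and output file names here are hypothetical):

    # launch four independent runs, one per GPU, each seeing only its own device
    for gpu in 0 1 2 3; do
        CUDA_VISIBLE_DEVICES=$gpu nohup $AMBERHOME/bin/pmemd.cuda -O \
            -i mdin -p prmtop -c inpcrd -o md_gpu$gpu.out -r md_gpu$gpu.rst &
    done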



Explicit Solvent PME Benchmarks

1) DHFR NVE = 23,558 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
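
All of the PME benchmarks are run with pmemd.cuda using an input file like the one above. For reference, a typical single-GPU invocation looks like the following (the mdin/prmtop/inpcrd file names follow standard AMBER conventions and may differ in the benchmark suite):

    $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout

while a multi-GPU (or multi-node) run uses the MPI build, e.g. on 2 GPUs:

    mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout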

Single job throughput
(a single run on one or more GPUs and one or more nodes)

Aggregate throughput (GTX-Titan)
 (individual runs at the same time on the same node)

2) DHFR NPT = 23,558 atoms

Typical Production MD NPT
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, 
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=1, tautp=10.0,
   temp0=300.0,
   ntb=2, ntp=1, taup=10.0,
   ioutfm=1,
 /

Single job throughput
(a single run on one or more GPUs and one or more nodes)

3) Factor IX NVE = 90,906 atoms

 Typical Production MD NVE with
 GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
 /
 
Single job throughput
(a single run on one or more GPUs and one or more nodes)

4) Factor IX NPT = 90,906 atoms

Typical Production MD NPT
&cntrl
 ntx=5, irest=1,
 ntc=2, ntf=2, 
 nstlim=10000, 
 ntpr=1000, ntwx=1000,
 ntwr=10000, 
 dt=0.002, cut=8.,
 ntt=1, tautp=10.0,
 temp0=300.0,
 ntb=2, ntp=1, taup=10.0,
 ioutfm=1,
/
 
Single job throughput
(a single run on one or more GPUs and one or more nodes)

5) Cellulose NVE = 408,609 atoms

Typical Production MD NVE with
GOOD energy conservation.
 &cntrl
   ntx=5, irest=1,
   ntc=2, ntf=2, tol=0.000001,
   nstlim=10000, 
   ntpr=1000, ntwx=1000,
   ntwr=10000, 
   dt=0.002, cut=8.,
   ntt=0, ntb=1, ntp=0,
   ioutfm=1,
 /
 &ewald
  dsum_tol=0.000001,
 /
 
Single job throughput
(a single run on one or more GPUs and one or more nodes)

6) Cellulose NPT = 408,609 atoms

Typical Production MD NPT
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2, 
  nstlim=10000, 
  ntpr=1000, ntwx=1000,
  ntwr=10000, 
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, taup=10.0,
  ioutfm=1,
 /

 

Single job throughput
(a single run on one or more GPUs and one or more nodes)



Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=100000,dt=0.002,ntb=0,
  ntf=2,ntc=2,tol=0.000001,
  ntpr=1000, ntwx=1000, ntwr=50000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=0,nscm=0,
/
Note: The TRPCage test is too small to make effective use of the very latest GK110 (GTX780/Titan/K20) GPUs, hence the performance advantage of these cards over earlier-generation cards is not as pronounced as it is for larger GB systems and PME runs.


2) Myoglobin = 2,492 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=10000,dt=0.002,ntb=0,
  ntf=2,ntc=2,tol=0.000001,
  ntpr=1000, ntwx=1000, ntwr=50000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=0,nscm=0,
/


3) Nucleosome = 25,095 atoms

&cntrl
  imin=0,irest=1,ntx=5,
  nstlim=1000,dt=0.002,ntb=0,
  ntf=2,ntc=2,tol=0.000001,
  ntpr=100, ntwx=100, ntwr=50000,
  cut=9999.0, rgbmax=15.0,
  igb=1,ntt=0,nscm=0,
/

