AMBER 12 NVIDIA GPU ACCELERATION SUPPORT
THIS IS AN ARCHIVED PAGE FOR AMBER v12 GPU SUPPORT
IT IS NOT ACTIVELY BEING UPDATED
FOR THE LATEST VERSION OF THESE PAGES PLEASE GO HERE
Benchmarks
Benchmark timings by Mike Wu and Ross Walker.
Download AMBER Benchmark Suite
Please Note: The current benchmark timings here are for AMBER 12 up to and including Bugfix.19 (GPU support revision 12.3.1, Aug 15th 2013).
Machine Specs
Machine 1
CPU = Dual x 6 Core Intel E5-2640 (2.5GHz)
MPICH2 v1.5 - GNU v4.4.7-3
GPU = GTX580 (1.5GB) / GTX680 (4.0GB) / GTX770 (4.0GB) / GTX780 (3.0GB) / GTX-Titan (6.0GB)
nvcc v5.0
NVIDIA Driver Linux 64 - 325.15
Machine 2
CPU = Dual x 8 Core Intel E5-2687W @ 3.10 GHz
Motherboard = SuperMicro X9DR3-F
GPU = K10 (2x4GB) / K20 (5GB) / K20X (6GB) / K40 (12GB)
ECC = OFF
nvcc v5.0
NVIDIA Driver Linux 64 - 304.51
Machine 3 (SDSC Gordon)
CPU = Dual x 8 Core Intel E5-2670 @ 2.60GHz
MVAPICH2 v1.8a1p1
Intel Compilers v12.1.0
QDR IB Interconnect
K10 Note: The K10 naming is a little confusing. In these plots K10 counts refer to the number of GPUs exposed to the operating system, and each K10 card contains two GPUs. Thus 2 x K10 is actually a single K10 card and 8 x K10 means 4 K10 cards.
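If in doubt, a quick way to see this count (an illustrative check, not part of the original benchmark setup) is to list the devices the operating system sees:
nvidia-smi -L            # one line per GPU device; a single K10 card contributes two
nvidia-smi -L | wc -l    # total GPU count as used in the plot labels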
Code Base = AMBER 12 Release + Bugfixes 1 to 19 - GPU code v12.3.1 (Aug 2013)
Precision Model = SPFP (GPU), Double Precision (CPU)
Benchmarks were run with ECC turned OFF on GTX/M2090/K10/K20/K40 cards - we have seen no issues with AMBER reliability related to ECC being on or off. If you see approximately 10% lower performance than the numbers here, ECC is most likely still enabled; turn it off by running the following (as root) for each GPU:
nvidia-smi -g 0 --ecc-config=0
(repeat with -g x for each GPU ID)
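If the node has several GPUs, a small shell sketch along these lines can handle them all at once (the loop and the -i device selector are illustrative, not part of the original instructions; note that an ECC mode change only takes effect after the next reboot):
# Show the current and pending ECC mode for every GPU
nvidia-smi -q -d ECC

# Disable ECC on each GPU in turn (run as root, then reboot)
num_gpus=$(nvidia-smi -L | wc -l)
for (( id=0; id<num_gpus; id++ )); do
    nvidia-smi -i $id --ecc-config=0
done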
Boost Clocks: Some of the latest NVIDIA GPUs, such as the K40, support boost clocks which increase the clock speed if power and temperature headroom is available. This should be turned on as follows to enable optimum performance with AMBER:
sudo nvidia-smi -i 0 -ac 3004,875
which puts device 0 into the highest boost state.
To return to normal do: sudo nvidia-smi -rac
To allow this setting to be changed without being root, first run (as root): nvidia-smi -acp 0
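The 3004,875 memory,graphics pair above is specific to the K40. As a rough sketch (illustrative commands, not from the original instructions), the clock pairs a given card supports, and the application clocks currently in force, can be checked before choosing values:
# List the supported memory,graphics application clock combinations for device 0
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS

# Show the clocks currently applied to device 0
nvidia-smi -i 0 -q -d CLOCK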
Segfaults in Parallel: If you find that runs across multiple nodes (i.e. using the infiniband adapter) segfault almost immediately then this is most likely an issue with GPU Direct v2 (CUDA v4.2/5.0) not being properly supported by your hardware and driver installations. In most cases setting the following environment variable on all nodes (put it in your .bashrc) will fix the problem:
export CUDA_NIC_INTEROP=1
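A multi-node job would then be launched along the following lines (a sketch only: pmemd.cuda.MPI is the parallel GPU executable shipped with AMBER 12, while the process count, launcher flags and file names are placeholders to adapt to your cluster):
export CUDA_NIC_INTEROP=1      # set on every node, e.g. via .bashrc
mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd \
    -o mdout -r restrt -x mdcrd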
List of Benchmarks
Explicit Solvent (PME)
- DHFR NVE = 23,558 atoms
- DHFR NPT = 23,558 atoms
- FactorIX NVE = 90,906 atoms
- FactorIX NPT = 90,906 atoms
- Cellulose NVE = 408,609 atoms
- Cellulose NPT = 408,609 atoms
Implicit Solvent (GB)
- TRPCage = 304 atoms
- Myoglobin = 2,492 atoms
- Nucleosome = 25,095 atoms
You can download a tar file containing the input files for all these benchmarks here (50.3 MB).
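All of the benchmarks use the standard AMBER command line. As a sketch (mdin/prmtop/inpcrd are generic AMBER file names; substitute the names used in the downloaded archive), a single-GPU run of any of them looks like:
$AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout -r restrt -x mdcrd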
Individual vs Aggregate Performance
A unique feature of AMBER's GPU support that sets it apart from the
likes of Gromacs and NAMD is that it does NOT rely on the CPU to
enhance performance while running on a GPU. This allows one to make
extensive use of all of the GPUs in a multi-GPU node with maximum
efficiency. It also means one can purchase low-cost CPUs, making GPU-accelerated runs with AMBER substantially more cost effective than similar runs with other GPU-accelerated MD codes.
For example, suppose you have a node with 4 GTX-Titan GPUs in it. With a lot of other MD codes you can use one to four of those GPUs, plus a number of CPU cores, for a single job. However, the remaining GPUs are not available for additional jobs without hurting the performance of the first job, since the PCI-E bus and CPU cores are already fully loaded. AMBER is different. During a single-GPU run the CPU and PCI-E bus are barely used. Thus you have the choice of running a single MD run across multiple GPUs, to maximize throughput on a single calculation, or alternatively you can run four completely independent jobs, one on each GPU. In this case each individual run, unlike with a lot of other GPU MD codes, will run at full speed. For this reason AMBER's aggregate throughput on cost-effective multi-GPU nodes massively exceeds that of other codes that rely on constant CPU to GPU communication. This is illustrated below in the plots showing 'aggregate' throughput.
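As an illustration of the independent-job mode (a sketch; the run directories and file names are placeholders), four single-GPU jobs can be pinned one per GPU with the CUDA_VISIBLE_DEVICES environment variable:
# Launch four independent single-GPU AMBER jobs, one per GPU, from four run directories
for gpu in 0 1 2 3; do
    (cd run$gpu && \
     CUDA_VISIBLE_DEVICES=$gpu $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop \
         -c inpcrd -o mdout -r restrt -x mdcrd) &
done
wait    # block until all four runs have finished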
Explicit Solvent (PME) Benchmarks
1) DHFR NVE = 23,558 atoms
Typical Production MD NVE with GOOD energy conservation.
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2, tol=0.000001,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=0, ntb=1, ntp=0,
ioutfm=1,
/
&ewald
dsum_tol=0.000001,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
[Plot: Aggregate throughput (GTX-Titan) - individual runs at the same time on the same node]
2) DHFR NPT = 23,558 atoms
Typical Production MD NPT
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=1, tautp=10.0,
temp0=300.0,
ntb=2, ntp=1, taup=10.0,
ioutfm=1,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
3) Factor IX NVE = 90,906 atoms
Typical Production MD NVE with GOOD energy conservation.
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2, tol=0.000001,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=0, ntb=1, ntp=0,
ioutfm=1,
/
&ewald
dsum_tol=0.000001,nfft1=128,nfft2=64,nfft3=64,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
4) Factor IX NPT = 90,906 atoms
Typical Production MD NPT
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=1, tautp=10.0,
temp0=300.0,
ntb=2, ntp=1, taup=10.0,
ioutfm=1,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
5) Cellulose NVE = 408,609 atoms
Typical Production MD NVE with GOOD energy conservation.
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2, tol=0.000001,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=0, ntb=1, ntp=0,
ioutfm=1,
/
&ewald
dsum_tol=0.000001,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
6) Cellulose NPT = 408,609 atoms
Typical Production MD NPT
&cntrl
ntx=5, irest=1,
ntc=2, ntf=2,
nstlim=10000,
ntpr=1000, ntwx=1000,
ntwr=10000,
dt=0.002, cut=8.,
ntt=1, tautp=10.0,
temp0=300.0,
ntb=2, ntp=1, taup=10.0,
ioutfm=1,
/
[Plot: Single job throughput - a single run on one or more GPUs and one or more nodes]
Implicit Solvent (GB) Benchmarks
1) TRPCage = 304 atoms
&cntrl
imin=0,irest=1,ntx=5,
nstlim=100000,dt=0.002,ntb=0,
ntf=2,ntc=2,tol=0.000001,
ntpr=1000, ntwx=1000, ntwr=50000,
cut=9999.0, rgbmax=15.0,
igb=1,ntt=0,nscm=0,
/
Note: The TRPCage test is too small to make effective use of the very latest GK110 (GTX780/Titan/K20) GPUs, hence the performance advantage of these cards over earlier generation cards is not as pronounced as it is for larger GB systems and PME runs.
2) Myoglobin = 2,492 atoms
&cntrl
imin=0,irest=1,ntx=5,
nstlim=10000,dt=0.002,ntb=0,
ntf=2,ntc=2,tol=0.000001,
ntpr=1000, ntwx=1000, ntwr=50000,
cut=9999.0, rgbmax=15.0,
igb=1,ntt=0,nscm=0,
/
3) Nucleosome = 25,095 atoms
&cntrl
imin=0,irest=1,ntx=5,
nstlim=1000,dt=0.002,ntb=0,
ntf=2,ntc=2,tol=0.000001,
ntpr=100, ntwx=100, ntwr=50000,
cut=9999.0, rgbmax=15.0,
igb=1,ntt=0,nscm=0,
/