
Amber18: pmemd.cuda performance information

A lot has been done to improve the Amber18 code even as it has accumulated new features. Since 2017, our efforts to speed up periodic, explicit solvent simulations have focused on the non-bonded interactions, the particle-to-mesh mapping prior to the reciprocal space calculation, and clustering of bonded terms to streamline memory access. We have gained speed both by improving the way we do the math and by trimming the precision model where doing so does not impact the accuracy of the MD trajectories.

The code remains deterministic--a given GPU model will produce the same results from the same input time and again. This feature is critical for debugging, has allowed us to validate consumer GeForce cards as suitable for use with AMBER, and has helped us uncover hardware design bugs in several NVIDIA GPU models at the time of their release. As shown on the GPU Numerics page, the precision model is able to conserve energy over long timescales with tight parameters. Throughout the main code and the Thermodynamic Integration extensions we have chosen equations and workflows with good numerical conditioning, suitable for the 32-bit floating point math that dominates our calculations. When production simulations fail to conserve energy, it is usually because the input parameters are not tight enough, not because the precision model fails. But, as we say, ntt >= 2!
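
For readers who want a concrete starting point, the fragment below is a minimal sketch of what that advice looks like in the &cntrl namelist. It is not taken from the benchmark inputs on this page; ntt=3 selects a Langevin thermostat (any ntt >= 2 choice satisfies the quip), and the collision frequency shown is an illustrative value rather than a recommendation.

    &cntrl
      ! ... the rest of your production settings ...
      ntt=3, gamma_ln=2.0,    ! Langevin thermostat; gamma_ln is an illustrative collision frequency (ps^-1)
      temp0=300.0, ig=-1,     ! target temperature; ig=-1 draws a fresh random seed for each run
    /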

We present here two complementary sets of benchmarks, which we identify by the names of their primary authors. The Ross Walker set has been run on nearly all versions of pmemd.cuda and on many different GPUs; users may be familiar with these, and results are available for previous versions. A more recent set of benchmarks, created by Dave Cerutti, uses settings closer to those most commonly used in production today.

The "Ross Walker" benchmarks

These are described in their own section below; results are also available for previous versions of Amber.

The "Dave Cerutti" benchmarks

We take as benchmarks four periodic systems spanning a range of system sizes and compositions. The smallest case, dihydrofolate reductase (DHFR), is a 159-residue protein in water, weighing in at 23,558 atoms. Next, from the human blood clotting system, Factor IX is a 379-residue protein, also in a box of water, totaling 90,906 atoms. The larger cellulose system, 408,609 atoms, has a greater macromolecule content: the repeating sugar polymer constitutes roughly a sixth of the atoms in the system. Finally, the very large simulation of satellite tobacco mosaic virus (STMV), a gargantuan 1,067,095-atom system, also has an appreciable macromolecule content, but is otherwise another collection of proteins in water.

Download the Amber 18 Benchmark Suite.

The figure below indicates the relative performance of Amber16 and Amber18 with default settings as well as "boosted" settings. We take the unbiased metric of trillions of atom moves per day (the number of atoms times the number of time steps completed), which puts most systems on equal footing except for a few factors. The identity of each system and its atom count are given on the right-hand side of the plot. All simulations were run with a 9Å cutoff and a 4fs time step. We maintained the isothermal, isobaric ensemble with a Berendsen thermostat and a Monte-Carlo barostat (trial moves to update the system volume every 100 steps). The production rates of each GPU running Amber16 are given by the solid color bars. Additional production from Amber18 is shown as a pink extension of each bar. Further performance enhancements gained through the new input settings described below are shown as black extensions.
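
For reference, the thermostat and barostat described above correspond to &cntrl settings like the sketch below. This is not a complete input; it mirrors the NPT benchmark inputs listed later on this page, with mcbarint written out only to make the 100-step volume-move interval explicit.

    &cntrl
      ntt=1, tautp=10.0, temp0=300.0,   ! Berendsen (weak-coupling) thermostat
      ntb=2, ntp=1,                     ! constant-pressure periodic boundaries
      barostat=2, mcbarint=100,         ! Monte-Carlo barostat, volume trial move every 100 steps
    /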

In the benchmarks above, Amber18 faces some headwinds. First, Amber18 includes a new default setting that removes the net force on the system after each time step. This matches the behavior of the CPU code and keeps the system from drifting between the removals of net momentum every nscm steps, so that thermostats do their job correctly, but it costs about 3% of the total run time. Furthermore, the relative cost of the Monte-Carlo barostat used in the NPT runs grows on the newer, faster GPUs. This may reflect the fact that part of the calculation happens on the CPU core, which is not getting any faster even as GPU capacity has tripled over the limited history shown here. However, there also appears to be an additional MC barostat performance cost of about 2% that we are working to understand.

Amber18 now allows a limited form of "leaky pair lists" that other codes have employed for some time. There are cutoffs on both electrostatic and van der Waals interactions that neglect about one part in 100,000 and one part in 1,000 of these energies, respectively. By default, we are careful to refresh the list of all non-bonded interactions whenever it is conceivable that we might start to miss something within the cutoff. In practice, however, particles can be allowed to travel further before the pair list is refreshed, and Amber18 enables this with the skin_permit feature. A value of 0.5 will ensure that every pair interaction is calculated. A value of 1.0 (the maximum allowed) will cause about one interaction in a million to be missed, and a mid-range value of 0.75 will miss about one interaction in 50 million. Note that there are measurable effects on energy conservation if one looks closely. The input below omits Amber18's net force correction (like Amber16) and then reduces the pair list refresh rate by an amount that we think is safe, granting an 8-12% additional speedup over Amber16 in most systems.

&ewald
  netfrc = 0,
  skin_permit = 0.75,
&end
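
To show how these lines fit into a full run, here is a sketch of a complete "boosted" NPT input. The &cntrl portion simply mirrors the style of the NPT benchmark inputs later on this page, adjusted to the 9Å cutoff and 4fs time step used in the figure above (dt=0.004 assumes a hydrogen-mass-repartitioned topology); treat the specific values as an example rather than a recommendation.

    Boosted NPT production example (sketch)
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=75000,
      ntpr=1000, ntwx=1000, ntwr=10000,
      dt=0.004, cut=9.0,
      ntt=1, tautp=10.0, temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    &ewald
      netfrc = 0,         ! skip per-step net force removal, as in Amber16
      skin_permit = 0.75, ! refresh the pair list less often (misses roughly 1 interaction in 50 million)
    /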

The chart above shows an important general result about system size and throughput. Up until the advent of GP100, Titan-V, and V100, the run speed of a system of 25,000 atoms or more was fairly constant as a function of size (all other settings being equal). For the newer and bigger cards, however, the fault lines in the algorithm are becoming apparent: the faster Pascal and Volta chips hit their stride for systems of perhaps 75,000 to half a million atoms, lose some performance outside that range, and are under-utilized by small systems. (We've even found that it can improve overall throughput to run multiple simulations of small systems simultaneously on one of these chips--each simulation will run at better than half its normal speed.)

We've put the relative performance benchmarks up front to assure you that we are working hard to make Amber faster and keep up with hardware trends. The improvements can be even greater with the shorter 8Å cutoff used in previous benchmarks, but we wanted to present the most relevant examples. It is hard to compare codes based on benchmarks, which is another reason we chose the per-atom metric above. For those who just want to see ns/day, the numbers are below. Note that the RTX-2080 and RTX-2080Ti numbers are provisional pending further optimizations.

Dihydrofolate Reductase (JAC, 23,558 atoms), throughput in ns/day
GPU Type        NVE Ensemble                                       NPT Ensemble
                Amber16   Amber18 Default  Amber18 Boost       Amber16   Amber18 Default  Amber18 Boost
V100            885       915              1051                826       823              933
Titan-V         827       862              985                 760       765              859
GP100           647       707              782                 611       662              726
Titan-X         569       631              692                 528       587              639
RTX-2080Ti      N/A       915              1033                N/A       847              948
RTX-2080        N/A       751              830                 N/A       691              773
GTX-1080Ti      564       636              701                 539       601              655
GTX-980Ti       308       361              392                 298       341              372
8X CPU          18.7      17.0             GPU Only            19.2      18.6             GPU Only
1X CPU          2.9       2.8              GPU Only            2.9       2.8              GPU Only

Factor IX (90,906 atoms), throughput in ns/day
GPU Type        NVE Ensemble                                       NPT Ensemble
                Amber16   Amber18 Default  Amber18 Boost       Amber16   Amber18 Default  Amber18 Boost
V100            348       374              420                 325       334              370
Titan-V         309       334              372                 286       291              322
GP100           197       231              252                 189       219              238
Titan-X         160       184              187                 153       172              196
RTX-2080Ti      N/A       320              354                 N/A       300              329
RTX-2080        N/A       233              257                 N/A       221              240
GTX-1080Ti      160       183              187                 154       174              198
GTX-980Ti       89        107              116                 86        103              112
8X CPU          4.4       4.3              GPU Only            4.4       4.3              GPU Only
1X CPU          0.65      0.64             GPU Only            0.63      0.64             GPU Only

Cellulose (408,609 atoms), throughput in ns/day
GPU Type        NVE Ensemble                                       NPT Ensemble
                Amber16   Amber18 Default  Amber18 Boost       Amber16   Amber18 Default  Amber18 Boost
V100            81.6      93.0             101.0               77.2      83.9             89.9
Titan-V         71.9      80.8             87.4                66.8      71.3             76.6
GP100           43.3      52.8             56.7                41.7      50.4             53.7
Titan-X         34.2      39.0             41.1                33.6      37.8             40.6
RTX-2080Ti      N/A       68.0             72.6                N/A       64.8             68.6
RTX-2080        N/A       48.2             51.4                N/A       46.1             48.8
GTX-1080Ti      34.3      39.4             41.8                33.4      38.2             40.8
GTX-980Ti       19.1      23.2             24.7                18.5      22.4             24.0
8X CPU          0.74      0.87             GPU Only            0.80      0.78             GPU Only
1X CPU          0.13      0.13             GPU Only            0.13      0.13             GPU Only

Satellite Tobacco Mosaic Virus (STMV, 1,067,095 atoms), throughput in ns/day
GPU Type        NVE Ensemble                                       NPT Ensemble
                Amber16   Amber18 Default  Amber18 Boost       Amber16   Amber18 Default  Amber18 Boost
V100            29.0      32.0             35.2                27.1      28.5             31.0
Titan-V         25.3      27.7             30.2                23.0      23.5             25.3
GP100           15.9      18.8             20.6                15.2      17.9             19.4
RTX-2080Ti      N/A       23.4             25.5                N/A       22.0             23.7
RTX-2080        N/A       16.8             18.1                N/A       15.7             17.0
Titan-X         12.6      14.1             14.9                12.1      13.2             14.2
GTX-1080Ti      12.5      14.0             15.0                12.1      13.4             14.3
GTX-980Ti       7.1       8.5              9.2                 6.8       8.1              8.8
8X CPU          0.29      0.30             GPU Only            0.31      0.26             GPU Only
1X CPU          0.05      0.05             GPU Only            0.05      0.05             GPU Only

While we sought the best possible results on the GPUs we had available during most of the development, the new Volta cards are tricky beasts. In order to get the best performance out of V100 and Titan-V, we used a special compilation with the -volta flag when configuring the Amber installation. This special compilation reduces the dependence on texture memory in favor of direct L1 caching, which is not available on all cards we support. With a last-minute tweak, we were able to conserve some of the performance enhancements we had made on Pascal architectures under CUDA 8 and to mitigate the losses seen on Volta architectures when the -volta flag is not used. However, we still recommend building a dedicated executable for V100 and Titan-V, and we hope to remove the need for this special flag in future updates.

Another result that is obvious from the tables above is that the GPU code is vastly superior to CPU execution (here, we used Intel E5-2640 v4 2.4GHz CPUs). Efforts are underway, in collaboration with Intel Corporation, to improve the code's scalability on massively parallel CPU architectures, and new advances have been able to exceed the throughput of the GPU code on clusters of 1200 processor cores or more. However, even under the most favorable circumstances, the price/performance comparison to NVIDIA GeForce cards remains extremely unfavorable. The results above pertain to a "real-world" setup: multiple serial runs were performed concurrently on separate cores of the same CPU, and the GNU compiler we used results in about 30% lower performance on the Intel CPUs. For parallel runs on eight or more CPU cores, the compiler has a negligible effect on performance.

The code that executes the non-periodic, implicit solvent GB calculations in Amber18 has not changed much since Amber16, nor has its performance. The way we handle bonded interactions results in a mild decrease in GB performance for very small systems, but the big story is that GB throughput falls off roughly as the inverse square of system size, in contrast to the steady per-atom production rates one gets out of PME simulations. A table of numbers in ns/day is given below. Note that all of these benchmarks have been updated to run with a 4fs time step.

Generalized Born Timings in Amber18 (ns/day)
GPU Type        Trp Cage (304 atoms)    Myoglobin (2,492 atoms)    Nucleosome (25,095 atoms)
V100            2703                    1798                       52.6
Titan-V         2543                    1736                       52.7
GP100           2896                    1223                       25.6
RTX-2080Ti      2521                    1252                       30.2
RTX-2080        2400                    1017                       21.1
Titan-X         2690                    870                        15.7
GTX-1080Ti      2867                    892                        16.4
GTX-980Ti       1815                    596                        11.1
8X CPU          422                     9.3                        0.1
1X CPU          70.8                    1.5                        0.02

For production MD, if one merely wants sampling, these timings would suggest that the crossover point lies somewhere in the realm of a 4,000-atom protein. In explicit solvent, such a system would contain roughly 40,000 atoms and would be expected to run at about 700 ns/day in an NVT ensemble; the corresponding GB implicit solvent simulation would produce about as much. GB simulations carry an added sampling advantage in that they remove solvent friction to allow more extensive side chain sampling, so even larger biomolecules, perhaps even the 25,095-atom nucleosome, will sample their conformational spaces faster in GB simulations than in explicit solvent. For more details see: Anandakrishnan, R., Drozdetski, A., Walker, R.C., Onufriev, A.V., "Speed of Conformational Change: Comparing Explicit and Implicit Solvent Molecular Dynamics Simulations", Biophysical Journal, 2015, 108, 1153-1164, DOI: 10.1016/j.bpj.2014.12.047.
If the accuracy of the solvent conditions is a concern, explicit solvent conditions are preferable.

The "Ross Walker" benchmarks

This benchmark suite was originally developed for AMBER10 (support for 4fs HMR was introduced with AMBER14), and while some of the settings for production MD simulations have evolved over time, this benchmark set is still useful for comparing with historical versions of the AMBER software and previous hardware generations.

Results for this benchmark set using previous versions of AMBER and GPU hardware are also available.

List of Benchmarks

Explicit Solvent

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. FactorIX NVE 2fs = 90,906 atoms
  6. FactorIX NPT 2fs = 90,906 atoms
  7. Cellulose NVE 2fs = 408,609 atoms
  8. Cellulose NPT 2fs = 408,609 atoms
  9. STMV NPT HMR 4fs = 1,067,095 atoms

Implicit Solvent

  1. TRPCage 2fs = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file of these benchmarks here (84.1 MB)


Price / Performance

Before looking at the raw throughput of each of the various benchmarks on different GPU models, it is useful to consider price/performance, since NVIDIA GPU prices span a very large range, from the cost-effective GeForce cards to the latest, eye-wateringly expensive Tesla V100 cards. The following plot shows the price/performance ratio relative to the GTX-1080 GPU for current GeForce and Tesla GPUs, at prices as of Oct 2018. Smaller is better.

Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

 Typical Production MD NVE with
    reasonable energy conservation, HMR, 4fs.
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2, tol=0.000001,
      nstlim=75000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.004, cut=8.,
      ntt=0, ntb=1, ntp=0,
      ioutfm=1,
    /
    &ewald
      dsum_tol=0.000001,
    /

Single job throughput
(a single run on one or more GPUs within a single node)

Aggregate throughput (RTX-2080Ti)
(individual runs at the same time on the same node)

2) DHFR NPT HMR 4fs = 23,558 atoms

 Typical Production MD NPT, MC Bar 4fs HMR
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=75000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.004, cut=8.,
      ntt=1, tautp=10.0,
      temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
    reasonable energy conservation 2fs.
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2, tol=0.000001,
      nstlim=75000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=0, ntb=1, ntp=0,
      ioutfm=1,
    /
    &ewald
      dsum_tol=0.000001,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

4) DHFR NPT 2fs = 23,558 atoms

 Typical Production MD NPT, MC Bar 2fs
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=75000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=1, tautp=10.0,
      temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

5) Factor IX NVE 2fs = 90,906 atoms

 Typical Production MD NVE with
    reasonable energy conservation.
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2, tol=0.000001,
      nstlim=15000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=0, ntb=1, ntp=0,
      ioutfm=1,
    /
    &ewald
      dsum_tol=0.000001,
      nfft1=128,nfft2=64,nfft3=64,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

6) Factor IX NPT 2fs = 90,906 atoms

 Typical Production MD NPT, MC Bar 2fs
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=15000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=1, tautp=10.0,
      temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

7) Cellulose NVE 2fs = 408,609 atoms

 Typical Production MD NVE with
    reasonable energy conservation.
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2, tol=0.000001,
      nstlim=10000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=0, ntb=1, ntp=0,
      ioutfm=1,
    /
    &ewald
      dsum_tol=0.000001,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

8) Cellulose NPT 2fs = 408,609 atoms

 Typical Production MD NPT, MC Bar 2fs
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=10000,
      ntpr=1000, ntwx=1000,
      ntwr=10000,
      dt=0.002, cut=8.,
      ntt=1, tautp=10.0,
      temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

9) STMV NPT HMR 4fs = 1,067,095 atoms

 Typical Production MD NPT, MC Bar 4fs HMR
    &cntrl
      ntx=5, irest=1,
      ntc=2, ntf=2,
      nstlim=4000,
      ntpr=1000, ntwx=1000,
      ntwr=4000,
      dt=0.004, cut=8.,
      ntt=1, tautp=10.0,
      temp0=300.0,
      ntb=2, ntp=1, barostat=2,
      ioutfm=1,
    /
    

Single job throughput
(a single run on one or more GPUs within a single node)

Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms


    &cntrl
      imin=0, irest=1, ntx=5,
      nstlim=500000, dt=0.002,
      ntc=2, ntf=2,
      ntt=1, tautp=0.5,
      tempi=325.0, temp0=325.0,
      ntpr=1000, ntwx=1000,ntwr=50000,
      ntb=0, igb=1,
      cut=9999., rgbmax=9999.,
    /
    
Note: The TRPCage system is too small to make effective use of the very latest GPUs, so the advantage of these cards over earlier generations is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.

Single job throughput
(a single run on one or more GPUs within a single node)

2) Myoglobin = 2,492 atoms

 
    &cntrl
      imin=0, irest=1, ntx=5,
      nstlim=50000, dt=0.002, ntb=0,
      ntc=2, ntf=2, 
      ntpr=1000, ntwx=1000, ntwr=10000,
      cut=9999.0, rgbmax=15.0,
      igb=1, ntt=3, gamma_ln=1.0, nscm=0,
      temp0=300.0, ig=-1,
    /
    
Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.

Single job throughput
(a single run on one or more GPUs within a single node)

3) Nucleosome = 25,095 atoms


    &cntrl
      imin=0, irest=1, ntx=5,
      nstlim=1000, dt=0.002,
      ntc=2, ntf=2, ntb=0,
      igb=5, cut=9999.0, rgbmax=15.0,
      ntpr=200, ntwx=200, ntwr=1000,
      saltcon=0.1,
      ntt=1, tautp=1.0,
      nscm=0,
    / 
    

Single job throughput
(a single run on one or more GPUs within a single node)

"How's that for maxed out?"

Last modified: Oct 31, 2018