Amber20: pmemd.cuda performance information

A lot has been done to improve the Amber20 code, and its performance relative to Amber18 is steady despite accumulating new features. Your results may vary, particularly depending on the configuration or operating status of the machines that host your GPUs. If you are running a solitary GPU with good cooling, you can expect numbers up to, or perhaps a little better than, the ones posted here. If you are running the code on a dense server operating at capacity, you can expect the thermal sensors to throttle back the power, and performance will suffer by a few percent.

The code remains deterministic: a given GPU model will produce the same results from the same input time and again. This feature is critical for debugging and has allowed us to validate consumer GeForce cards as suitable for use with AMBER. The precision model is able to conserve energy over long timescales with tight parameters. Throughout the main code and the Thermodynamic Integration extensions we have chosen equations and workflows with good numerical conditioning suitable for the 32-bit floating point math that dominates our calculations. Most production simulations will not conserve energy because the input parameters aren't good enough, not because the precision model fails. But, as we say, ntt >= 2!


A bit of history

Amber developers have been working on GPUs for more than a decade. Here is an overview of how performance has changed, due to both software and hardware improvements, over that time span. We use a smallish benchmark here, with 23,000 atoms.


We present here two complementary sets of benchmarks, which we identify by the names of their primary authors. The older set, created by Ross Walker, has been run on nearly all versions of pmemd.cuda, and on many different GPUs dating to 2010. Users may be accustomed to these, and results are available for previous versions. A more recent set of benchmarks, created by Dave Cerutti, uses settings closer to those common in production today.

Recent Turing and Ampere benchmarks from Exxact

Some October, 2020 benchmarks for Ampere and Turing cards can be found here:

Original benchmarks (from Ross Walker)

These are available here, for various versions of Amber:

Updated benchmarks (from Dave Cerutti)

We take as benchmarks four periodic systems spanning a range of system sizes and compositions. The smallest case, Dihydrofolate Reductase (DHFR), is a 159-residue protein in water, weighing in at 23,558 atoms. Next, from the human blood clotting system, Factor IX is a 379-residue protein, also in a box of water, for a total of 90,906 atoms. The larger cellulose system, 408,609 atoms, has a greater content of macromolecules in it: the repeating sugar polymer constitutes roughly a sixth of the atoms in the system. Finally, the very large simulation of satellite tobacco mosaic virus (STMV), a gargantuan 1,067,095 atom system, also has an appreciable macromolecule content, but is otherwise another collection of proteins in water.

Download the Amber 20 Benchmark Suite.

The figure below indicates the relative performance of Amber16 and Amber18, which is still representative of Amber20, with default settings as well as "boosted" settings. We take the unbiased metric of trillions of moves made against individual atoms per day. This puts most systems on equal footing, but for a few factors. The identity of each system and its atom count are given on the right hand side of the plot. All simulations were run with a 9Å cutoff and a 4fs time step. We maintained the isothermal, isobaric ensemble with a Berendsen thermostat and Monte-Carlo barostat (trial moves to update the system volume every 100 steps). The production rates of each GPU running Amber16 are given by the solid color bars. Additional production from Amber18 is shown as a pink extension of each bar. Further performance enhancements gained through the "boost" input settings described below are shown as black extensions.
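To make the per-atom metric concrete, here is a minimal Python sketch of how "atom moves per day" can be derived from a ns/day figure. It assumes the metric is simply the atom count multiplied by the number of time steps completed per day, and that the ns/day values in the tables further down this page were obtained with the 4fs time step described above. The function name is ours, not part of Amber.

```python
def atom_moves_per_day(ns_per_day: float, n_atoms: int, dt_fs: float = 4.0) -> float:
    """Atom count times MD steps completed per day (assumed definition)."""
    steps_per_day = ns_per_day * 1.0e6 / dt_fs  # 1 ns = 1e6 fs
    return n_atoms * steps_per_day


if __name__ == "__main__":
    # (system, atom count, ns/day) for V100, NVE, default settings,
    # copied from the ns/day tables on this page.
    rows = [
        ("DHFR", 23_558, 934.0),
        ("STMV", 1_067_095, 30.4),
    ]
    for name, n_atoms, rate in rows:
        moves = atom_moves_per_day(rate, n_atoms)
        print(f"{name}: {moves / 1e12:.2f} trillion atom moves per day")
```

Under these assumptions the V100 works out to roughly 5.5 trillion atom moves per day on DHFR and 8.1 trillion on STMV, which is why a per-atom metric puts systems of very different sizes on a comparable scale.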

Since Amber18, the code has included a default setting to eliminate net force on the system after each time step. This is in keeping with the operations of the CPU code, so that the system does not wander until its net momentum is stopped every nscm steps and thermostats do their job right. In the current release, it costs about 3% of the total run time, but the deleterious effects it is preventing may already be mitigated by conservative settings in the force calculation, particularly those affecting the accuracy of the PME grid. One of our developers is working on what we believe is a 'best of both worlds' solution which will correct net forces during the PME grid calculation specifically, and will be obligatory for negligible additional cost. This feature will be available soon.

Since release 18, Amber has allowed a limited form of "leaky pair lists" that other codes have employed for some time. There are (always) cutoffs on both electrostatic and van der Waals interactions that neglect about 1 part in 100,000 or 1 part in 1,000 of these energies, respectively. By default, we are careful to refresh the list of all non-bonded interactions whenever it is conceivable that we might start to miss something within the cutoff. However, in practice, we can allow particles to travel further before refreshing the pair list, and Amber18 enables this with the skin_permit feature. A value of 0.5 will ensure that every pair interaction is calculated. A value of 1.0 (the maximum allowed) will cause about one interaction out of a million to be missed, and a mid-range value of 0.75 will miss about one interaction in 50 million. Note that there are real effects on energy conservation if one is looking hard. The input below will omit the net force correction and then reduce the pair list refresh rate by an amount that we think is safe, granting 8-12% additional speedup in most systems.

&ewald
  netfrc = 0,
  skin_permit = 0.75,
&end
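If you generate input files from scripts, a small helper can keep skin_permit inside the range discussed above. The following Python sketch is only an illustration: the function name is hypothetical, and treating 0.5 as the lower bound is our assumption based on the statement that 0.5 calculates every pair interaction.

```python
def ewald_boost_namelist(skin_permit: float = 0.75, netfrc: int = 0) -> str:
    """Return an &ewald namelist block with the "boost" settings.

    Assumption: skin_permit is meaningful between 0.5 (no missed pairs)
    and 1.0 (the stated maximum); netfrc is an on/off switch (1 or 0).
    """
    if not 0.5 <= skin_permit <= 1.0:
        raise ValueError("skin_permit should lie between 0.5 and 1.0")
    if netfrc not in (0, 1):
        raise ValueError("netfrc should be 0 or 1")
    return (
        "&ewald\n"
        f"  netfrc = {netfrc},\n"
        f"  skin_permit = {skin_permit},\n"
        "&end\n"
    )


if __name__ == "__main__":
    # Prints the same block as shown above.
    print(ewald_boost_namelist(), end="")
```

The returned text can be appended to an mdin file after the &cntrl namelist.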

The chart above shows an important general result about system size and throughput. Up until the advent of GP100, Titan V, and Volta, the run speed of a system of 25,000 atoms or more was pretty constant as a function of size (all other settings being equal). However, for the newer and bigger cards, the fault lines in the algorithm are becoming apparent: the faster Pascal and Volta chips hit their stride for systems of perhaps 75,000 to half a million atoms, lose some performance outside that range, and are under-utilized by small systems. (We've even found that it can improve overall throughput to run multiple GPU simulations of small systems simultaneously on these chips: each simulation will run at more than half its normal speed. The Multi-Process Service (MPS) feature of CUDA makes this work best, although it is only effective on the newest architectures (Volta, Turing). With MPS enabled and multiple replicas engaged on the same GPU, the smaller DHFR benchmark surges to the head of the pack in terms of atoms moved per unit time in the chart above. The benefits of MPS parallelism vanish for systems larger than perhaps 100,000 atoms.)

We've put the relative performance benchmarks up front to assure you that we are working hard to make Amber faster and keep up with hardware trends. The improvements can be even greater with the shorter 8Å cutoff used in previous benchmarks, but we wanted to present the most relevant examples. Because of the particular settings, it's not as easy as it seems to compare codes based on benchmarks, which is another reason we chose to compare performance across system sizes with the per-atom metric above. For those who just want to see ns/day, the numbers are below.

Dihydrofolate Reductase (JAC, 23,558 atoms), Amber20, ns/day
GPU Type      NVE Default   NVE Boost   NPT Default   NPT Boost
V100          934           1059        895           1003
Titan-V       920           1048        893           1011
RTX-5000      683           754         666           734
GP100         745           836         720           801
Titan-X       631           692         587           639
RTX-2080Ti    915           1033        847           948
RTX-2080      751           830         691           773
GTX-1080Ti    680           752         661           721
GTX-980Ti     361           392         341           372
8X CPU        17.0          GPU Only    18.6          GPU Only
1X CPU        2.8           GPU Only    2.8           GPU Only

Factor IX (90,906 atoms), Amber20, ns/day
GPU Type      NVE Default   NVE Boost   NPT Default   NPT Boost
V100          365           406         345           384
Titan-V       345           387         328           365
RTX-5000      218           235         212           227
GP100         236           260         229           251
Titan-X       184           187         172           196
RTX-2080Ti    320           354         300           329
RTX-2080      233           257         221           240
GTX-1080Ti    196           213         189           204
GTX-980Ti     107           116         103           112
8X CPU        4.3           GPU Only    4.3           GPU Only
1X CPU        0.64          GPU Only    0.64          GPU Only

Cellulose (408,609 atoms), Amber20, ns/day
GPU Type      NVE Default   NVE Boost   NPT Default   NPT Boost
V100          88.9          96.2        84.3          90.8
Titan-V       82.2          87.9        78.1          83.2
RTX-5000      46.6          48.7        45.1          47.3
GP100         53.8          57.5        52.0          55.3
Titan-X       39.0          41.1        37.8          40.6
RTX-2080Ti    68.0          72.6        64.8          68.6
RTX-2080      48.2          51.4        46.1          48.8
GTX-1080Ti    42.9          45.3        40.8          43.2
GTX-980Ti     23.2          24.7        22.4          24.0
8X CPU        0.87          GPU Only    0.78          GPU Only
1X CPU        0.13          GPU Only    0.13          GPU Only

Satellite Tobacco Mosaic Virus (STMV, 1,067,095 atoms), Amber20, ns/day
GPU Type      NVE Default   NVE Boost   NPT Default   NPT Boost
V100          30.4          33.5        28.1          31.3
Titan-V       28.0          30.5        26.4          28.7
RTX-5000      16.3          17.4        15.6          16.6
GP100         19.1          20.9        18.3          19.9
RTX-2080Ti    23.4          25.5        22.0          23.7
RTX-2080      16.8          18.1        15.7          17.0
Titan-X       14.1          14.9        13.2          14.2
GTX-1080Ti    14.4          15.4        13.9          14.6
GTX-980Ti     8.5           9.2         8.1           8.8
8X CPU        0.30          GPU Only    0.26          GPU Only
1X CPU        0.05          GPU Only    0.05          GPU Only
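The gain from the "boost" settings can be read off these tables directly. As a small illustration, the Python sketch below computes the percentage speedup for the V100 NVE columns of the four systems; the numbers are copied from the tables above, and the function name is ours.

```python
def boost_gain_percent(default_ns_day: float, boost_ns_day: float) -> float:
    """Percentage speedup of the boosted run over the default run."""
    return 100.0 * (boost_ns_day / default_ns_day - 1.0)


# V100, NVE ensemble: (default, boost) in ns/day, from the tables above.
V100_NVE = {
    "DHFR": (934.0, 1059.0),
    "Factor IX": (365.0, 406.0),
    "Cellulose": (88.9, 96.2),
    "STMV": (30.4, 33.5),
}

if __name__ == "__main__":
    for system, (default, boost) in V100_NVE.items():
        gain = boost_gain_percent(default, boost)
        print(f"{system}: {gain:.1f}% faster with boost settings")
```

For this card the gains fall between roughly 8% and 13%, in line with the speedup range quoted for the boost settings earlier on this page.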

We have always sought the best possible results on a range of GPU hardware. In early versions of Amber18, there was a special -volta flag for optimizing performance on the advanced V100 and Titan-V cards. This is no longer needed in Amber20: standard compilation will always deliver the best results.

The code that executes the non-periodic, implicit solvent GB in Amber20 has not changed much since Amber16, nor has its performance. The way we do bonded interactions results in a mild decrease in GB performance for very small systems, but the big story is that GB performance as a whole falls off roughly as the inverse square of the system size, in contrast to the steady production rates one gets out of PME simulations. A table of numbers in ns/day is given below. Note that all benchmarks have been updated to run on a 4fs time step.

Generalized Born Timings in Amber20 (ns/day)
GPU Type      Trp Cage (304 atoms)   Myoglobin (2,492 atoms)   Nucleosome (25,095 atoms)
V100          2801                   1725                      48.5
Titan-V       2782                   1681                      43.9
RTX-5000      2482                   1060                      23.3
GP100         2986                   1223                      25.0
RTX-2080Ti    2521                   1252                      30.2
RTX-2080      2400                   1017                      21.1
Titan-X       2690                   870                       15.7
GTX-1080Ti    2933                   875                       18.4
GTX-980Ti     1815                   596                       11.1
8X CPU        422                    9.3                       0.1
1X CPU        70.8                   1.5                       0.02
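As a rough check of how GB throughput depends on system size, one can estimate an effective exponent p in rate ~ N^(-p) from two rows of the table. The Python sketch below does this for the Myoglobin and Nucleosome columns. It is only indicative: the two benchmarks are different molecules, may use different GB settings, and the function name is ours.

```python
import math


def scaling_exponent(n_small: int, rate_small: float,
                     n_large: int, rate_large: float) -> float:
    """Effective p such that rate ~ N**(-p), from two (N, rate) points."""
    return math.log(rate_small / rate_large) / math.log(n_large / n_small)


if __name__ == "__main__":
    # Atom counts and ns/day copied from the GB table above.
    n_myo, n_nuc = 2_492, 25_095
    for label, myo, nuc in [("V100", 1725.0, 48.5), ("1X CPU", 1.5, 0.02)]:
        p = scaling_exponent(n_myo, myo, n_nuc, nuc)
        print(f"{label}: effective exponent p = {p:.2f}")
```

With these numbers the single CPU core comes out close to the quadratic cost expected for all-pairs GB (p near 1.9), while the V100 gives a smaller effective exponent (about 1.5). One plausible reading is that the smaller system does not fully occupy a large GPU.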

For production MD, if one merely wants sampling, these timings would suggest that the crossover point lies somewhere in the realm of a 4,000 atom protein. In explicit solvent, such a system would contain roughly 40,000 atoms and would be expected to run about 700 ns/day in an NVT ensemble. The corresponding GB implicit solvent simulation would also produce about as much. GB simulations carry an added sampling advantage in that they remove solvent friction to allow more extensive side chain sampling, so larger biomolecules, perhaps even the 25,095 atom nucleosome, will sample their conformational spaces faster in GB simulations than in explicit solvent. For more details see: Anandakrishnan, R., Drozdetski, A., Walker, R.C., Onufriev, A.V., "Speed of Conformational Change: Comparing Explicit and Implicit Solvent Molecular Dynamics Simulations", Biophysical Journal, 2015, 108, 1153-1164, DOI: 10.1016/j.bpj.2014.12.047. If the accuracy of the solvent conditions is a concern, explicit solvent conditions are preferable.

Original benchmarks (from Ross Walker)

This benchmark suite was originally developed for AMBER10 (support for 4fs HMR was introduced with AMBER14). While some of the settings for production MD simulations have evolved over time, this benchmark set is still useful for comparing with historical versions of the AMBER software and previous hardware generations.

Results for this benchmark set using previous versions of AMBER and GPU hardware are available here:

List of Benchmarks

Explicit Solvent

  1. DHFR NVE HMR 4fs = 23,558 atoms
  2. DHFR NPT HMR 4fs = 23,558 atoms
  3. DHFR NVE 2fs = 23,558 atoms
  4. DHFR NPT 2fs = 23,558 atoms
  5. FactorIX NVE 2fs = 90,906 atoms
  6. FactorIX NPT 2fs = 90,906 atoms
  7. Cellulose NVE 2fs = 408,609 atoms
  8. Cellulose NPT 2fs = 408,609 atoms
  9. STMV NPT HMR 4fs = 1,067,095 atoms

Implicit Solvent

  1. TRPCage 2fs = 304 atoms
  2. Myoglobin = 2,492 atoms
  3. Nucleosome = 25,095 atoms

You can download a tar file of these benchmarks here (84.1 MB)


Price / Performance

Before looking at the raw throughput of each benchmark on different GPU models, it is useful to consider price/performance, since NVIDIA GPU prices span a very large range, from the cost-effective GeForce cards to the latest eye-wateringly expensive Tesla V100 cards. The following plot shows the price / performance ratio relative to the GTX1080 GPU for current GeForce and Tesla GPUs at prices as of Oct 2018. Smaller is better.

Explicit Solvent PME Benchmarks

1) DHFR NVE HMR 4fs = 23,558 atoms

Typical Production MD NVE with
  reasonable energy conservation, HMR, 4fs.
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2, tol=0.000001,
    nstlim=75000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.004, cut=8.,
    ntt=0, ntb=1, ntp=0,
    ioutfm=1,
  &end
  &ewald
    dsum_tol=0.000001,
  &end
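For orientation, the length of a benchmark run and the resulting throughput follow directly from nstlim and dt. The Python sketch below shows the arithmetic, assuming dt is given in picoseconds as in the input above. The wall-clock time used in the example is hypothetical, not a measured value, and the function names are ours.

```python
SECONDS_PER_DAY = 86_400.0


def simulated_ns(nstlim: int, dt_ps: float) -> float:
    """Simulated time in nanoseconds for nstlim steps of dt_ps picoseconds."""
    return nstlim * dt_ps / 1000.0


def ns_per_day(nstlim: int, dt_ps: float, wall_seconds: float) -> float:
    """Throughput in ns/day given the elapsed wall-clock time of the run."""
    return simulated_ns(nstlim, dt_ps) * SECONDS_PER_DAY / wall_seconds


if __name__ == "__main__":
    # nstlim and dt taken from the benchmark input above.
    length = simulated_ns(75_000, 0.004)
    print(f"Run length: {length:.1f} ns")
    # Hypothetical wall-clock time of 30 s, for illustration only.
    print(f"Throughput: {ns_per_day(75_000, 0.004, 30.0):.0f} ns/day")
```

The benchmark above therefore covers 0.3 ns of simulated time, so on a fast GPU the whole run finishes in well under a minute.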
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

Aggregate throughput (RTX-2080TI)
  (individual runs at the same time on the same node)


 

2) DHFR NPT HMR 4fs = 23,558 atoms

Typical Production MD NPT, MC Bar 4fs HMR
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2,
    nstlim=75000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.004, cut=8.,
    ntt=1, tautp=10.0,
    temp0=300.0,
    ntb=2, ntp=1, barostat=2,
    ioutfm=1,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

3) DHFR NVE 2fs = 23,558 atoms

 Typical Production MD NVE with
  reasonable energy conservation 2fs.
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2, tol=0.000001,
    nstlim=75000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=0, ntb=1, ntp=0,
    ioutfm=1,
  &end
  &ewald
    dsum_tol=0.000001,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

4) DHFR NPT 2fs = 23,558 atoms

 Typical Production MD NPT, MC Bar 2fs
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2,
    nstlim=75000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=1, tautp=10.0,
    temp0=300.0,
    ntb=2, ntp=1, barostat=2,
    ioutfm=1,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

5) Factor IX NVE 2fs = 90,906 atoms

 Typical Production MD NVE with
  reasonable energy conservation.
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2, tol=0.000001,
    nstlim=15000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=0, ntb=1, ntp=0,
    ioutfm=1,
  &end
  &ewald
    dsum_tol=0.000001,
    nfft1=128,nfft2=64,nfft3=64,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

6) Factor IX NPT 2fs = 90,906 atoms

 Typical Production MD NPT, MC Bar 2fs
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2,
    nstlim=15000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=1, tautp=10.0,
    temp0=300.0,
    ntb=2, ntp=1, barostat=2,
    ioutfm=1,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

7) Cellulose NVE 2fs = 408,609 atoms

 Typical Production MD NVE with
  reasonable energy conservation.
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2, tol=0.000001,
    nstlim=10000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=0, ntb=1, ntp=0,
    ioutfm=1,
  &end
  &ewald
    dsum_tol=0.000001,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

8) Cellulose NPT 2fs = 408,609 atoms

 Typical Production MD NPT, MC Bar 2fs
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2,
    nstlim=10000,
    ntpr=1000, ntwx=1000,
    ntwr=10000,
    dt=0.002, cut=8.,
    ntt=1, tautp=10.0,
    temp0=300.0,
    ntb=2, ntp=1, barostat=2,
    ioutfm=1,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

9) STMV NPT HMR 4fs = 1,067,095 atoms

 Typical Production MD NPT, MC Bar 4fs HMR
  &cntrl
    ntx=5, irest=1,
    ntc=2, ntf=2,
    nstlim=4000,
    ntpr=1000, ntwx=1000,
    ntwr=4000,
    dt=0.004, cut=8.,
    ntt=1, tautp=10.0,
    temp0=300.0,
    ntb=2, ntp=1, barostat=2,
    ioutfm=1,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

Implicit Solvent GB Benchmarks

1) TRPCage = 304 atoms

  &cntrl
    imin=0, irest=1, ntx=5,
    nstlim=500000, dt=0.002,
    ntc=2, ntf=2,
    ntt=1, tautp=0.5,
    tempi=325.0, temp0=325.0,
    ntpr=1000, ntwx=1000,ntwr=50000,
    ntb=0, igb=1,
    cut=9999., rgbmax=9999.,
  &end
  

Note: The TRPCage test is too small to make effective use of the very latest GPUs. The performance gain of these cards over earlier generations is not as pronounced as it is for larger GB systems and PME runs. This system is also too small to run effectively over multiple GPUs.

Single job throughput
(a single run on one or more GPUs within a single node)


 

2) Myoglobin = 2,492 atoms

 
  &cntrl
    imin=0, irest=1, ntx=5,
    nstlim=50000, dt=0.002, ntb=0,
    ntc=2, ntf=2, 
    ntpr=1000, ntwx=1000, ntwr=10000,
    cut=9999.0, rgbmax=15.0,
    igb=1, ntt=3, gamma_ln=1.0, nscm=0,
    temp0=300.0, ig=-1,
  &end
  

Note: This test case is too small to make effective use of multiple GPUs when using the latest hardware.

Single job throughput
(a single run on one or more GPUs within a single node)


 

3) Nucleosome = 25,095 atoms

  &cntrl
    imin=0, irest=1, ntx=5,
    nstlim=1000, dt=0.002,
    ntc=2, ntf=2, ntb=0,
    igb=5, cut=9999.0, rgbmax=15.0,
    ntpr=200, ntwx=200, ntwr=1000,
    saltcon=0.1,
    ntt=1, tautp=1.0,
    nscm=0,
  &end
  

Single job throughput
(a single run on one or more GPUs within a single node)


 

"How's that for maxed out?"

Last modified: Mar 24, 2021