

Amber18: pmemd.cuda performance information

A lot has been done to improve the Amber18 code even as it has accumulated new features. Since 2017, our efforts to speed up periodic, explicit-solvent simulations have focused on the non-bonded interactions, on particle-to-mesh mapping prior to the reciprocal-space calculation, and on clustering bonded terms to streamline memory access. We have gained speed both by improving the way we do the math and by trimming the precision model where it does not contribute to accuracy in the final numbers.

The code remains deterministic: a given card will produce the same results from the same input time and again, with the exception of the new Volta architecture, which may introduce non-determinism by breaking warp synchronicity in a way that we are still investigating. As shown on the GPU Numerics page, the precision model is able to conserve energy over long timescales with tight parameters. Throughout the main code and the Thermodynamic Integration extensions, we have chosen equations and workflows with good numerical conditioning, suitable for the 32-bit floating-point math that dominates our calculations. Most production simulations will fail to conserve energy because the input parameters are not tight enough, not because the precision model fails. But, as we say, ntt >= 2!

We present here two complementary sets of benchmarks, which we identify by the names of their primary authors. The Ross Walker set has been run on nearly all versions of pmemd.cuda and on many different GPUs; users may be familiar with these, and results are available for previous versions. A more recent set of benchmarks was created by Dave Cerutti and uses settings closer to those most commonly used in production today.

The "Ross Walker" benchmarks

These are available here for various versions of Amber.

The "Dave Cerutti" benchmarks

We take as benchmarks four periodic systems spanning a range of system sizes and compositions. The smallest case, dihydrofolate reductase (DHFR), is a 159-residue protein in water, weighing in at 23,558 atoms. Next, from the human blood-clotting system, Factor IX is a 379-residue protein, also in a box of water, totaling 90,906 atoms. The larger cellulose system, 408,609 atoms, has a greater content of macromolecules: the repeating sugar polymer constitutes roughly a sixth of the atoms in the system. Finally, the very large simulation of satellite tobacco mosaic virus (STMV), a gargantuan 1,067,095-atom system, also has an appreciable macromolecule content, but is otherwise another collection of proteins in water.

Download the Amber 18 Benchmark Suite.

The figure below indicates the relative performance of Amber16 and Amber18 with default settings as well as "boosted" settings. We take the unbiased metric of trillions of atom moves per day, that is, the number of atoms times the number of time steps executed per day. This puts most systems on roughly equal footing, apart from a few factors. The identity of each system and its atom count are given on the right-hand side of the plot. All simulations were run with a 9 Å cutoff and a 4 fs time step. We maintained the isothermal, isobaric ensemble with a Berendsen thermostat and a Monte-Carlo barostat (trial moves to update the system volume every 100 steps). The production rates of each GPU running Amber16 are given by the solid color bars. Additional production from Amber18 is shown as a pink extension of each bar. Further performance enhancements gained through the new input settings described below are shown as black extensions.
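To make the metric concrete, here is a rough back-of-the-envelope conversion using numbers from the DHFR table further down (our own arithmetic, not an additional measurement): a 4 fs time step means 250,000 steps per nanosecond, so a rate of 1051 ns/day corresponds to about 2.6 x 10^8 steps per day; multiplied by the 23,558 atoms in that system, this works out to roughly 6.2 x 10^12, or about six trillion, atom moves per day.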

In the benchmarks above, Amber18 is up against some headwinds. First, Amber18 includes a new default setting to remove the net force on the system after each time step. This is in keeping with the operation of the CPU code, so that the system does not drift between the removals of net momentum every nscm steps and the thermostats can do their job properly, but it costs about 3% of the total run time. Furthermore, the relative cost of the Monte-Carlo barostat used in the NPT runs grows for the newer, faster GPUs. This may reflect the fact that some of that calculation happens on the CPU core, which is not getting any faster even as the GPUs have tripled in capacity over the limited history we are showing. However, there does appear to be an additional MC barostat performance cost of about 2% that we are working to understand.

Amber18 now allows a limited form of "leaky pair lists" that other codes have employed for some time. The cutoffs on electrostatic and van der Waals interactions neglect about 1 part in 100,000 and 1 part in 1,000 of these energies, respectively. By default, we are careful to refresh the list of all non-bonded interactions whenever it is conceivable that we might start to miss something within the cutoff. In practice, however, particles can be allowed to travel further before the pair list is refreshed, and Amber18 enables this with the skin_permit feature. A value of 0.5 ensures that every pair interaction is calculated; a mid-range value of 0.75 will miss about one interaction in 50 million; and a value of 1.0 (the maximum allowed) will cause about one interaction in a million to be missed. Note that there are measurable effects on energy conservation if one looks hard. The input below omits Amber18's net force correction (matching Amber16's behavior) and then reduces the pair list refresh rate by an amount that we think is safe, granting 8-12% additional speedup over Amber16 in most systems.

&ewald
  netfrc = 0,
  skin_permit = 0.75,
&end
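
For readers who want to see these settings in context, here is a minimal sketch of a complete NPT input along the lines of the benchmarks above. The &cntrl values are our own illustrative choices (for instance, a Langevin thermostat rather than the Berendsen thermostat used in the benchmark runs), not the exact benchmark files, and the 4 fs time step assumes a hydrogen-mass-repartitioned topology. Only the &ewald block differs from a default Amber18 run.

&cntrl
  imin = 0, ntx = 5, irest = 1,             ! restart an equilibrated system
  nstlim = 250000, dt = 0.004,              ! 4 fs step (assumes hydrogen mass repartitioning)
  ntc = 2, ntf = 2, cut = 9.0,              ! SHAKE on bonds to H, 9 Angstrom cutoff
  ntt = 3, gamma_ln = 2.0, temp0 = 300.0,   ! Langevin thermostat (benchmarks used Berendsen, ntt = 1)
  ntp = 1, barostat = 2, mcbarint = 100,    ! Monte-Carlo barostat, volume moves every 100 steps
  ntpr = 1000, ntwx = 10000,
&end
&ewald
  netfrc = 0,                               ! skip the new net-force removal, as Amber16 did
  skin_permit = 0.75,                       ! refresh the pair list less often
&end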

The chart above shows an important general result about system size and throughput. Up until the advent of the GP100 and the Volta cards (Titan-V, V100), the per-atom throughput of a system of 25,000 atoms or more was fairly constant as a function of size (all other settings being equal). For the newer and bigger cards, however, the fault lines in the algorithm are becoming apparent: the faster Pascal and Volta chips hit their stride for systems of perhaps 75,000 to half a million atoms, lose some performance outside that range, and are under-utilized by small systems. (We've even found that it can improve overall throughput to run multiple GPU simulations of small systems simultaneously on these chips; each simulation will run at better than half its normal speed.)

We've put the relative performance benchmarks up front to assure you that we are working hard to make Amber faster and keep up with hardware trends. The improvements can look even greater with the shorter 8 Å cutoff used in previous benchmarks, but we wanted to present the most relevant examples. It is hard to compare codes based on benchmarks, which is another reason we chose the per-atom metric above. For those who just want to see ns/day, the numbers are below.

Dihydrofolate Reductase (JAC, 23,558 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |         885 |                 915 |              1051 |         826 |                 823 |               933
Titan-V     |         827 |                 862 |               985 |         760 |                 765 |               859
GP100       |         647 |                 707 |               782 |         611 |                 662 |               726
Titan-X     |         569 |                 631 |               692 |         528 |                 587 |               639
GTX-1080Ti  |         564 |                 636 |               701 |         539 |                 601 |               655
GTX-980Ti   |         308 |                 361 |               392 |         298 |                 341 |               372
8X CPU      |        18.7 |                17.0 |          GPU only |        19.2 |                18.6 |          GPU only
1X CPU      |         2.9 |                 2.8 |          GPU only |         2.9 |                 2.8 |          GPU only

Factor IX (90,906 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |         348 |                 374 |               420 |         325 |                 334 |               370
Titan-V     |         309 |                 334 |               372 |         286 |                 291 |               322
GP100       |         197 |                 231 |               252 |         189 |                 219 |               238
Titan-X     |         160 |                 184 |               187 |         153 |                 172 |               196
GTX-1080Ti  |         160 |                 183 |               187 |         154 |                 174 |               198
GTX-980Ti   |          89 |                 107 |               116 |          86 |                 103 |               112
8X CPU      |         4.4 |                 4.3 |          GPU only |         4.4 |                 4.3 |          GPU only
1X CPU      |        0.65 |                0.64 |          GPU only |        0.63 |                0.64 |          GPU only

Cellulose (408,609 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |        81.6 |                93.0 |             101.0 |        77.2 |                83.9 |              89.9
Titan-V     |        71.9 |                80.8 |              87.4 |        66.8 |                71.3 |              76.6
GP100       |        43.3 |                52.8 |              56.7 |        41.7 |                50.4 |              53.7
Titan-X     |        34.2 |                39.0 |              41.1 |        33.6 |                37.8 |              40.6
GTX-1080Ti  |        34.3 |                39.4 |              41.8 |        33.4 |                38.2 |              40.8
GTX-980Ti   |        19.1 |                23.2 |              24.7 |        18.5 |                22.4 |              24.0
8X CPU      |        0.74 |                0.87 |          GPU only |        0.80 |                0.78 |          GPU only
1X CPU      |        0.13 |                0.13 |          GPU only |        0.13 |                0.13 |          GPU only

Satellite Tobacco Mosaic Virus (STMV, 1,067,095 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |        29.0 |                32.0 |              35.2 |        27.1 |                28.5 |              31.0
Titan-V     |        25.3 |                27.7 |              30.2 |        23.0 |                23.5 |              25.3
GP100       |        15.9 |                18.8 |              20.6 |        15.2 |                17.9 |              19.4
Titan-X     |        12.6 |                14.1 |              14.9 |        12.1 |                13.2 |              14.2
GTX-1080Ti  |        12.5 |                14.0 |              15.0 |        12.1 |                13.4 |              14.3
GTX-980Ti   |         7.1 |                 8.5 |               9.2 |         6.8 |                 8.1 |               8.8
8X CPU      |        0.29 |                0.30 |          GPU only |        0.31 |                0.26 |          GPU only
1X CPU      |        0.05 |                0.05 |          GPU only |        0.05 |                0.05 |          GPU only

While we sought the best possible results on the GPUs we had available during most of the development, the new Volta cards are tricky beasts. To get the best performance out of the V100 and Titan-V, we used a special compilation, adding -volta when configuring the Amber installation. This build reduces the dependence on texture memory in favor of direct L1 caching, which is not available on all of the cards we support. With a last-minute tweak, we were able to preserve some of the performance enhancements we had made on Pascal architectures under CUDA 8 and mitigate the losses experienced on Volta architectures when the -volta flag is not used. However, we still recommend building a separate executable for V100 and Titan-V. The architecture is evolving rapidly, and the upcoming RTX-2080 series (the Turing-based gaming solutions, incorporating aspects of the Volta architecture) will be one to watch. We have a list of things to figure out, but we are building in stress-testing and debugging features as part of our regular code development. We hope to deliver consistent, amazing performance on Volta and newer architectures, without any special compilation flags, in a future patch.
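As a concrete sketch of what that means in practice (our usual workflow; the details may vary with your setup): the Volta-targeted build simply adds the flag to the standard GPU configure step, for example ./configure -cuda -volta gnu with the GNU tool chain, followed by the usual make install, while omitting -volta gives the general-purpose executable used on Pascal and older cards.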

Another result that is obvious from the tables is that the GPU code is vastly superior to CPU execution (here, we used Intel E5-2640 v4 2.4 GHz CPUs). Efforts are underway, in a collaboration with Intel Corporation, to improve the code's scalability on massively parallel CPU architectures. New advances that make pmemd's CPU implementation scale more like NAMD can exceed the throughput of the GPU code on clusters of 1200 or more processor cores. However, even under the most favorable circumstances (Intel 18 compilers, a 3.6 GHz multicore processor with a single core in use so that it monopolizes the cache and memory bandwidth), the performance of a V100 card exceeds that of a single CPU core by 200 to 300 fold in PME simulations and by up to 1000 fold in GB simulations. The results above pertain to a "real-world" setup: multiple serial runs were performed concurrently on separate cores of the same CPU, and the GNU compiler we used delivers about 30% lower performance on the Intel CPUs. For parallel runs on eight or more CPU cores, the compiler has a negligible effect on performance.

The code that executes non-periodic, implicit-solvent GB simulations in Amber18 has not changed much since Amber16, nor has its performance. The way we now handle bonded interactions causes a mild decrease in GB performance for very small systems, but the big story is the roughly inverse-square relationship between system size and GB throughput, compared to the steady per-atom production rates one gets out of PME simulations. A table of numbers in ns/day is given below. Note that all benchmarks have been updated to run with a 4 fs time step.

Generalized Born Timings in Amber18 (ns/day)

GPU Type    | Trp Cage (304 atoms) | Myoglobin (2,492 atoms) | Nucleosome (25,095 atoms)
V100        |                 2703 |                    1798 |                      52.6
Titan-V     |                 2543 |                    1736 |                      52.7
GP100       |                 2896 |                    1223 |                      25.6
Titan-X     |                 2690 |                     870 |                      15.7
GTX-1080Ti  |                 2867 |                     892 |                      16.4
GTX-980Ti   |                 1815 |                     596 |                      11.1
8X CPU      |                  422 |                     9.3 |                       0.1
1X CPU      |                 70.8 |                     1.5 |                      0.02

For production MD, if one merely wants sampling, these timings suggest that the crossover point lies somewhere in the realm of a 4,000-atom protein. In explicit solvent, such a system would contain roughly 40,000 atoms and would be expected to run at about 700 ns/day in an NVT ensemble; the corresponding GB implicit-solvent simulation would produce about as much. GB simulations carry an added sampling advantage in that they remove solvent friction, allowing more extensive side-chain sampling, so even larger biomolecules, perhaps even the 25,095-atom nucleosome, will sample their conformational spaces faster in GB simulations than in explicit solvent. If the accuracy of the solvent conditions is a concern, explicit solvent is preferable.
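To sketch where that crossover estimate comes from (a rough interpolation from the tables above under the approximate inverse-square scaling just described, not an additional measurement): scaling the V100 myoglobin figure of 1798 ns/day from 2,492 atoms up to 4,000 atoms gives roughly 1798 x (2,492 / 4,000)^2, or about 700 ns/day, on the GB side, while on the explicit-solvent side a 40,000-atom box falls between the DHFR (23,558 atoms) and Factor IX (90,906 atoms) throughputs in the tables above, consistent with the roughly 700 ns/day quoted.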

"Insert clever motto here."

Last modified: Aug 30, 2018