

Amber18: pmemd.cuda performance information

A lot has been done to improve the Amber18 code even as it has accumulated new features. Since 2017, our efforts to speed up periodic, explicit-solvent simulations have focused on the non-bonded interactions, on particle-to-mesh mapping prior to the reciprocal-space calculation, and on clustering bonded terms to streamline memory access. We have gained speed both by improving the way we do the math and by trimming the precision model where it does not contribute to accuracy in the final numbers.

The code remains deterministic: a given card will produce the same results from the same input time and again, with the exception of the new Volta architecture, which may introduce non-determinism by breaking warp synchronicity in a way that we are still investigating. As shown on the GPU Numerics page, the precision model is able to conserve energy over long timescales with tight parameters. Throughout the main code and the Thermodynamic Integration extensions, we have chosen equations and workflows with good numerical conditioning, suitable for the 32-bit floating-point math that dominates our calculations. Most production simulations will fail to conserve energy because the input parameters are not tight enough, not because the precision model fails. But, as we say, ntt >= 2!

We present here two complementary sets of benchmarks, which we identify by the names of their primary authors. The Ross Walker set has been run on nearly all versions of pmemd.cuda and on many different GPUs; users may be familiar with these, and results are available for previous versions. A more recent set of benchmarks was created by Dave Cerutti and uses settings closer to those most commonly used in production today.

The "Ross Walker" benchmarks

These are available here for various versions of Amber.

The "Dave Cerutti" benchmarks

We take as benchmarks four periodic systems spanning a range of system sizes and compositions. The smallest case, dihydrofolate reductase (DHFR), is a 159-residue protein in water, weighing in at 23,558 atoms. Next, from the human blood-clotting system, Factor IX is a 379-residue protein, also in a box of water, totaling 90,906 atoms. The larger cellulose system, 408,609 atoms, has a greater content of macromolecules: the repeating sugar polymer constitutes roughly a sixth of the atoms in the system. Finally, the very large simulation of satellite tobacco mosaic virus (STMV), a gargantuan 1,067,095-atom system, also has an appreciable macromolecule content, but is otherwise another collection of proteins in water.

Download the Amber 18 Benchmark Suite.

The figure below indicates the relative performance of Amber16 and Amber18 with default settings as well as "boosted" settings. We take the unbiased metric of trillions of atom moves per day, that is, the number of atoms times the number of time steps executed per day. This puts most systems on roughly equal footing, apart from a few factors. The identity of each system and its atom count are given on the right-hand side of the plot. All simulations were run with a 9 Å cutoff and a 4 fs time step. We maintained the isothermal, isobaric ensemble with a Berendsen thermostat and a Monte-Carlo barostat (trial moves to update the system volume every 100 steps). The production rates of each GPU running Amber16 are given by the solid color bars. Additional production from Amber18 is shown as a pink extension of each bar. Further performance enhancements gained through the new input settings described below are shown as black extensions.
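To make the metric concrete, here is a rough back-of-the-envelope conversion using numbers from the DHFR table further down (our own arithmetic, not an additional measurement): a 4 fs time step means 250,000 steps per nanosecond, so a rate of 1051 ns/day corresponds to about 2.6 x 10^8 steps per day; multiplied by the 23,558 atoms in that system, this works out to roughly 6.2 x 10^12, or about six trillion, atom moves per day.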

In the benchmarks above, Amber18 is up against some headwinds. First, Amber18 includes a new default setting to remove the net force on the system after each time step. This is in keeping with the operation of the CPU code, so that the system does not drift between the removals of net momentum every nscm steps and the thermostats can do their job properly, but it costs about 3% of the total run time. Furthermore, the relative cost of the Monte-Carlo barostat used in the NPT runs grows for the newer, faster GPUs. This may reflect the fact that some of that calculation happens on the CPU core, which is not getting any faster even as the GPUs have tripled in capacity over the limited history we are showing. However, there does appear to be an additional MC barostat performance cost of about 2% that we are working to understand.

Amber18 now allows a limited form of "leaky pair lists" that other codes have employed for some time. The cutoffs on electrostatic and van der Waals interactions neglect about 1 part in 100,000 and 1 part in 1,000 of these energies, respectively. By default, we are careful to refresh the list of all non-bonded interactions whenever it is conceivable that we might start to miss something within the cutoff. In practice, however, particles can be allowed to travel further before the pair list is refreshed, and Amber18 enables this with the skin_permit feature. A value of 0.5 ensures that every pair interaction is calculated; a mid-range value of 0.75 will miss about one interaction in 50 million; and a value of 1.0 (the maximum allowed) will cause about one interaction in a million to be missed. Note that there are measurable effects on energy conservation if one looks hard. The input below omits Amber18's net force correction (matching Amber16's behavior) and then reduces the pair list refresh rate by an amount that we think is safe, granting 8-12% additional speedup over Amber16 in most systems.

&ewald
  netfrc = 0,
  skin_permit = 0.75,
&end
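
For readers who want to see these settings in context, here is a minimal sketch of a complete NPT input along the lines of the benchmarks above. The &cntrl values are our own illustrative choices (for instance, a Langevin thermostat rather than the Berendsen thermostat used in the benchmark runs), not the exact benchmark files, and the 4 fs time step assumes a hydrogen-mass-repartitioned topology. Only the &ewald block differs from a default Amber18 run.

&cntrl
  imin = 0, ntx = 5, irest = 1,             ! restart an equilibrated system
  nstlim = 250000, dt = 0.004,              ! 4 fs step (assumes hydrogen mass repartitioning)
  ntc = 2, ntf = 2, cut = 9.0,              ! SHAKE on bonds to H, 9 Angstrom cutoff
  ntt = 3, gamma_ln = 2.0, temp0 = 300.0,   ! Langevin thermostat (benchmarks used Berendsen, ntt = 1)
  ntp = 1, barostat = 2, mcbarint = 100,    ! Monte-Carlo barostat, volume moves every 100 steps
  ntpr = 1000, ntwx = 10000,
&end
&ewald
  netfrc = 0,                               ! skip the new net-force removal, as Amber16 did
  skin_permit = 0.75,                       ! refresh the pair list less often
&end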

The chart above shows an important general result about system size and throughput. Up until the advent of the GP100 and the Volta cards (Titan-V, V100), the per-atom throughput of a system of 25,000 atoms or more was fairly constant as a function of size (all other settings being equal). For the newer and bigger cards, however, the fault lines in the algorithm are becoming apparent: the faster Pascal and Volta chips hit their stride for systems of perhaps 75,000 to half a million atoms, lose some performance outside that range, and are under-utilized by small systems. (We've even found that it can improve overall throughput to run multiple GPU simulations of small systems simultaneously on these chips; each simulation will run at better than half its normal speed.)

We've put the relative performance benchmarks up front to assure you that we are working hard to make Amber faster and keep up with hardware trends. The improvements can look even greater with the shorter 8 Å cutoff used in previous benchmarks, but we wanted to present the most relevant examples. It is hard to compare codes based on benchmarks, which is another reason we chose the per-atom metric above. For those who just want to see ns/day, the numbers are below.

Dihydrofolate Reductase (JAC, 23,558 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |         885 |                 915 |              1051 |         826 |                 823 |               933
Titan-V     |         827 |                 862 |               985 |         760 |                 765 |               859
GP100       |         647 |                 707 |               782 |         611 |                 662 |               726
Titan-X     |         569 |                 631 |               692 |         528 |                 587 |               639
GTX-1080Ti  |         564 |                 636 |               701 |         539 |                 601 |               655
GTX-980Ti   |         308 |                 361 |               392 |         298 |                 341 |               372
8X CPU      |        18.7 |                17.0 |          GPU only |        19.2 |                18.6 |          GPU only
1X CPU      |         2.9 |                 2.8 |          GPU only |         2.9 |                 2.8 |          GPU only

Factor IX (90,906 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |         348 |                 374 |               420 |         325 |                 334 |               370
Titan-V     |         309 |                 334 |               372 |         286 |                 291 |               322
GP100       |         197 |                 231 |               252 |         189 |                 219 |               238
Titan-X     |         160 |                 184 |               187 |         153 |                 172 |               196
GTX-1080Ti  |         160 |                 183 |               187 |         154 |                 174 |               198
GTX-980Ti   |          89 |                 107 |               116 |          86 |                 103 |               112
8X CPU      |         4.4 |                 4.3 |          GPU only |         4.4 |                 4.3 |          GPU only
1X CPU      |        0.65 |                0.64 |          GPU only |        0.63 |                0.64 |          GPU only

Cellulose (408,609 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |        81.6 |                93.0 |             101.0 |        77.2 |                83.9 |              89.9
Titan-V     |        71.9 |                80.8 |              87.4 |        66.8 |                71.3 |              76.6
GP100       |        43.3 |                52.8 |              56.7 |        41.7 |                50.4 |              53.7
Titan-X     |        34.2 |                39.0 |              41.1 |        33.6 |                37.8 |              40.6
GTX-1080Ti  |        34.3 |                39.4 |              41.8 |        33.4 |                38.2 |              40.8
GTX-980Ti   |        19.1 |                23.2 |              24.7 |        18.5 |                22.4 |              24.0
8X CPU      |        0.74 |                0.87 |          GPU only |        0.80 |                0.78 |          GPU only
1X CPU      |        0.13 |                0.13 |          GPU only |        0.13 |                0.13 |          GPU only

Satellite Tobacco Mosaic Virus (STMV, 1,067,095 atoms), throughput in ns/day

GPU Type    | NVE Amber16 | NVE Amber18 Default | NVE Amber18 Boost | NPT Amber16 | NPT Amber18 Default | NPT Amber18 Boost
V100        |        29.0 |                32.0 |              35.2 |        27.1 |                28.5 |              31.0
Titan-V     |        25.3 |                27.7 |              30.2 |        23.0 |                23.5 |              25.3
GP100       |        15.9 |                18.8 |              20.6 |        15.2 |                17.9 |              19.4
Titan-X     |        12.6 |                14.1 |              14.9 |        12.1 |                13.2 |              14.2
GTX-1080Ti  |        12.5 |                14.0 |              15.0 |        12.1 |                13.4 |              14.3
GTX-980Ti   |         7.1 |                 8.5 |               9.2 |         6.8 |                 8.1 |               8.8
8X CPU      |        0.29 |                0.30 |          GPU only |        0.31 |                0.26 |          GPU only
1X CPU      |        0.05 |                0.05 |          GPU only |        0.05 |                0.05 |          GPU only

While we sought the best possible results on the GPUs we had available during most of the development, the new Volta cards are tricky beasts. To get the best performance out of the V100 and Titan-V, we used a special compilation, adding -volta when configuring the Amber installation. This build reduces the dependence on texture memory in favor of direct L1 caching, which is not available on all of the cards we support. With a last-minute tweak, we were able to preserve some of the performance enhancements we had made on Pascal architectures under CUDA 8 and mitigate the losses experienced on Volta architectures when the -volta flag is not used. However, we still recommend building a separate executable for V100 and Titan-V. The architecture is evolving rapidly, and the upcoming RTX-2080 series (the Turing-based gaming solutions, incorporating aspects of the Volta architecture) will be one to watch. We have a list of things to figure out, but we are building in stress-testing and debugging features as part of our regular code development. We hope to deliver consistent, amazing performance on Volta and newer architectures, without any special compilation flags, in a future patch.
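As a concrete sketch of what that means in practice (our usual workflow; the details may vary with your setup): the Volta-targeted build simply adds the flag to the standard GPU configure step, for example ./configure -cuda -volta gnu with the GNU tool chain, followed by the usual make install, while omitting -volta gives the general-purpose executable used on Pascal and older cards.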

Another result that is obvious from the tables is that the GPU code is vastly superior to CPU execution (here, we used Intel E5-2640 v4 2.4 GHz CPUs). Efforts are underway, in a collaboration with Intel Corporation, to improve the code's scalability on massively parallel CPU architectures. New advances that make pmemd's CPU implementation scale more like NAMD can exceed the throughput of the GPU code on clusters of 1200 or more processor cores. However, even under the most favorable circumstances (Intel 18 compilers, a 3.6 GHz multicore processor with a single core in use so that it monopolizes the cache and memory bandwidth), the performance of a V100 card exceeds that of a single CPU core by 200 to 300 fold in PME simulations and by up to 1000 fold in GB simulations. The results above pertain to a "real-world" setup: multiple serial runs were performed concurrently on separate cores of the same CPU, and the GNU compiler we used delivers about 30% lower performance on the Intel CPUs. For parallel runs on eight or more CPU cores, the compiler has a negligible effect on performance.

The code that executes non-periodic, implicit-solvent GB simulations in Amber18 has not changed much since Amber16, nor has its performance. The way we now handle bonded interactions causes a mild decrease in GB performance for very small systems, but the big story is the roughly inverse-square relationship between system size and GB throughput, compared to the steady per-atom production rates one gets out of PME simulations. A table of numbers in ns/day is given below. Note that all benchmarks have been updated to run with a 4 fs time step.

Generalized Born Timings in Amber18 (ns/day)

GPU Type    | Trp Cage (304 atoms) | Myoglobin (2,492 atoms) | Nucleosome (25,095 atoms)
V100        |                 2703 |                    1798 |                      52.6
Titan-V     |                 2543 |                    1736 |                      52.7
GP100       |                 2896 |                    1223 |                      25.6
Titan-X     |                 2690 |                     870 |                      15.7
GTX-1080Ti  |                 2867 |                     892 |                      16.4
GTX-980Ti   |                 1815 |                     596 |                      11.1
8X CPU      |                  422 |                     9.3 |                       0.1
1X CPU      |                 70.8 |                     1.5 |                      0.02

For production MD, if one merely wants sampling, these timings suggest that the crossover point lies somewhere in the realm of a 4,000-atom protein. In explicit solvent, such a system would contain roughly 40,000 atoms and would be expected to run at about 700 ns/day in an NVT ensemble; the corresponding GB implicit-solvent simulation would produce about as much. GB simulations carry an added sampling advantage in that they remove solvent friction, allowing more extensive side-chain sampling, so even larger biomolecules, perhaps even the 25,095-atom nucleosome, will sample their conformational spaces faster in GB simulations than in explicit solvent. If the accuracy of the solvent conditions is a concern, explicit solvent is preferable.
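To sketch where that crossover estimate comes from (a rough interpolation from the tables above under the approximate inverse-square scaling just described, not an additional measurement): scaling the V100 myoglobin figure of 1798 ns/day from 2,492 atoms up to 4,000 atoms gives roughly 1798 x (2,492 / 4,000)^2, or about 700 ns/day, on the GB side, while on the explicit-solvent side a 40,000-atom box falls between the DHFR (23,558 atoms) and Factor IX (90,906 atoms) throughputs in the tables above, consistent with the roughly 700 ns/day quoted.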

"Insert clever motto here."

Last modified: Aug 30, 2018