This page has three sections: Considerations for Maximizing GPU Performance, System Size Limits, and Accuracy Considerations.
Considerations for Maximizing GPU Performance
There are a number of considerations, above and beyond those that apply to CPU runs, for
maximizing the performance of a GPU-accelerated PMEMD simulation. The following tips help
ensure good performance.
- Avoid using small values of NTPR, NTWX, NTWV, NTWE and NTWR. Writing to
the output, restart and trajectory files too often can hurt performance even on CPU runs,
and this is more acute for GPU accelerated simulations because there is a substantial cost
in copying data to and from the GPU. Performance is maximized when CPU to GPU memory
synchronizations are minimized. This is achieved by computing as much as possible on the
GPU and only copying back to CPU memory when absolutely necessary. There is additional
overhead because performance is boosted by calculating energies only when they are actually
needed, so setting NTPR or NTWE to low values results in excessive energy calculations. You
should not set any of these values below 100 (except 0 to disable them) and ideally use
values of 500 or more. For a million-atom system, just writing the restart file every 1000
steps can account for 10% of the run time with a modern GPU (the GPU is faster, but the disk
writing is not). Ideally, set NTWR to 100,000 or more, or just let it default to nstlim.
(See the example input after this list.)
- Avoid setting ntave != 0. Turning on the printing of running averages means the code
must calculate both energies and forces on every step. This can lead to a performance loss
of 8% or more when running on the GPU. It can also affect performance on CPU runs, although
the difference is not as marked. Similar arguments apply to setting ene_avg_sampling to
small values.
- Avoid using the NPT ensemble (NTB=2) when it is not required; if constant pressure is
needed, use the Monte Carlo barostat (barostat=2). Performance will generally be
NVE > NVT > NPT (NVT ~ NPT with barostat=2, although on modern GPUs the CPU-side work of
the barostat can become a bottleneck and impose a larger performance hit).
- Set netfrc = 0 in the &ewald namelist if feasible. This feature is new in Amber18 and is
not something that will help energy conservation in NVE simulations; if anything,
netfrc = 1 will unmask upward energy drift from other sources. It will help the thermostat
maintain the exact specified temperature, but if that is not critical then turn it off and
save a few percent of the wall time in your calculation (see the example input after this
list).
- Set skin_permit = 0.75, also in the &ewald namelist. This new feature generalizes
something that has always been in Amber. The default setting, 0.5, reproduces the original
behavior: the pair list is rebuilt as soon as any one particle has traveled up to half the
margin of safety (the "skin") from the position it occupied when the pair list was last
built. If one particle has traveled half that distance, another may have as well, so it is
possible that the pair list is not counting every interaction within the specified cutoff.
Notice the wording: even under the default setting, and historically, we cannot absolutely
guarantee that the pair list has never been violated, because we only rebuild the pair list
after the possibility of a violation has been detected. Violations are extremely unlikely:
perhaps one interaction in a trillion has been missed. With the new skin_permit input,
however, we can ask that the pair list be rebuilt only when a particle has traveled 75%
(0.75, recommended) or even 100% of the margin of safety. Even if particles are allowed to
travel that far, it takes two particles traveling towards one another to form an interaction
that the existing pair list does not know about. Our tests indicate that
skin_permit = 0.75 will allow about one in fifty million interactions at the periphery of
the cutoff sphere to be missed, and skin_permit = 1.00 about one in a million. While we can
detect a marginal amount of heating when skin_permit is turned up to its maximum value of
1.0, the lower value of 0.75 confers most of the benefit (fewer occasions to rebuild the
pair list, and a speedup of as much as 10%) for negligible risk. As with any new feature,
use it with caution, but for very large systems (more chances for one particle to trigger a
rebuild) and longer time steps we were seeing pair list rebuilds every other step, which we
are confident is unnecessary.
- Avoid the use of GBSA in implicit solvent GB simulations unless required.
The GBSA term is calculated on the CPU and thus requires a synchronization between GPU and
CPU memory on every MD step as opposed to every NTPR or NTWX steps when running without
this option.
- Do not assume that the GPU will always be faster for small systems. For GB simulations
of fewer than about 150 atoms and PME simulations of fewer than about 4,000 atoms, it is
not uncommon for the CPU version of the code to outperform the GPU version on a single
node. The performance differential between GPU and CPU runs typically increases with atom
count.
- For some small systems, the GPU program will refuse to start unless
supplied with an additional command line argument "-AllowSmallBox". This is
deliberate: there is a known vulnerability in the pair list that can cause systems which
are less than three times the extended cutoff in any direction to miss non-bonded
interactions. For this reason, such systems are flagged. The CPU code will likely run them
faster in any case, given their low atom counts.
- Turn off ECC (Tesla models C2050 and later). ECC can cost you up to
10% in performance. You should verify that your GPUs are not giving ECC errors
before attempting this. You can turn this off on Tesla C2050 based cards and later by
running the following command for each GPU ID as root, followed by a reboot:
nvidia-smi -g 0 --ecc-config=0 (repeat with -g x for each GPU ID)
- Extensive testing of AMBER on a wide range of hardware has established that
ECC has little to no benefit for the reliability of AMBER simulations. This is part of the
reason it is acceptable to use GeForce gaming cards for AMBER simulations (see recommended
hardware) and why we recommend them. For more details on ECC and MD simulations see the
following paper: Betz, R.M., DeBardeleben, N.A., Walker, R.C., "An investigation of the
effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics
simulations", Concurrency and Computation: Practice and Experience, 2014, DOI:
10.1002/cpe.3232.
- If you see poor performance when running more than one multi-GPU job, make sure you turn
off thread affinity within your MPI implementation, or at least pin each MPI task to a
different core. For example, if you run two dual-GPU jobs and they do not both run at the
speed of either job given free rein over the machine, you may need to make this change.
MPICH may not enable affinity by default, in which case no special settings are needed.
However, both MVAPICH and OpenMPI set thread affinity by default. This would actually be
useful if they did it in an intelligent way; however, they appear to pay no attention to
load, or even to other MVAPICH or OpenMPI runs, and always just assign from core 0. So two
2-GPU jobs are, rather foolishly, both assigned to cores 0 and 1. The simplest solution is
to disable thread affinity as follows:
MVAPICH: export MV2_ENABLE_AFFINITY=0; mpirun -np 2 ...
OpenMPI: mpirun --bind-to none -np 2 ...
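To make the settings above concrete, the sketch below shows a minimal production-style mdin
that collects the recommended choices (infrequent output, ntave = 0, the Monte Carlo
barostat when NPT is required, and the &ewald options netfrc and skin_permit). The specific
values of nstlim, dt, cut, temp0 and gamma_ln are placeholders for illustration only; adjust
them for your system and check that the &ewald options are supported by your Amber version.
 &cntrl
   imin=0, irest=1, ntx=5,              ! restart a production MD run
   nstlim=500000, dt=0.002,             ! placeholder run length and time step
   ntc=2, ntf=2, cut=8.0,
   ntt=3, gamma_ln=2.0, temp0=300.0, ig=-1,
   ntb=2, ntp=1, barostat=2,            ! Monte Carlo barostat if NPT is required
   ntpr=500, ntwx=500, ntwe=0,          ! write energies/coordinates infrequently
   ntwr=100000, ntave=0,                ! infrequent restarts, no running averages
 /
 &ewald
   netfrc=0,                            ! skip the net-force correction if not needed
   skin_permit=0.75,                    ! rebuild the pair list less often
 /
Only the output frequencies (ntpr, ntwx, ntwe, ntwr), ntave, barostat and the &ewald options
are the subject of the list above; the remaining lines are ordinary production-MD input
included just to make the sketch complete.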
System Size Limits
In order to obtain the large speedups that we see with GPUs, it is critical that the entire
calculation take place on the GPU, within GPU memory. This avoids the performance hit from
copying data to and from the GPU and allows us to achieve substantial speedups for systems
of practical size. The consequence is that the entire calculation must fit within GPU
memory. We also make use of a number of scratch arrays to achieve high performance, so GPU
memory usage can be higher than that of a typical CPU run. It also means, due to the way
parallel GPU support was initially implemented, that memory usage per GPU does NOT decrease
as you increase the number of GPUs. This is something we hope to fix in the future, but for
the moment the atom count limit imposed by GPU memory is roughly constant whether you run in
serial or in parallel.
Since, unlike a CPU, it is not possible to add more memory to a GPU (without replacing it
entirely), and since there is no concept of swap as there is on the CPU, the size of the GPU
memory imposes a hard limit on the number of atoms supported in a simulation. Early in the
mdout file you will find information on the GPU being used and an estimate of the amount of
GPU and CPU memory required:
|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla C2070
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|--------------------------------------------------------
...
| GPU memory information:
| KB of GPU memory in use: 4638979
| KB of CPU memory in use: 790531
The reported GPU memory usage is likely an underestimate and is meant only as guidance, to
give you an idea of how close you are to the GPU's memory limit. Just because it is less
than the available Device Global Mem Size does not necessarily mean the simulation will run.
You should also be aware that the GPU's available memory is reduced by 1/9th if you have ECC
turned on.
Memory usage is affected by the run parameters, in particular the size of the cutoff (larger
cutoffs need more memory) and the ensemble being used. Additionally, the physical GPU
hardware affects memory usage, since the optimizations used differ between GPU types.
Typically, for PME runs, memory usage follows:
- NPT > NVT > NVE
- NTT = 3 > NTT = 1 or NTT = 2 > NTT = 0
- Barostat = 1 > Barostat = 2
Other than size, the density of your system and the non-bonded cutoff are perhaps the most
important factors in memory usage. They drive the amount of space allocated for pair list
building, and can be thought of as two sides of the same coin. A very dense system with a
short cutoff packs as many particles within the cutoff sphere as a lower density system with
a long cutoff, and a dense system with a long cutoff is double trouble. The following table
provides an approximate UPPER BOUND to the number of atoms supported as a function of GPU
model. These numbers were estimated using boxes of TIP3P water (PME) and solvent caps of
TIP3P water (GB). These had lower than optimum densities, so the actual limits for dense
solvated proteins may be around 20% less than the numbers here. Nevertheless, these
should provide reasonable estimates to work from.
GPU Type                                                       | Memory  | Max Atom Count
GTX580                                                         | 3.0 GB  | 1,240,000
M2090 (ECC off)                                                | 6.0 GB  | 2,600,000
GTX680                                                         | 2.0 GB  | 920,000
K10 (ECC off) / GTX780 / GTX980                                | 4.0 GB  | 1,810,000
K20X (ECC off) / GTX-Titan / GTX-Titan-Black                   | 6.0 GB  | 2,600,000
K40 / K80 / M40 (ECC off) / GTX-Titan-X / GTX980TI / Titan-XP  | 12.0 GB | 3,520,000
GTX-1080                                                       | 8.0 GB  | 2,800,000
Statistics are only provided for periodic systems. In theory, a GB implicit solvent
simulation system has memory requirements that could exceed the card, but the systems would
have to be so large (hundreds of thousands to millions of atoms) that simulating them with
GB solvent (effort growing with the square of the system size) would be impractical for other
reasons.
Accuracy Considerations
On the current generation of GPUs, single precision arithmetic has substantially higher
throughput than double precision. This matters when trying to obtain good performance. The CPU
code in Amber has always used double precision throughout the calculation. While this full
double precision approach has been implemented in the GPU code (read on), it gives very poor
performance and so the default precision model used when running on GPUs is a combination of
single and fixed precision, termed hybrid precision (SPFP), that is discussed in further
detail in the following reference:
Scott Le Grand; Andreas W. Goetz; & Ross C. Walker* "SPFP: Speed without compromise - a
mixed precision model for GPU accelerated molecular dynamics simulations.", Comp. Phys.
Comm, 2013, 184, pp374-380, DOI: 10.1016/j.cpc.2012.09.022.
This approach uses single precision for individual calculations within the simulation but
fixed scaled integer precision for all accumulations. It also uses fixed precision for
SHAKE calculations and for other parts of the code where loss of precision was deemed to be
unacceptable. Tests have shown that energy conservation is equivalent to the full double
precision code and specific ensemble properties, such as order parameters, match the full
double precision CPU code. Previous acceleration approaches, such as the MDGRAPE-accelerated sander, have used similar hybrid precision models and thus we believe that this is a
reasonable compromise between accuracy and performance. The user should understand that this
approach leads to rapid divergence between GPU and CPU simulations. The CPU code's
simulations will diverge from one another if run in parallel, because they use floating
point accumulation (albeit 64-bit) with no control over the order in which different threads
add to each sum. The GPU code, in contrast, is tremendously parallel, but the integer
fixed-precision accumulation ensures that its simulations are self-consistent. The
divergence is cosmetic in the end: any statistical mechanical properties should converge to
the same values.
Another precision mode, Double-Precision calculation / Fixed Precision accumulation
(DPFP), is also built at compile time. This mode approximates the CPU code, with the
added benefit of reproducible results thanks to the integer accumulation. It would be the
model of choice if it were not three to four times slower on the expensive Tesla cards
(e.g., GP100) and ten times slower on GeForce cards (e.g., GTX-1080Ti). It primarily exists for
testing and debugging purposes.
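The precision model is selected by which executable you run rather than by an mdin flag.
Assuming a standard Amber installation that provides pmemd.cuda_DPFP alongside the default
pmemd.cuda (the SPFP build), a validation comparison might look like the following, with
placeholder file names:
SPFP (default): pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md_spfp.out -r md_spfp.rst
DPFP (testing): pmemd.cuda_DPFP -O -i md.in -p prmtop -c inpcrd -o md_dpfp.out -r md_dpfp.rst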
We recommend using the CPU code for energy minimization
One limitation of either GPU precision model is that forces can be truncated or, in the
worst case, overflow the fixed precision representation. This should never be a problem
during MD simulations for any well behaved system. However, for minimization or very early
in the heating phase it can present a problem. This is especially true if two atoms are
close to each other and thus have large VDW repulsions. We recommend using the CPU version
of the code for this short process. Only in situations where the structure is guaranteed to
be reasonable, for example if it was a snapshot from dynamics, should you use the GPU code
for minimization.
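In practice this means a simple two-stage workflow: minimize (and, if the starting structure
is poor, begin heating) with the CPU code, then restart on the GPU for production. A sketch,
with placeholder file names:
Minimization (CPU): pmemd -O -i min.in -p prmtop -c inpcrd -o min.out -r min.rst
Production MD (GPU): pmemd.cuda -O -i md.in -p prmtop -c min.rst -o md.out -r md.rst -x md.nc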