April 2018: PMEMD midpoint is now part of the Amber 18 release
Oct 2017: PMEMD midpoint beta released.
Background

About the new Midpoint method implementation in Amber 18

The Midpoint method for molecular dynamics was developed by researchers at IBM as part of their BlueMatter project for BlueGene, and was subsequently expanded upon and described publicly by researchers at D. E. Shaw Research, LLC (see this 2006 paper published in The Journal of Chemical Physics). The Midpoint method is based on domain decomposition and provides an efficient approach to significantly reduce “data distribution” time as the node count increases. The San Diego Supercomputer Center (SDSC) at UC San Diego, in collaboration with Intel® Corporation, has developed a prototype implementation of the Midpoint method within the Amber PMEMD software. This implementation has been thoroughly tested with three key Amber benchmarks. The midpoint implementation will continue to be refined, and its performance improved, over the coming months through patches to Amber 18.

Motivation for re-architecting the Amber 16 PMEMD CPU code

The aim of re-architecting Amber 16's PMEMD code is to continue to improve CPU performance and cluster scalability. For example, the following figures show that the current atom decomposition approach in Amber 16 does not scale well for the Cellulose and STMV benchmarks (roughly 400K and 1 million atoms, respectively) beyond 8 nodes, each containing a single Intel® Xeon Phi processor 7250 (Knights Landing). It was also observed that the data distribution time grew exponentially with increasing node count.

Figure 1: Scaling of the existing Amber 16 PMEMD code on Intel® Xeon Phi 7250 nodes connected with Intel Omni-Path.
Authorship & Support

PMEMD Midpoint Implementations:
Citing the Midpoint Code

If you make use of any of the Amber midpoint code in your work, please include the following citations (in addition to the standard Amber citation):
In addition to the traditional full double precision (DPDP) implementation used in the PMEMD CPU code, the Amber 18 midpoint code also introduces a mixed-precision model, first pioneered in the Amber GPU implementation, termed SPDP. This precision model uses single precision for each particle-particle interaction but sums the resulting forces into double precision accumulators. It was established as sufficient for accurate MD simulations and shown to conserve energy in the original Amber GPU publications, and it has been tested to the same degree of tolerance as part of this new CPU optimization. The benefit of mixed precision is improved performance, both serial and parallel, since all modern Intel CPUs can perform two single precision floating point operations for the cost of one double precision floating point operation. The figures and table below provide validation data for Amber 18 DPDP and SPDP, with Amber 16 DPDP as the reference.
Figure 1: Cellulose
Figure 2: DHFR
Figure 3: FactorIX
|               | DHFR    | Factor IX | Cellulose | STMV    |
| Max deviation |         |           |           |         |
| A18 SPDP      | 8.6E-04 | 1.9E-03   | 3.8E-03   | 4.0E-03 |
| A18 DPDP      | 5.0E-08 | 4.7E-07   | 4.9E-07   | 5.0E-07 |
| RMS deviation |         |           |           |         |
| A18 SPDP      | 5.4E-05 | 1.1E-04   | 1.4E-04   | 1.9E-04 |
| A18 DPDP      | 1.9E-08 | 2.0E-08   | 2.0E-08   | 2.0E-08 |
Benchmark Downloads: STMV | Cellulose | Poliovirus
The new midpoint-based parallel version of PMEMD in the Amber 18 beta is now faster and scales better than PMEMD from the released Amber 16 version. The following figures show that the Amber 18 beta is consistently faster than Amber 16 on the primary benchmarks (Cellulose and STMV) on current Intel processors [Intel® Xeon® Gold 6148 (Skylake) and Intel® Xeon Phi 7250 (Knights Landing)]. At larger node counts, the performance gap between the Amber 18 beta and Amber 16 increases, as the former has better scaling efficiency. For example, the figure below shows that the Amber 18 beta is 2.2x faster than the released Amber 16 code when running the STMV benchmark on a single Intel® Xeon Phi processor 7250 node. Moreover, the Amber 18 beta provides a 2.5x speedup over Amber 16 on 8 nodes of the Intel® Xeon Phi processor 7250 and a 3.1x speedup on 32 nodes of the Intel® Xeon® Gold 6148 processor for the STMV benchmark. In addition to the traditional Cellulose and STMV benchmarks, we have also introduced a larger, 4 million atom benchmark based on the polio virus. This represents a large simulation and is thus a good stress test of the midpoint decomposition approach.
Hardware configuration: The Intel® Xeon Phi 7250 runs at 1.4 GHz with 96 GB of DRAM and 16 GB of MCDRAM in Quadrant/Cache mode. The Intel® Xeon® Gold 6148 processor runs at 2.4 GHz with 192 GB of DRAM. The multi-node results use the Intel Omni-Path (OPA) fabric on the cluster. The compilers used are Intel compiler version 2017 update 2 and Intel MPI version 5.1.3.
The midpoint implementation in Amber 18 does not yet support the full PMEMD functionality. We plan to add critical missing functionality in the coming months. The midpoint implementation in Amber 18 currently supports: NVE, NVT (Langevin thermostat), and SHAKE.
The following recommendations apply when building and running the midpoint code.
We advise the user to configure, compile, and run on the same computer. If you build on an Intel® Xeon® processor E7-4850 v2 (formerly Ivy Bridge), you will get AVX instructions. If you build on an Intel® Xeon® processor E7-4850 v4 (formerly Broadwell), you will get AVX2 instructions. If you build on an Intel® Xeon Phi processor 7250, you will get AVX-512 instructions. Binaries compiled on the Intel® Xeon Phi processor 7250 will not run on Haswell or earlier generation processors. Currently, Intel compiler version 2017 update 2 and GCC version 6.2.0 are supported and have been extensively tested with the Amber 18 midpoint implementation.
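For example, one way to check which AVX instruction-set extensions a given machine supports (and therefore which machines a binary built there will run on) is to inspect the CPU flags. The command below is a minimal sketch assuming a Linux system that exposes /proc/cpuinfo:

# List the AVX-family CPU flags reported on this (Linux) machine.
# Build and run pmemd.MPI only on machines that report the same flags.
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u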
Enabling the Midpoint optimizations at runtime:
The user can choose at runtime to run the PMEMD simulation with the midpoint implementation by setting "midpoint=1" in the &cntrl namelist, provided that all of the features used are supported by the midpoint implementation. The original Amber 16 code path is selected by setting "midpoint=0" or by omitting the parameter.
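For example, a minimal NVT input file using only the features listed above (Langevin thermostat and SHAKE) might look like the sketch below; midpoint=1 is the only midpoint-specific setting, and the remaining parameter values are illustrative choices rather than required values:

# Write an example mdin that enables the midpoint code path.
# Only midpoint=1 is midpoint-specific; the other values are illustrative.
cat > mdin << 'EOF'
 Midpoint NVT example (Langevin thermostat, SHAKE)
 &cntrl
   imin=0, nstlim=1000, dt=0.002,
   ntt=3, gamma_ln=1.0, temp0=300.0,
   ntc=2, ntf=2, ntb=1,
   ntpr=100,
   midpoint=1,
 /
EOF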
Optimized code with mixed-precision (SPDP):
./configure -intelmpi -openmp -midpoint_spdp intel
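After configuring, build as usual from the top of the Amber source tree (a sketch assuming AMBERHOME points at the source tree; the same build step applies to the DPDP and unoptimized configurations below):

cd $AMBERHOME
make install    # builds bin/pmemd.MPI with the selected midpoint configuration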
export KMP_BLOCKTIME=0        # threads sleep right away when idle (e.g., on barriers)
export KMP_STACKSIZE=200M     # allocate 200 MB for each OpenMP private stack
export I_MPI_PIN_DOMAIN=core  # restrict the threads of each MPI rank to one physical core
export OMP_NUM_THREADS=4      # 4 for Xeon Phi, 2 for Xeon
mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd
# note: {cores} is the number of physical cores, not hyperthreads
Optimized code with full double precision (DPDP):
./configure -intelmpi -openmp intel
export KMP_BLOCKTIME=0        # threads sleep right away when idle (e.g., on barriers)
export KMP_STACKSIZE=200M     # allocate 200 MB for each OpenMP private stack
export I_MPI_PIN_DOMAIN=core  # restrict the threads of each MPI rank to one physical core
export OMP_NUM_THREADS=4      # 4 for Xeon Phi, 2 for Xeon
mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd
# note: {cores} is the number of physical cores, not hyperthreads
Without hardware-specific or OpenMP optimizations:
./configure -intelmpi intel
mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd
# note: {cores} is the number of physical cores, not hyperthreads
The following provides some additional resources that you may find useful.