Apr 2018: PMEMD Midpoint is now part of Amber 18 release.
Oct 2017: Beta: Beta release for midpoint version of pmemd (pmemd.midpoint).
Jun 2016: update.3 (Amber) & update.5 (AmberTools): Support for Intel Hardware
for improved performance for PMEMD on Intel Xeon (Broadwell) & Intel Xeon Phi (KNL).
Aug 2015:
Intel and Exxact jointly announce the Certified Xeon Phi Life
Sciences Solutions program.
Jun 2015: Ross Walker presents "Amber - The How, What, and Why on Intel Xeon Phi" at ISC 2015 (video available online).
Dec 2014: update.9: Enhancements have been made
to the MIC offload implementation for improved performance.
July 2014: update.5: Adds MIC offload support to
pmemd. This update enables use of Intel Xeon Phi coprocessor cards in
offload mode where the card acts as an accelerator for the direct space
sum.
Apr 2014: AMBER 14 released offering the first wave of Intel Xeon Phi support in the form of native mode Amber PMEMD simulations.
Background

This page provides background on running AMBER 16 (PMEMD) on Intel hardware. One of the new features of AMBER, introduced with version 14, is the ability to use the Intel® Xeon Phi™ processor family with PMEMD for both explicit solvent PME and implicit solvent GB simulations in MIC native mode, which runs the full simulation on the Intel Xeon Phi coprocessor and can be used with or without MPI. This work was further updated with the release of AMBER 16 (update.5 for AmberTools and update.3 for AMBER) to include Intel-specific optimizations for recent Intel hardware (Broadwell and newer) and support for the Xeon Phi processor (KNL), with more efficient code, better vectorization, and OpenMP support.

These versions of the PMEMD engine are considered experimental and there is no guarantee of any performance improvement. However, this support should produce results directly comparable to the CPU implementation due to the floating-point consistency of Intel's processors; any differences in the tests will be a consequence of floating-point rounding differences in hardware. Support for Intel Xeon Phi architectures in PMEMD is an ongoing project, so frequent updates are likely and improved performance can be expected in upcoming patches. It is advised that you consult the Amber reference manual and the remainder of this page before running any simulations. Feedback and questions should be posted to the Amber mailing list.

Authorship & Support

Intel Xeon Phi Native & KNL Implementations:
Further information on the specifics of the KNC native and offload implementations is available from the following publications:
Funding for this work has been graciously provided by Intel in the form of engineering expertise and sponsorship of an Intel Parallel Computing Center directed by Prof. Ross Walker at the San Diego Supercomputer Center, and by an NSF SI2-SSE grant to Ross Walker (NSF1148276) and Adrian Roitberg (NSF1147910).

Citing the Xeon Phi Code

If you make use of any of this Xeon Phi support in your work, please include the following citations (in addition to the standard AMBER 14 citation):
The supported Intel hardware includes:

Intel® Xeon™ and other Intel® CPUs
Intel® Xeon Phi™ processors (Knights Landing - KNL)
Intel® Xeon Phi™ coprocessors (Knights Corner - KNC)
Before attempting to build these versions you should have built and tested the serial and parallel CPU versions of Amber (pmemd and pmemd.MPI) with the Intel compiler suite and Intel MPI. This helps ensure that basic issues relating to standard compilation on your hardware and operating system are not confused with coprocessor-related compilation and testing problems. You should also be familiar with Amber's compilation procedures.
The following sections provide details of the best options for compiling the latest AMBER 18 code with Intel and GNU compilers on Intel hardware. They also describe how to compile AMBER 18 for the Intel Xeon Phi™ (Knights Corner) coprocessor as well as its successor, the Intel Xeon Phi™ (Knights Landing) processor. You should only build executables for the hardware you have available; for example, there is no gain in using the KNC offload code on a KNL processor, or the KNL-compiled code on a regular Xeon processor. In the case of the KNC coprocessor it is recommended that you are familiar with building and running simple code in native/offload mode on an Intel Xeon Phi™ KNC coprocessor, as described in the MIC developer zone.
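If you are unsure which executables are appropriate for a given machine, a quick check of the instruction sets the host reports can help. This is only a rough guide and assumes a Linux host; the feature strings below are the standard /proc/cpuinfo flags:

# Check which SIMD instruction sets the host CPU reports (Linux)
#   avx2                -> Broadwell-class Xeon: the standard Intel build is appropriate
#   avx512f + avx512er  -> Xeon Phi (KNL): consider the -mic2 build described below
grep -m1 -o -w -e avx2 -e avx512f -e avx512er /proc/cpuinfo || echo "no AVX2/AVX-512 support reported"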
Building General Intel PMEMD Model
The standard PMEMD executables provide general Intel Xeon™ support that, as of update.3 to AMBER 16, offers improved performance on the latest generation of Intel hardware. It does this through improved vectorization and the addition of OpenMP instructions. This somewhat changes the method by which one runs parallel PMEMD jobs to obtain maximum performance, so you are encouraged to read both this section and the later section on running simulations on Intel CPUs even if you are using standard Xeon chips.
Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build
pmemd as follows:
cd $AMBERHOME
bashrc: export MKL_HOME=path-to-directory [optional]
make clean
./configure intel
make install

Alternatively, GNU compilers can still be used:

cd $AMBERHOME
make clean
./configure gnu
make install
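A minimal sanity check at this point, assuming the standard Amber environment script and test targets generated by the installation, is to load the Amber environment and run the serial test suite before moving on to the parallel builds:

# Load the environment created by "make install"
source $AMBERHOME/amber.sh
# Run the serial test suite; a summary of passes/failures is printed at the end
cd $AMBERHOME
make test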
pmemd.MPI:
PMEMD parallel for Intel Xeon™ can be built with or without hybrid MPI/OpenMP support. Building with just MPI is effectively backwards compatible with the way in which parallel jobs were run in previous versions of AMBER. Remaining with pure MPI is not expected to negatively impact performance compared with previous versions of AMBER and is therefore a reasonable choice when full backwards compatibility with existing run scripts, etc., is desired. Pure MPI parallel support can be built using Intel MPI (mpiicc/mpiifort) from the Intel Parallel Studio XE product through the use of an MPI flag (-intelmpi) introduced with the release of AMBER 14. You can build pmemd.MPI as follows:
cd $AMBERHOME
bashrc: export MKL_HOME=path-to-directory [optional but recommended]
make clean
./configure -intelmpi intel
make install

Alternatively, GNU compilers and your choice of MPI (OpenMPI, MPICH, MVAPICH, etc.) can be used:

cd $AMBERHOME
make clean
./configure -mpi gnu
make install

As of Amber 16 PMEMD update.3, hybrid MPI/OpenMP compilation is possible and can be built (only with Intel compilers) to take advantage of OpenMP threading to potentially increase performance. The executable is still named pmemd.MPI and supports running both with and without OpenMP threading depending on the setting of the OMP_NUM_THREADS environment variable, although runs with OMP_NUM_THREADS=1 might be marginally slower than a build without OpenMP support. (See the running section below before using this executable.)

cd $AMBERHOME
bashrc: export MKL_HOME=path-to-directory
make clean
./configure -openmp -intelmpi intel
make install
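If configure fails to find the expected compilers or MPI wrappers, it is usually because the Intel Parallel Studio environment has not been loaded in the current shell. A quick toolchain check (the echo message is only an illustrative hint, not Amber output):

# Confirm the Intel compilers and Intel MPI wrappers are on the PATH before configuring
which ifort mpiifort mpiicc || \
    echo "Intel toolchain not found: source compilervars.sh and mpivars.sh from Parallel Studio"
mpiifort -v    # reports the Intel MPI wrapper and underlying ifort version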
Building the KNL Xeon Phi PMEMD Model
The Intel Xeon Phi™ support (added in update.3) adds a KNL-optimized pmemd.MPI (which will also run on regular Xeon processors, but with potential performance regressions).
pmemd.MPI:
PMEMD parallel for Intel Xeon Phi™ can be built using the MPI compilers (mpiicc/mpiifort) from the Intel Parallel Studio XE product only. It is supported in Amber 18 through the use of a new MIC2 flag (-mic2) that introduces Intel-specific optimizations, with optional (experimental) mixed-precision support (-mic2_SPDP). Build pmemd.MPI as follows:
cd $AMBERHOME
bashrc: export MKL_HOME=path-to-directory
make clean
./configure -intelmpi -openmp -mic2 intel
or
./configure -intelmpi -openmp -mic2_SPDP intel [caution: experimental at this time]
make install
Testing the KNL Xeon Phi PMEMD Model
You can run the test suite as follows:
export DO_PARALLEL="mpirun -np 2"
export OMP_NUM_THREADS=2
make test.mic2
The majority of these tests should pass. Differences that occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or cases where you are unsure, should be posted to the Amber mailing list for comment.
Building the MIC Native PMEMD Model
The MIC native version supporting the first-generation KNC Intel Xeon Phi™ coprocessors is called pmemd.mic_native.
pmemd.mic_native:
Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build pmemd.mic_native as follows:
cd $AMBERHOME
make clean
./configure -mic_native intel
make install
pmemd.mic_native.MPI:
PMEMD parallel for KNC Intel Xeon Phi™ (MIC) coprocessors can only be built using the MPI compilers (mpiicc/mpiifort) from the Intel Cluster Studio XE product, which is supported from Amber 14 onwards through the use of the MPI flag (-intelmpi). Build pmemd.mic_native.MPI as follows:
cd $AMBERHOME
make clean
./configure -mic_native -intelmpi intel
make install
It is possible to run across multiple Intel Xeon processors and KNC Intel Xeon Phi™ coprocessors, even on a cluster, with this implementation. However, it is a functional implementation and is not performance-optimized at this time. Detailing how to run this way is beyond the scope of these instructions; to see how to do it with MPI applications in general, see Intel's web instructions.
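As a rough illustration only (not a tuned or officially supported configuration), a run spread across two coprocessors with Intel MPI might look like the following; the card names (mic0, mic1), rank counts and /tmp paths are assumptions about a typical MPSS setup, and all binaries, libraries and input files must already be present on each card as described in the running section below:

# Hypothetical launch across two KNC cards using Intel MPI's MPMD syntax
export I_MPI_MIC=enable
mpirun -n 30 -host mic0 /tmp/pmemd.mic_native.MPI -O -i mdin -o mdout -p prmtop -c inpcrd \
     : -n 30 -host mic1 /tmp/pmemd.mic_native.MPI -O -i mdin -o mdout -p prmtop -c inpcrd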
At present it is not possible to run the test suite in an automated fashion in native mode.
Building the MIC Offload PMEMD Model
The KNC Intel Xeon Phi™ (MIC) offload version supporting the first generation (KNC) of Intel Xeon Phi™ coprocessors is called pmemd.mic_offload.MPI and must be built separately from the standard parallel installation. MIC offload is not available in serial. If you are not planning to run on a first-generation KNC Intel Xeon Phi™ coprocessor you do not need to build either the native or the offload executables.
pmemd.mic_offload.MPI
Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build pmemd.mic_offload.MPI as follows:
cd $AMBERHOME
make clean
./configure -mic_offload intel
make install
There is no need to specify the -intelmpi flag as this is the default behavior of the configure script.
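It is also worth confirming that the MPSS service is running and the coprocessor is online before attempting any offload runs. The micctrl and micinfo utilities ship with Intel MPSS, although their exact output varies between MPSS versions:

# Confirm the coprocessor(s) are booted and reachable
micctrl -s          # should report mic0 (and any other cards) as "online"
micinfo | head      # basic card information (driver version, memory, core count)
ssh mic0 uptime     # verify passwordless ssh to the card works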
Testing the MIC Offload PMEMD Model
You can run the test suite using the MIC coprocessor with:
make test.mic_offload
The majority of these tests should pass. However, given the highly parallel nature of the KNC MIC coprocessor, where the order of operations is not well defined, it is not uncommon for several 'possible FAILURES' to be reported. You should inspect the .dif files created in the $AMBERHOME/logs/test_amber_mic_offload/ directory to manually verify any 'possible FAILURES'. Differences that occur on only a few lines and are minor in nature can be safely ignored. Any large differences, or cases where you are unsure, should be posted to the Amber mailing list for comment.
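A convenient way to review these diagnostics, using the log directory named above, is to locate and page through any .dif files the test run produced:

# Locate and page through the test differences produced by the offload test run;
# small numerical differences confined to a few lines are expected
find $AMBERHOME/logs/test_amber_mic_offload -name "*.dif" -exec less {} +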
Running Simulations on Intel Xeon Processors

These instructions are intended to help with optimizing performance on Intel Xeon™ processors.

For MPI-only PME, IPS or GB runs:

mpirun -np {mpi ranks} \
  $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd

For MPI+OpenMP hybrid GB runs:

export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
[these are important to ensure OpenMP threads are properly distributed between MPI tasks]
mpirun -np {sockets} -env OMP_NUM_THREADS={processors per socket} \
  $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd

For MPI+OpenMP hybrid PME or IPS runs:

export KMP_BLOCKTIME=0
export KMP_STACKSIZE=200M
export I_MPI_PIN_DOMAIN=core
mpirun -np {mpi ranks} -env OMP_NUM_THREADS=2 \
  $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd
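To fill in the {sockets} and {processors per socket} placeholders above, the socket and core counts reported by the operating system can be used; for example, on a Linux host with lscpu available:

# Report sockets, cores per socket and threads per core for this node
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)'
# e.g. a dual-socket, 16-core-per-socket node would use -np 2 with OMP_NUM_THREADS=16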
Running Simulations on Intel Xeon Phi (KNL) Processors

In order to run a simulation on a second-generation (KNL) Intel Xeon Phi™ processor, it is advised that you read the Intel Xeon Phi™ Processor Developer's Quick Start Guide [to be published shortly] in addition to the instructions given here. These instructions are intended to provide guidance on maximizing the performance of the KNL system. It should be noted that at this time the following are NOT supported on KNL Xeon Phi™ processors: TI, EMIL, force switch, IPS and Lennard-Jones 12-6-4.

For GB workloads:

export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto:compact
mpirun -np 4 -env OMP_NUM_THREADS={OMP Threads} -env KMP_AFFINITY=compact \
  -env KMP_STACKSIZE=10M $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd

For PME workloads:

export KMP_STACKSIZE=200M
export KMP_BLOCKTIME=0
export I_MPI_PIN_DOMAIN=core
mpirun -np {MPI ranks} -env OMP_NUM_THREADS={OMP Threads} \
  $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd
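On KNL nodes configured in flat memory mode it may also help to bind allocations to MCDRAM. The numactl invocation below is a common approach but is an assumption about your node configuration; in cache memory mode no extra step is needed:

# In flat mode MCDRAM usually appears as a separate, CPU-less NUMA node (often node 1)
numactl -H
# Prefer MCDRAM for allocations, falling back to DDR if it fills
mpirun -np {MPI ranks} -env OMP_NUM_THREADS={OMP Threads} \
  numactl --preferred=1 $AMBERHOME/bin/pmemd.MPI \
  -O -i mdin -o mdout -p prmtop -c inpcrd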
Running MIC Native Simulations

In order to run a simulation on the first-generation KNC Intel Xeon Phi™ coprocessor, it is advised that you read the Intel Xeon Phi™ Coprocessor Developer's Quick Start Guide in addition to the instructions given here. This guide includes a description of the Intel Manycore Platform Software Stack (Intel MPSS), which enables the wide range of usage models that Intel Xeon Phi™ coprocessors support. Running a simulation on the coprocessor in native mode requires that all files and binaries be visible to the coprocessor. Either mount your file system on the coprocessor (requires root access) or explicitly transfer the binaries, libraries and input files to the /tmp directory on the coprocessor.
Note: Mounting requires the Amber directory plus any additional libraries used in an Amber simulation to be visible to the coprocessor.
Running Simulations with a Mounted Filesystem
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INTEL_COMPILER_HOME/lib/mic/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/mic/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/mic/lib/
export PATH=$PATH:$MPI_HOME/mic/bin
ssh mic0 "source source_knc.sh; \
$AMBERHOME/bin/pmemd.mic_native -O -i mdin -o mdout -p prmtop -c inpcrd"
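The source_knc.sh script referenced above is not part of the Amber distribution; it is assumed here to be a small helper sourced on the coprocessor to recreate the environment set up by the host-side exports, along the lines of:

# source_knc.sh - hypothetical helper sourced on the coprocessor
# (paths mirror the host-side exports above; adjust them to your installation;
#  INTEL_COMPILER_HOME, MKL_HOME and MPI_HOME must also be defined here or hard-coded)
export AMBERHOME=/path/to/amber
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INTEL_COMPILER_HOME/lib/mic/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/mic/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/mic/lib/
export PATH=$PATH:$MPI_HOME/mic/bin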
Running Simulations without a Mounted Filesystem
scp -r $INTEL_MPI_HOME/mic/lib/* mic0:/tmp/
scp -r $INTEL_MPI_HOME/lib/mic/* mic0:/tmp/
scp -r $AMBERHOME/bin/pmemd.mic_native mic0:/tmp/
scp -r working_directory/* mic0:/tmp/
ssh mic0 "chmod -R 777 /tmp/*"
ssh mic0 "export LD_LIBRARY_PATH=/tmp/; cd /tmp; \
./pmemd.mic_native -O -i mdin -o mdout -p prmtop -c inpcrd"
Running MIC Offload Simulations

Unlike MIC native mode, once PMEMD has been configured with the -mic_offload flag and compiled, no additional steps are required to run pmemd.mic_offload.MPI. Work is automatically offloaded to the Intel KNC MIC architecture.
Execute the following command on the host to run a simulation in MIC offload mode:
mpirun -np 8 $AMBERHOME/bin/pmemd.mic_offload.MPI -O
Note: Choose the number of MPI processes to suit the specifications of the host CPU (e.g. 8 MPI processes for an 8-core Intel Xeon E5-2680 processor) in order to achieve optimum performance. In the initial MIC offload support in PMEMD, the amount of work offloaded to the coprocessor increases and settles to a stable value over multiple time steps, governed by the Amber load balancer. It is therefore recommended that the simulation run for at least 200 time steps to benefit from the coprocessor.
The KNC MIC offload code uses OpenMP (OMP) threads to distribute the offloaded work across the MIC coprocessor cores. The default number of OMP threads per offloading MPI process is 30; however, this can be overridden by replacing the above execution command with the advanced execution command in the following example runscript:
Run.mic_offload
#!/bin/bash
export MIC_ENV_PREFIX=PHI
export OMP_NUM_THREADS=1
mpirun -n 11 ./pmemd.mic_offload.MPI -O \
: -n 1 -env PHI_KMP_PLACE_THREADS 30c,4t,0O \
-env PHI_KMP_AFFINITY scatter \
-env PHI_OMP_NUM_THREADS 30 \
-env MIC_OMP_STACKSIZE 4M \
./pmemd.mic_offload.MPI -O \
: -n 1 -env PHI_KMP_PLACE_THREADS 30c,4t,30O \
-env PHI_KMP_AFFINITY scatter \
-env PHI_OMP_NUM_THREADS 30 \
-env MIC_OMP_STACKSIZE 4M \
./pmemd.mic_offload.MPI -O \
: -n 11 ./pmemd.mic_offload.MPI -O
"MIC_ENV_PREFIX=PHI" states that any environment variable prefixed with PHI will
be applicable to the KNC MIC coprocessor environment only (not the host processor environment).
"OMP_NUM_THREADS=1" states that the number of host OMP threads is 1.
In the MIC offload version of PMEMD only the middle two MPI processes are responsible for offloading work to the MIC coprocessor; for example, if 8 MPI processes are specified, processes 4 and 5 are responsible for offloading to the MIC coprocessor. These two MPI processes simultaneously spawn OMP threads on the MIC coprocessor to execute the offloaded chunks of work. By partitioning the execution command to reflect this decomposition strategy, the number of OMP threads can be set manually. Partitioning of an MPI execution command is done via the use of ":", as demonstrated in the example runscript above.
In this example runscript, for a 24-core Intel CPU node augmented with a 61-core MIC coprocessor, 24 MPI processes are requested on a single node. The first 11 and the last 11 MPI processes execute on the host CPU cores. The middle two MPI processes (12 and 13) offload to a single MIC coprocessor and each spawn 30 OMP threads. The cores of the MIC coprocessor are divided in two so that each offloading MPI process is assigned half the cores of the MIC coprocessor: in the above example, MPI process 12 spawns OMP threads on cores 1-30 whereas MPI process 13 spawns OMP threads on cores 31-60.
"-env PHI_KMP_PLACE_THREADS 30c,4t,0O" states that the MPI process will use 30 cores of the MIC coprocessor (30C), may use as many as 4 threads per core (4T), and the first core that is being used has an offset of 0 (0O). Please consult Intel's thread affinity documentation for a more detailed explanation.
"-env PHI_KMP_AFFINITY scatter" states that the OpenMP threads will be mapped to the hardware threads in a scattered fashion. Please consult Intel's compiler documentation for other options.
"-env PHI_OMP_NUM_THREADS 30" states that the MPI process uses 30 OpenMP threads.
It may benefit performance to adjust the number of OMP threads spawned by each offloading MPI process depending on the system being simulated, for example:

-env PHI_OMP_NUM_THREADS 60

For larger simulations (>200,000 atoms), spawning more OMP threads on the MIC coprocessor, such as 2 threads per core (a maximum of 4 threads per core is permitted), provides better performance. However, more OMP threads require a larger stack size per thread and a larger total stack size on the MIC coprocessor. The default stack size is 8 KB.
"-env MIC_OMP_STACKSIZE 4M" increases the OMP thread stack size to 4 MB.
For MPSS version 3.2.3 and later, the total stack size on the MIC coprocessor can be increased with the following steps (run as root on the host):

su
cd /var/mpss/common
mkdir -p etc/security
echo "* soft stack unlimited" > etc/security/limits.conf
[add the following entries to the MPSS filelist so the new limits.conf is included in the coprocessor image]
dir /etc/security 755 0 0
file /etc/security/limits.conf etc/security/limits.conf 644 0 0
service mpss restart
In order to simplify the selection of hardware for AMBER simulations on Xeon, KNC Xeon Phi and KNL Xeon Phi hardware, we have teamed up with Exxact Corporation to offer preinstalled AMBER Certified computing solutions. This includes a Xeon Phi Life Sciences Certified Solutions Program developed jointly by Exxact, Intel and the lab of Prof. Ross Walker. These systems can be ordered with AMBER 16 preinstalled (AMBER license required) and come with a full 3-year warranty. For details and to customize systems, please contact either Ross Walker (ross_at_rosswalker.co.uk) or Mike Chen (mike@exxactcorp.com).
Custom turnkey cluster solutions are also available. Please email Ross Walker (ross at rosswalker.co.uk) or Mike Chen (mike@exxactcorp.com) for details.
The following provides some additional resources that you may find useful.