AMBER 16 Intel® Hardware
SUPPORT



Background

This page provides background on running AMBER v16 (PMEMD) on Intel Hardware.

One of the features of AMBER, introduced with version 14, is the ability to use the Intel® Xeon Phi™ processor family with PMEMD for both explicit solvent PME and implicit solvent GB simulations in MIC Native mode, which runs the full simulation on the Intel Xeon Phi coprocessor. This can be used with or without MPI. The work was extended in AMBER v16, with the release of Update.5 (AT) and Update.3 (AMBER), to include Intel-specific optimizations for Intel hardware (Broadwell and newer) as well as Xeon Phi processor (KNL) support, with more efficient code, better vectorization, and OpenMP support.

These versions of the PMEMD engine are considered experimental and there is no guarantee of any performance improvement. However, this support should produce results directly comparable to the CPU implementation due to the floating-point consistency of Intel's processors. Any differences in tests will be a consequence of floating-point rounding differences in hardware. Support for Intel Xeon Phi architectures in PMEMD is an ongoing project, so frequent updates are likely and improved performance can be expected in upcoming patches. You are advised to consult the Amber Reference Manual and the remainder of this page before running any simulations.

Feedback and questions should be posted to the Amber mailing list.

Authorship & Support

Intel Xeon Phi Native & KNL Implementations:

Ashraf Bhuiyan (Intel)
Charles Lin (SDSC)
Perri Needham (SDSC)
Ross C. Walker (SDSC)*

*Corresponding author.

Further information relating to the specifics of the implementation is available from the following publications with regards to KNC Native & Offload Implementations:

  • Perri Needham; Ashraf Bhuiyan; & Ross C. Walker* "Extension of the AMBER Molecular Dynamics Software to Intel's Many Integrated Core (MIC) Architecture", Comp. Phys. Comm., Volume 201, Pages 95-105, April 2016
     
  • Ashraf Bhuiyan; Perri Needham; & Ross C. Walker* "Amber PME Molecular Dynamics Optimization", High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches, 2015, Vol. 2, Ch. 6. ISBN: 978-0128038192 (Purchase Here)

Funding for this work has been graciously provided by Intel in the form of engineering expertise and sponsorship of an Intel Parallel Computing Center directed by Prof. Ross Walker at the San Diego Supercomputer Center, and by an NSF SI2-SSE grant to Ross Walker (NSF1148276) and Adrian Roitberg (NSF1147910).

Citing the Xeon Phi Code

If you make use of any of this Xeon Phi support in your work please include the following citations (in addition to the standard AMBER 14 citation):

  • Perri Needham; Ashraf Bhuiyan; & Ross C. Walker* "Extension of the AMBER Molecular Dynamics Software to Intel's Many Integrated Core (MIC) Architecture", Comp. Phys. Comm., Volume 201, Pages 95-105, April 2016


Supported Xeon and Xeon Phi Architectures

The supported Intel Xeon Phi™ product family includes:

Intel® Xeon™ and other Intel® CPUs

Intel® Xeon Phi™ Processor (Knights Landing - KNL)

  • Details coming soon.

    Intel® Xeon Phi™ coprocessors (Knights Corner)


    Installation and Testing

    Before attempting to build these versions you should have built and tested the serial and parallel CPU versions of Amber (pmemd and pmemd.MPI) with the Intel compiler suite and Intel MPI. This will help to ensure that basic issues relating to standard compilation on your hardware and operating system do not lead to confusion with coprocessor related compilation and testing problems. You should also be familiar with Amber's compilation procedures.

    The following section provides details on the best options for compiling the latest AMBER 16 code with Intel and GNU compilers on Intel hardware. It also provides details of how to compile AMBER 16 for the Intel Xeon Phi™ (Knights Corner) coprocessor as well as its upcoming replacement, the Intel Xeon Phi™ (Knights Landing) processor. You should only build executables for the hardware you have available. There is no gain, for example, in using the KNC offload code on a KNL processor or the KNL compiled code on a regular Xeon processor. In the case of the KNC coprocessor it is recommended that you are familiar with building and running simple code in native/offload mode on an Intel Xeon Phi™ KNC coprocessor, which is described in the MIC developer zone.

    General Intel PMEMD Model

  • Building General Intel PMEMD Model

    The standard PMEMD executables provide general Intel Xeon™ support that, as of update.3 to AMBER 16, offers improved performance on the latest generation of Intel hardware. It does this through improved vectorization and the addition of OpenMP instructions. This somewhat changes the method by which one runs parallel PMEMD jobs to obtain maximum performance, so you are encouraged to read both this section and the later section on running simulations on Intel CPUs even if you are using standard Xeon chips.

    Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build pmemd as follows:

    cd $AMBERHOME
    export MKL_HOME=path-to-directory   # optional; add to your bashrc
    make clean
    ./configure intel
    make install

    Alternatively, GNU compilers can still be used:

    cd $AMBERHOME
    make clean
    ./configure gnu   
    make install

    pmemd.MPI:

    PMEMD parallel for Intel Xeon™ can be built with or without hybrid MPI/OpenMP support. Building with just MPI is effectively backwards compatible with the way in which parallel jobs were run in previous versions of AMBER. Remaining with pure MPI is not expected to negatively impact performance compared with previous versions of AMBER and is thus a reasonable choice when full backwards compatibility with existing run scripts, etc., is desired. Pure MPI parallel support can be built using Intel MPI (mpiicc/mpiifort) from the Intel Parallel Studio XE product, through the MPI flag (-intelmpi) introduced with the release of AMBER 14. You can build pmemd.MPI as follows:

    cd $AMBERHOME
    export MKL_HOME=path-to-directory   # optional but recommended; add to your bashrc
    make clean
    ./configure -intelmpi intel   
    make install

    Alternatively GNU compilers and your pick of MPI (openmpi, mpich, mvapich etc) can be used:

    cd $AMBERHOME
    make clean
    ./configure -mpi gnu   
    make install

    As of Amber16 PMEMD update.3, hybrid MPI/OpenMP compilation is possible (with Intel compilers only) to take advantage of OpenMP threading and potentially increase performance. The executable is named pmemd.MPI and supports running both with and without OpenMP threading, depending on the setting of the OMP_NUM_THREADS environment variable, although runs with OMP_NUM_THREADS=1 might be marginally slower than with an executable compiled without OpenMP support. See the running section below before running with this executable:

    cd $AMBERHOME
    export MKL_HOME=path-to-directory   # add to your bashrc
    make clean
    ./configure -openmp -intelmpi intel
    make install

    Intel® Xeon Phi™ Processor Family (Knights Landing - KNL) MIC2 PMEMD Model

  • Building KNL Xeon Phi PMEMD model

    The Intel Xeon Phi™ support (added in update.3) adds a KNL-optimized pmemd.MPI (which will also run on regular Xeon processors, but with potential performance regressions).

    pmemd.MPI:

    PMEMD parallel for Intel Xeon Phi™ can be built only with Intel MPI (mpiicc/mpiifort) from the Intel Parallel Studio XE product. It is supported in Amber 16 through the MIC2 flag (-mic2), which introduces Intel-specific optimizations, and the optional (experimental) mixed precision flag (-mic2_SPDP). Build pmemd.MPI as follows:

    cd $AMBERHOME
    export MKL_HOME=path-to-directory   # add to your bashrc
    make clean
    ./configure -intelmpi -openmp -mic2 intel
    # or, for mixed precision (caution: experimental at this time):
    # ./configure -intelmpi -openmp -mic2_SPDP intel
    make install

  • Testing the KNL Xeon Phi PMEMD Model

    You can run the test suite as follows:

    export DO_PARALLEL="mpirun -np 2"
    export OMP_NUM_THREADS=2
    make test.mic2

    The majority of these tests should pass. Differences that occur on only a few lines and are minor in nature can be safely ignored. If you see large differences, or are unsure, post them to the Amber mailing list for comment.


    MIC Native PMEMD Model

  • Building the MIC Native PMEMD model

    The MIC native version supporting the first generation KNC Intel Xeon Phi™ coprocessors is called pmemd.mic_native (or pmemd.mic_native.MPI for running simulations in parallel on the coprocessor using MPI) and must be built separately from the standard serial and parallel installations. If you are not planning on running on a KNC Intel Xeon Phi™ coprocessor you do not need to build the native or the offload executables.

    pmemd.mic_native

    Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build pmemd.mic_native as follows:

    cd $AMBERHOME
    make clean
    ./configure -mic_native intel   
    make install

    pmemd.mic_native.MPI:

    PMEMD parallel for KNC Intel Xeon Phi™ (MIC) coprocessors can only be built using the MPI (mpiicc/mpiifort) from the Intel Cluster Studio XE product, which is supported from Amber 14 onwards through the use of the MPI flag (-intelmpi). Build pmemd.mic_native.MPI as follows:

    cd $AMBERHOME
    make clean
    ./configure -mic_native -intelmpi intel   
    make install

    It is possible to run across multiple Intel Xeon processors and KNC Intel Xeon Phi™ coprocessors, even on a cluster, with this implementation. However, it is a functional implementation and not performance optimized at this time. Detailing how to run this way is beyond the scope of the current instructions; however, to see how to do this with MPI applications in general, see Intel's web instructions.

    At present it is not possible to run the test suite in an automated fashion in native mode.
     

    MIC Offload PMEMD Model

  • Building the MIC Offload PMEMD Model

    The KNC Intel Xeon Phi™ (MIC) offload version supporting the first generation (KNC) of Intel Xeon Phi™ coprocessors is called pmemd.mic_offload.MPI and must be built separately from the standard parallel installation. MIC offload is not available in serial. If you are not planning on running on a first generation KNC Intel Xeon Phi™ coprocessor you do not need to build the native or the offload executables.

    pmemd.mic_offload.MPI

    Assuming you have installed Intel Parallel Studio XE version 2013 or later, you can build pmemd.mic_offload.MPI as follows:

    cd $AMBERHOME
    make clean
    ./configure -mic_offload intel   
    make install

    There is no need to specify the -intelmpi flag as this is the default behavior of the configure script.
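
The configure invocations for the Intel builds described so far can be collected in one lookup. The following is a minimal sketch rather than anything shipped with AMBER: the target labels (serial, mpi, hybrid, knl, knl_spdp, mic_native, mic_native_mpi, mic_offload) and both function names are hypothetical, and the script only prints the commands instead of running them. The GNU builds are left out.

```shell
#!/bin/bash
# Map a build target label (hypothetical names chosen for this sketch)
# to the ./configure arguments listed in the build sections above.
amber_configure_args() {
    case "$1" in
        serial)         echo "intel" ;;
        mpi)            echo "-intelmpi intel" ;;
        hybrid)         echo "-openmp -intelmpi intel" ;;
        knl)            echo "-intelmpi -openmp -mic2 intel" ;;
        knl_spdp)       echo "-intelmpi -openmp -mic2_SPDP intel" ;;  # experimental
        mic_native)     echo "-mic_native intel" ;;
        mic_native_mpi) echo "-mic_native -intelmpi intel" ;;
        mic_offload)    echo "-mic_offload intel" ;;
        *) echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the build commands for one target.
print_build_commands() {
    local args
    args=$(amber_configure_args "$1") || return 1
    printf '%s\n' 'cd $AMBERHOME' 'make clean' "./configure $args" 'make install'
}

print_build_commands hybrid
```

Printing the commands first makes it easy to review them before pasting them into a shell on the build host.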
     

  • Testing the MIC Offload PMEMD Model

    You can run the test suite using the MIC coprocessor with:

    make test.mic_offload

    The majority of these tests should pass. However, given the parallel nature of the KNC MIC coprocessor, meaning the order of operations is not well defined, it is not uncommon for there to be several 'possible FAILURES'. You should inspect the .dif files created in the $AMBERHOME/logs/test_amber_mic_offload/ directory to manually verify any 'possible FAILURES'. Differences that occur on only a few lines and are minor in nature can be safely ignored. If you see large differences, or are unsure, post them to the Amber mailing list for comment.
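
To help with the manual inspection, the .dif files can be listed by size so that the shortest ones are checked first. This is a minimal sketch under the assumption that a short .dif file is more likely to be a minor difference; the function name summarize_difs is hypothetical, and the contents of each file still need to be read by hand.

```shell
#!/bin/bash
# List non-empty .dif files under a test log directory, smallest first,
# each prefixed with its line count.
#   $1 = log directory, e.g. $AMBERHOME/logs/test_amber_mic_offload
summarize_difs() {
    local dir=$1 f
    find "$dir" -name '*.dif' -type f -size +0 | while IFS= read -r f; do
        # grep -c '' counts the lines in the file
        printf '%s %s\n' "$(grep -c '' "$f")" "$f"
    done | sort -n
}
```

A typical call would be `summarize_difs $AMBERHOME/logs/test_amber_mic_offload`.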


  • Running Simulations on Intel® CPU

    These instructions are intended to provide help with optimizing Intel Xeon™ processor performance.

  • For MPI-only PME, IPS or GB runs:

      mpirun -np {mpi ranks} \
      $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin -o mdout -p prmtop -c inpcrd
      Set {mpi ranks} to the number of processors being used for the task. It is advisable to test performance as a function of the number of MPI ranks, since scaling depends strongly on system size, simulation options, and hardware, and using more ranks does not always provide greater performance once the scaling limit has been reached.
  • For MPI+OpenMP hybrid GB run:

      export I_MPI_PIN_MODE=pm
      export I_MPI_PIN_DOMAIN=auto   # both settings are important to ensure OpenMP threads are properly distributed between MPI tasks

      mpirun -np {sockets} -env OMP_NUM_THREADS={processors per socket} \
      $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin -o mdout -p prmtop -c inpcrd
      It is advised to set {sockets} to the number of physical chips being used, and OMP_NUM_THREADS to the number of cores per chip.
  • For MPI+OpenMP hybrid PME or IPS runs:

      export KMP_BLOCKTIME=0
      export KMP_STACKSIZE=200M
      export I_MPI_PIN_DOMAIN=core

      mpirun -np {mpi ranks} -env OMP_NUM_THREADS=2 \
      $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin -o mdout -p prmtop -c inpcrd
      It is advised to use 2 OMP threads per MPI rank, and to set the number of MPI ranks to the number of real (non-hyperthreaded) processor cores.
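
The rank and thread advice for hybrid runs on Xeon can be expressed as a small helper. This is a minimal sketch and not part of AMBER: the function name xeon_hybrid_layout and its arguments are hypothetical, and it only prints the suggested values. Actual performance should still be measured, as noted above for MPI-only runs.

```shell
#!/bin/bash
# Print the MPI rank / OpenMP thread layout suggested above for a hybrid run.
#   $1 = workload: "gb" or "pme" (IPS follows the PME advice)
#   $2 = number of sockets (physical chips)
#   $3 = physical cores per socket (hyperthreads not counted)
xeon_hybrid_layout() {
    local workload=$1 sockets=$2 cores_per_socket=$3
    case "$workload" in
        gb)  # one MPI rank per socket, one OpenMP thread per core
             echo "ranks=$sockets threads=$cores_per_socket" ;;
        pme) # one MPI rank per physical core, two OpenMP threads per rank
             echo "ranks=$((sockets * cores_per_socket)) threads=2" ;;
        *)   echo "unknown workload: $workload" >&2; return 1 ;;
    esac
}

xeon_hybrid_layout gb 2 12    # prints: ranks=2 threads=12
xeon_hybrid_layout pme 2 12   # prints: ranks=24 threads=2
```

The two values map directly onto the -np argument and OMP_NUM_THREADS in the mpirun lines above.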


  • Running Simulations on Intel® Xeon Phi™ Processors (KNL)

    MIC2 PMEMD Model

    In order to run a simulation on a second generation (KNL) Intel Xeon Phi™ processor, it is advised that you read the Intel Xeon Phi™ Processor Developer's Quick Start Guide [to be published shortly] in addition to the instructions given here. These instructions are intended to provide guidance on maximizing the performance of the KNL system. Note that at this time the following are NOT supported on KNL Xeon Phi™ processors: TI, EMIL, Force Switch, IPS and Lennard Jones 12-6-4.

  • For GB workloads:

      export I_MPI_PIN_MODE=pm
      export I_MPI_PIN_DOMAIN=auto:compact

      mpirun -np 4 -env OMP_NUM_THREADS={OMP Threads} -env KMP_AFFINITY=compact \
      -env KMP_STACKSIZE=10M $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin -o mdout -p prmtop -c inpcrd
      It is advised to set {OMP Threads} to the number of physical cores on the KNL chip. The number of MPI tasks should always be left at 4.
  • For PME Workloads:

      export KMP_STACKSIZE=200M
      export KMP_BLOCKTIME=0
      export I_MPI_PIN_DOMAIN=core

      mpirun -np {MPI ranks} -env OMP_NUM_THREADS={OMP Threads} \
      $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin -o mdout -p prmtop -c inpcrd
      It is advised to use the number of cores in the node as the number of MPI ranks, and 2 OMP threads per MPI rank.
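
As a dry run, the KNL command lines can be generated from the core count of the chip. The sketch below is illustrative only: knl_mpirun_command is a hypothetical name, the 68-core figure in the usage lines is just an example value, and KMP_AFFINITY is assumed to be meant as compact. The exported variables listed above (I_MPI_PIN_MODE, I_MPI_PIN_DOMAIN, KMP_STACKSIZE, KMP_BLOCKTIME) still have to be set separately for each workload.

```shell
#!/bin/bash
# Print (do not run) an mpirun command line for a KNL node.
#   $1 = workload: "gb" or "pme"
#   $2 = number of physical cores on the KNL chip
knl_mpirun_command() {
    local workload=$1 cores=$2
    local pmemd='$AMBERHOME/bin/pmemd.MPI -O -i mdin -o mdout -p prmtop -c inpcrd'
    case "$workload" in
        gb)  # always 4 MPI tasks; OpenMP threads = physical cores
             echo "mpirun -np 4 -env OMP_NUM_THREADS=$cores -env KMP_AFFINITY=compact -env KMP_STACKSIZE=10M $pmemd" ;;
        pme) # one MPI rank per core; 2 OpenMP threads per rank
             echo "mpirun -np $cores -env OMP_NUM_THREADS=2 $pmemd" ;;
        *)   echo "unknown workload: $workload" >&2; return 1 ;;
    esac
}

knl_mpirun_command gb 68
knl_mpirun_command pme 68
```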


  • Running Simulations on Intel® KNC Xeon Phi™ Coprocessors

    MIC Native PMEMD Model

    In order to run a simulation on the first generation KNC Intel Xeon Phi™ coprocessor, it is advised that you read the Intel Xeon Phi™ Coprocessor Developer's Quick Start Guide in addition to the instructions given here. This guide includes a description of the Intel Manycore Platform Software Stack (Intel MPSS), which enables the wide range of usage models that Intel Xeon Phi™ coprocessors support. Running a simulation on the coprocessor in native mode requires that all files and binaries be visible to the coprocessor. Either mount your file system on the coprocessor (requires root access) or explicitly transfer the binaries, libraries and input files to the /tmp directory on the coprocessor.

    Note: Mounting requires the Amber directory plus any additional libraries used in an Amber simulation to be visible to the coprocessor.

  • Running Simulations with a Mounted Filesystem

    1. To mount a filesystem please follow the instructions available in the Intel MPSS readme file or follow this link.
    2. Mount your AMBERHOME directory, working directory, Intel compiler directory, Intel MPI_HOME directory (for parallel run), and MKL_HOME directory (if MKL is used) on the coprocessor.
    3. Add the following environment variables to a file (source_knc.sh in this example), which will be sourced on execution of pmemd.mic_native:
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INTEL_COMPILER_HOME/lib/mic/
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/mic/
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/mic/lib/
      export PATH=$PATH:$MPI_HOME/mic/bin
    4. Run a simulation from your working directory on the host (CPU) with a mounted filesystem:
      ssh mic0 "source source_knc.sh; \
      $AMBERHOME/bin/pmemd.mic_native -O -i mdin -o mdout -p prmtop -c inpcrd"
  • Running Simulations without a Mounted Filesystem

    1. Upload the KNC Intel Xeon Phi™ coprocessor version of the Intel compiler library to the coprocessor:
      scp -r $INTEL_COMPILER_HOME/lib/mic/* mic0:/tmp/
       
    2. Similarly, upload the coprocessor versions of the MPI and MKL libraries:
      scp -r $INTEL_MPI_HOME/mic/lib/* mic0:/tmp/
      scp -r $MKL_HOME/lib/mic/* mic0:/tmp/
       
    3. Upload the coprocessor version of PMEMD (pmemd.mic_native/pmemd.mic_native.MPI ) and your working directory (containing the input files for simulation)
      scp -r $AMBERHOME/bin/pmemd.mic_native mic0:/tmp/
      scp -r working_directory/* mic0:/tmp/
       
    4. Change the permissions of the libraries and binaries so that they are executable on the coprocessor:
      chmod 777 -R /tmp/*
    5. Finally run the simulation from the host:
      ssh mic0 "export LD_LIBRARY_PATH=/tmp/; cd /tmp; \
      ./pmemd.mic_native -O -i mdin -o mdout -p prmtop -c inpcrd"
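
The staging steps above can be gathered into one reviewable list. This is a minimal sketch, not an official script: print_native_staging is a hypothetical name, the MKL path is taken from the mounted-filesystem section and only applies if MKL is used, and the chmod step is assumed to be issued over ssh so that it runs on the coprocessor. The function prints the commands rather than executing them.

```shell
#!/bin/bash
# Print (do not run) the commands for a native-mode run without a
# mounted filesystem.
#   $1 = coprocessor host name (mic0 in the steps above)
#   $2 = working directory holding the simulation input files
print_native_staging() {
    local mic=$1 workdir=$2
    printf 'scp -r $INTEL_COMPILER_HOME/lib/mic/* %s:/tmp/\n' "$mic"
    printf 'scp -r $INTEL_MPI_HOME/mic/lib/* %s:/tmp/\n' "$mic"
    printf 'scp -r $MKL_HOME/lib/mic/* %s:/tmp/\n' "$mic"   # only if MKL is used
    printf 'scp -r $AMBERHOME/bin/pmemd.mic_native %s:/tmp/\n' "$mic"
    printf 'scp -r %s/* %s:/tmp/\n' "$workdir" "$mic"
    # assumption: permissions are changed on the coprocessor itself
    printf 'ssh %s "chmod 777 -R /tmp/*"\n' "$mic"
    printf 'ssh %s "export LD_LIBRARY_PATH=/tmp/; cd /tmp; ./pmemd.mic_native -O -i mdin -o mdout -p prmtop -c inpcrd"\n' "$mic"
}

print_native_staging mic0 working_directory
```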

    MIC Offload PMEMD Model

    Unlike MIC native mode, once PMEMD has been configured with the -mic_offload flag and compiled, no additional steps are required to run pmemd.mic_offload.MPI . Work is automatically offloaded to the Intel KNC MIC Architecture.
    Execute the following command on the host to run a simulation in MIC offload mode:

    mpirun -np 8 $AMBERHOME/bin/pmemd.mic_offload.MPI -O

    Note: Choose the number of MPI processes to suit the specifications of the host CPU, e.g. 8 MPI processes for an Intel Xeon E5-2680 8 core processor, in order to achieve optimum performance. In the initial support for MIC offload in PMEMD, the amount of offloaded work to the coprocessor increases and settles to a stable value after running multiple time steps governed by the Amber load balancer. Thus, it is recommended that the simulation should run for at least 200 time steps to benefit from the coprocessor.


  • Considerations for Maximizing Intel® KNC Xeon Phi™ Performance using Offload Mode (recommended for advanced users only)

    The KNC MIC offload code uses OpenMP (OMP) threads to distribute the offloaded work across the MIC coprocessor cores. The default number of OMP threads per offloading MPI process is set to 30; however, this can be overridden by substituting the above execution command with the following advanced execution command in this example runscript:

    Run.mic_offload
    #!/bin/bash
    export MIC_ENV_PREFIX=PHI
    export OMP_NUM_THREADS=1
    mpirun -n 11 ./pmemd.mic_offload.MPI -O \
    : -n 1 -env PHI_KMP_PLACE_THREADS 30c,4t,0O \
    -env PHI_KMP_AFFINITY scatter \
    -env PHI_OMP_NUM_THREADS 30 \
    -env MIC_OMP_STACKSIZE 4M \
    ./pmemd.mic_offload.MPI -O \
    : -n 1 -env PHI_KMP_PLACE_THREADS 30c,4t,30O \
    -env PHI_KMP_AFFINITY scatter \
    -env PHI_OMP_NUM_THREADS 30 \
    -env MIC_OMP_STACKSIZE 4M \
    ./pmemd.mic_offload.MPI -O \
    : -n 11 ./pmemd.mic_offload.MPI -O

    "MIC_ENV_PREFIX=PHI" states that any environment variable prefixed with PHI will be applicable to the KNC MIC coprocessor environment only (not the host processor environment).

    "OMP_NUM_THREADS=1" states that the number of host OMP threads is 1.

    In the MIC offload version of PMEMD only the middle two MPI processes are responsible for offloading work to the MIC coprocessor, e.g. if 8 MPI processes are specified, processes 4 and 5 are responsible for offloading to the MIC coprocessor. These two MPI processes simultaneously spawn OMP threads on the MIC coprocessor to execute the offloaded chunks of work. By partitioning the execution command to reflect the decomposition strategy, the number of OMP threads can be manually set. Partitioning of an MPI execution command is done via the use of ":" which is demonstrated in the example runscript above.

    In this example runscript for a 24 core Intel CPU augmented with a 61 core MIC coprocessor, 24 MPI processes are requested on a single node. The first 11 MPI processes and the last 11 MPI processes execute on the host CPU cores. The middle two MPIs (12 and 13) offload to a single MIC coprocessor and each spawn 30 OMP threads. The cores of the MIC coprocessor are divided in two so that each MPI process is assigned half the cores of the MIC coprocessor. In the above example, MPI process 12 spawns OMP threads on cores 1-30 whereas MPI process 13 spawns OMP threads on cores 31-60.

    • "-env PHI_KMP_PLACE_THREADS 30c,4t,0O" states that the MPI process will use 30 cores of the MIC coprocessor (30c), may use as many as 4 threads per core (4t), and that the first core used has an offset of 0 (0O). Please consult Intel's thread affinity documentation for a more detailed explanation.

    • "-env PHI_KMP_AFFINITY scatter" states that the OpenMP threads will be mapped to the hardware threads in a scattered fashion. Please consult Intel's compiler documentation for other options.

    • "-env PHI_OMP_NUM_THREADS 30" states that the MPI process uses 30 OpenMP threads.
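
The partitioning pattern in the runscript can be generated for other process counts. The following is a sketch under stated assumptions: offload_mpirun_command is a hypothetical helper, it assumes an even number of MPI processes with the middle two offloading, and it keeps the 30-core split and 4M stack size from the example. It prints the command on one line instead of running it.

```shell
#!/bin/bash
# Print (do not run) a partitioned mpirun command for MIC offload.
#   $1 = total MPI processes (even, at least 4)
#   $2 = OpenMP threads per offloading process
# Assumes the 61-core coprocessor of the example, split as 30 cores
# per offloading process (core offsets 0 and 30).
offload_mpirun_command() {
    local total=$1 threads=$2
    if [ $((total % 2)) -ne 0 ] || [ "$total" -lt 4 ]; then
        echo "total MPI processes must be even and >= 4" >&2
        return 1
    fi
    local side=$(( (total - 2) / 2 ))   # host-only processes on each side
    local exe='./pmemd.mic_offload.MPI -O'
    local offset
    printf 'mpirun -n %s %s' "$side" "$exe"
    for offset in 0 30; do
        printf ' : -n 1 -env PHI_KMP_PLACE_THREADS 30c,4t,%sO' "$offset"
        printf ' -env PHI_KMP_AFFINITY scatter'
        printf ' -env PHI_OMP_NUM_THREADS %s' "$threads"
        printf ' -env MIC_OMP_STACKSIZE 4M %s' "$exe"
    done
    printf ' : -n %s %s\n' "$side" "$exe"
}

offload_mpirun_command 24 30
```

With 24 processes and 30 threads this reproduces the 11 / 1 / 1 / 11 partitioning of the runscript; MIC_ENV_PREFIX and the host OMP_NUM_THREADS still need to be exported as shown there.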

    It may be of benefit to performance to adjust the number of OMP threads to be spawned by each offloading MPI process depending on the system being simulated, for example:

    -env PHI_OMP_NUM_THREADS 60
    For larger simulations (>200,000 atoms), spawning more OMP threads on the MIC coprocessor, such as 2 threads per core (a maximum of 4 threads per core is permitted), provides better performance. More OMP threads, however, require a larger stack size per thread and a larger total stack size on the MIC coprocessor. The default stack size is 8 KB.
    • "-env MIC_OMP_STACKSIZE 4M" increases the OMP thread stack size to 4 MB.

    • For MPSS version 3.2.3 and later the total stack size of a MIC coprocessor is increased by the following steps:

      1. On the host, as root, create the directories etc/ and etc/security in /var/mpss/common:
        cd /var/mpss/common
        su
        mkdir -p etc/security
      2. Next create the file limits.conf (in /var/mpss/common/etc/security) containing the following line of text (with tab separated values):
        "* soft stack unlimited"
      3. Create the file /var/mpss/common.filelist containing the following (with space separated values) lines:
        dir /etc/security 755 0 0
        file /etc/security/limits.conf etc/security/limits.conf 644 0 0
      4. Finally, cycle the MPSS daemon to reboot the cards using:
        service mpss restart
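
Steps 1 to 3 can be rehearsed in a scratch directory before touching the real MPSS tree. This is a minimal sketch: write_mpss_stack_limits is a hypothetical helper that takes the MPSS root as an argument (normally /var/mpss, which requires root), and as a precaution chosen for this sketch it refuses to overwrite an existing common.filelist. Restarting the MPSS service (step 4) is left to the administrator.

```shell
#!/bin/bash
# Create limits.conf and common.filelist under an MPSS root directory.
#   $1 = MPSS root (normally /var/mpss; use a scratch directory to rehearse)
write_mpss_stack_limits() {
    local root=$1
    if [ -e "$root/common.filelist" ]; then
        echo "$root/common.filelist already exists; edit it by hand" >&2
        return 1
    fi
    mkdir -p "$root/common/etc/security" || return 1
    # tab-separated values, as required for limits.conf
    printf '*\tsoft\tstack\tunlimited\n' > "$root/common/etc/security/limits.conf"
    # space-separated values
    printf '%s\n' \
        'dir /etc/security 755 0 0' \
        'file /etc/security/limits.conf etc/security/limits.conf 644 0 0' \
        > "$root/common.filelist"
}
```

For example, `write_mpss_stack_limits /tmp/mpss_rehearsal` produces the two files under /tmp/mpss_rehearsal for inspection.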


    Recommended Hardware

    In order to simplify the selection of hardware for AMBER simulations on Xeon, KNC Xeon Phi and KNL Xeon Phi hardware, we have teamed up with Exxact Corporation to offer preinstalled AMBER Certified computing solutions. This includes a Xeon Phi Life Sciences Certified Solutions Program developed jointly between Exxact, Intel and the lab of Prof. Ross Walker. These systems can be ordered with AMBER 16 preinstalled (AMBER license required) and come with a full 3 year warranty. For details and to customize systems please contact either Ross Walker (ross_at_rosswalker.co.uk) or Mike Chen (mike@exxactcorp.com).

    AMBER Certified Entry-Level Workstation (~$2500)
    Ideal for Graduate Students

      • 1x Intel Core i7-4930K CPU
      • 32 GB system memory
      • AMBER16 preinstalled and tested
      • CentOS 6 or 7
      • 3 year warranty

    AMBER Certified Mid-Level Workstation (~$4799)
    Ideal for Researchers

      • 2x Intel Xeon E5-2620 v4 CPUs
      • 1 Intel Xeon Phi 7120
      • 64 GB system memory
      • AMBER16 preinstalled and tested
      • CentOS 6 or 7
      • 3 year warranty

    AMBER Certified High-End Workstation (~$9999)
    Maximum Performance

      • 2x Intel Xeon E5-2697 v4 CPUs
      • 1 Intel Xeon Phi 7120
      • 64 GB system memory
      • AMBER16 preinstalled and tested
      • CentOS 6 or 7
      • 3 year warranty

    Custom turnkey cluster solutions are also available. Please email Ross Walker (ross at rosswalker.co.uk) or Mike Chen (mike@exxactcorp.com) for details.


    Additional Resources

    The following provides some additional resources that you may find useful.
