University of Utah CHPC opteron cluster benchmarks

UNIVERSITY OF UTAH CHPC OPTERON CLUSTER BENCHMARKS, 3/01/05

Following are some explicit solvent PME MD simulation benchmarks for a
large Opteron cluster at the U of Utah.  This cluster (delicatearches) has 256
relatively slow dual 1.4 GHz Opterons interconnected with Myrinet (mpich_gm)
and gigabit ethernet (mpich 1.2.6 or mpich2).  The operating system is 64 bit
GNU/Linux.  Thanks to Tom Cheatham and the folks at U of Utah for access to
this nice resource.  As a preamble to the benchmarks, let me say that this
stuff is incredibly fast for a Linux cluster despite the slower processor
speeds; with PMEMD 8 over Myrinet I got 1.99 nsec/day on Factor IX (~91k
atoms, 64 processors) and 5.4 nsec/day on the JAC benchmark (~23.5k atoms,
72 processors).  All benchmarking was done with the system under full load.

The Pathscale pathf90 compiler, version 2, was used to compile PMEMD and SANDER
for all these benchmarks.  At the time of doing these benchmarks, the
Portland Group pgf90 compiler, version 5.2-4, was available, but there were
compiler bugs and a separate problem with the mpich-gm libraries.
Since then, I have had access to the latest pgf90 release (6.0), and it seems
that the critical bugs have been fixed and that performance is roughly
comparable to that of the Pathscale compiler (based on spot-check benchmarks
only).  Configuration files have been released for PMEMD as part of the
new_configure tarball on the ambermd.org website.

Special Notations:

NA - Not applicable.

ND - Not done.  In the case of SANDER, certain benchmarks were not done either
     because scaling would be less than 50% or because SANDER can only do
     parallel runs using a processor count that is a power of 2.

BENCHMARKS USING THE MYRINET MPICH_GM INTERCONNECT:

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
             cutoff = 8.0 angstrom, timestep = 0.0015 psec,
             orthogonal unit cell

#procs | PMEMD 8
       | psec/day scaling(%)
       |
 1     |   60        NA
 2     |  111       100
 4     |  215        96
 8     |  398        89
16     |  749        84
24     |  974        73
32     | 1336        75
40     | 1542        69
48     | 1662        62
56     | 1851        59
64     | 1994        54
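
The report does not say how the scaling(%) column is computed.  The values
are consistent with parallel efficiency measured against the 2-processor run
(which is listed as 100%), so the sketch below assumes that definition; the
function name and the choice of baseline are assumptions, not something stated
by the author.  It lands within a couple of points of the table, with the
differences presumably due to rounding of the published psec/day figures.

```python
def scaling_percent(procs, psec_per_day, base_procs=2, base_psec_per_day=111.0):
    """Parallel efficiency (%) relative to an ASSUMED 2-processor baseline."""
    # Throughput we would see if the baseline run scaled perfectly linearly.
    ideal = base_psec_per_day * procs / base_procs
    return 100.0 * psec_per_day / ideal


# Factor IX over Myrinet: (processor count, psec/day) copied from the table above.
factor_ix_myrinet = [(2, 111), (4, 215), (8, 398), (16, 749), (32, 1336), (64, 1994)]

for procs, rate in factor_ix_myrinet:
    print(f"{procs:3d} procs: {scaling_percent(procs, rate):5.1f}%")
```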


23558 Atoms, Constant Volume Molecular Dynamics (JAC Benchmark)
             cutoff = 9.0 angstrom, timestep = 0.001 psec,
             orthogonal unit cell

JAC (Joint Amber-CHARMM) benchmark (constant volume), 23558 atoms, PME,
explicit solvent simulation, mpich_gm interconnect.  Here I use the default
skinnb value, which is a fair way to run this test: skinnb has no effect on
output, it is purely a performance optimization, and specifying the wrong
value can de-optimize the code.

#procs | PMEMD 8             | SANDER 8
       | psec/day scaling(%) | psec/day scaling(%)
       |                     |
  1    |  124        NA      |  115        NA
  2    |  235       100      |  231       100
  4    |  461        98      |  424        92
  8    |  873        93      |  752        81
 16    | 1630        87      | 1170        63
 32    | 2979        79      | 1464        40
 48    | 4019        71      |   ND        ND
 56    | 4547        69      |   ND        ND
 64    | 4800        64      |   ND        ND
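
To put psec/day in more concrete terms, the timestep given in each benchmark
header converts a throughput into MD steps per day and wall-clock time per
step.  The sketch below does that arithmetic for the largest runs in the two
Myrinet tables; the helper names are made up for illustration.

```python
SECONDS_PER_DAY = 86400.0


def steps_per_day(psec_per_day, timestep_psec):
    """Number of MD steps completed per day of wall-clock time."""
    return psec_per_day / timestep_psec


def seconds_per_step(psec_per_day, timestep_psec):
    """Average wall-clock seconds spent on one MD step."""
    return SECONDS_PER_DAY / steps_per_day(psec_per_day, timestep_psec)


# (psec/day, timestep in psec) copied from the tables and headers above.
runs = {
    "Factor IX, 64 procs": (1994.0, 0.0015),
    "JAC, 72 procs": (5400.0, 0.001),
}

for name, (rate, dt) in runs.items():
    print(
        f"{name}: {rate / 1000:.2f} nsec/day, "
        f"{steps_per_day(rate, dt):,.0f} steps/day, "
        f"{1000 * seconds_per_step(rate, dt):.1f} ms/step"
    )
```

By this arithmetic the 72-processor JAC run spends about 16 ms of wall-clock
time per step, and the 64-processor Factor IX run about 65 ms.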
 72    | 5400        64      |   ND        ND


BENCHMARKS USING GIGABIT ETHERNET MPICH2 (ver 1) INTERCONNECT:
(DON'T USE -DSLOW_NONBLOCKING_MPI FOR PMEMD)

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
             cutoff = 8.0 angstrom, timestep = 0.0015 psec,
             orthogonal unit cell

#procs | PMEMD 8
       | psec/day scaling(%)
       |
  1    |   60        NA
  2    |  109       100
  4    |  193        89
  6    |  273        84
  8    |  350        81
 12    |  499        76
 16    |  642        74
 24    |  864        66
 32    |  997        57

NOT using -DSLOW_NONBLOCKING_MPI is optimal over the range of 4-32 processors
with mpich2.


BENCHMARKS USING GIGABIT ETHERNET MPICH (ver 1.2.6) INTERCONNECT:
(DO USE -DSLOW_NONBLOCKING_MPI FOR PMEMD)

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
             cutoff = 8.0 angstrom, timestep = 0.0015 psec,
             orthogonal unit cell

#procs | PMEMD 8
       | psec/day scaling(%)
       |
  1    |   60        NA
  2    |  109       100
  4    |  184        85
  6    |  252        77
  8    |  322        74
 12    |  450        69
 16    |  554        64
 24    |  720        55
 32    |  864        50

USING -DSLOW_NONBLOCKING_MPI is optimal over the range of 12-32 processors. 
For fewer processors, you will do about as well or slightly better without 
using -DSLOW_NONBLOCKING_MPI.
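
The guidance about -DSLOW_NONBLOCKING_MPI in the two gigabit ethernet sections
can be condensed into a small lookup.  This is only a restatement of the notes
above for the 2-32 processor range that was measured on this cluster; the
function name and the "mpich1"/"mpich2" labels are hypothetical, and it says
nothing about other interconnects or MPI versions.

```python
def use_slow_nonblocking_mpi(mpi, procs):
    """Return True/False per the notes above, or None where nothing was measured.

    `mpi` is a hypothetical label: "mpich1" or "mpich2" over gigabit ethernet.
    """
    if not 2 <= procs <= 32:
        return None  # outside the benchmarked processor counts
    if mpi == "mpich2":
        return False  # section header: don't use the flag with mpich2
    if mpi == "mpich1":
        # Flag is reported optimal for 12-32 processors; below that the
        # report sees equal or slightly better results without it.
        return procs >= 12
    return None  # MPI implementation not covered by these benchmarks


for mpi, procs in [("mpich2", 16), ("mpich1", 8), ("mpich1", 24)]:
    print(mpi, procs, use_slow_nonblocking_mpi(mpi, procs))
```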

Bob Duke
NIEHS and 
UNC-Chapel Hill Chemistry Dept.