From: "Robert Duke"
Subject: Re: AMBER: pmemd speedup and interactions
Date: Wed, 31 Mar 2004 15:03:39 -0500

Lubos -

BEGIN JOKE

Well, you are the first guy to complain to me about only getting 6.9
nsec/day for your problem with pmemd on 1 processor, or 14 nsec/day on 4
processors.... ;-)

END JOKE

Now, all joking aside, there are three serious issues here, none of which are
actually problems with pmemd. EVERYBODY that runs MD
software on linux clusters, PLEASE listen up!!!

The issues are:

1) You must always choose a REASONABLE benchmark if you want reasonable
results.
2) The interconnect is very important the minute you go past having two
processors on cheap hardware.
3) Hyperthreading is an unsupported and untested feature for this software.
I do not have access to HT processors on any linux systems at the moment,
but I would not expect big gains, and would expect possible losses, from
using such features. I need to learn more about this stuff, but I will
share some opinions about potential problems below.

So let's set the hyperthreading results aside for the moment, and just look at
your other results first:

The benchmark:

First of all, the benchmark chosen has some major problems, and it is certainly
not the sort of benchmark one would want to use to demonstrate good scaling.
You do not say how many atoms are in the system, but it must be small, because
you are running 1000000 steps in 210 minutes on 1 processor. That works out to
a 0.001 psec step every 12.6 msec, I believe. Now, pmemd was designed to
efficiently run moderate to big simulations, say 20,000 to 500,000 atoms. If I
run my classic factor ix solvated protein problem with 90906 atoms on a 2.4 GHz
xeon, it takes 2381 msec per step. So you are running about 200x faster. Why
is this important? Well, for multiple processors, 12.6 msec / <num of
processors> is roughly the interval at which you would ideally be communicating
data in each step. Because your system is smaller there will be less data to
communicate, but you will still be thrashing badly on communications.

Why are you running so fast to begin with? First of all, it must be a small
system, and secondly, you have chosen a small cutoff of 6 angstroms, which
reduces the direct force calculations required. Most folks do not choose
cutoffs below 8 angstroms, the default. So you run 2-2.5 times faster, but the
accuracy of your simulation suffers.

Also, in your simulation setup, you make some other bad choices:

1) nscm = 1 - This causes removal of rotational/translational velocity at
EVERY step of the simulation, with a significant computational overhead, and an
added communications overhead in parallel (an all reduce at a minimum; there
may be additional stuff in 3.03/3.1, but I would have to look back and see).
Nscm is there to prevent the solute from pulling the "flying cube of ice" stunt
in pme simulations; it is useful if you have a solute, and should be invoked
about once every 1000 steps. There is no need to use it at all for a cube of
water.

2) skinnb = 0.4 - Skinnb is purely a performance option, affecting the "skin
thickness" that is added when constructing the direct force pairlists.
Movement past the skin is detected and triggers a pairlist rebuild, so it
should have no effect whatsoever on results. The default for pmemd is 1.0
angstrom. By making it shorter, you force the pairlist to be rebuilt much more
frequently, and there is LOTS of computational AND communications overhead
associated with that. So skinnb should be left alone unless you know what you
are doing.

Anyway, your settings for nscm and skinnb will both contribute to a degradation
in scaling. So you should take a reasonable, say 20K-100K atom, problem and
use it for benchmarking; I would suggest benchmarks/dhfr, jac, or rt_polymerase
from amber 7 (dhfr is also available in the test directory for amber 6). In
amber 8, my favorite factor ix benchmark is available. You should use settings
near those in the test/benchmark code for mdin, and you should benchmark for
100-5000 steps, not 1000000. With any real system, it will take you a while to
do a nsec on 1-4 processors.

One other point about problem size. PMEMD is faster than other software for
two main reasons: 1) caching optimizations, and 2) communications
optimizations. By caching optimizations, I mean that the code has been
structured to increase the likelihood that the data/code needed will be found
in the high speed memory cache (the cache "hit ratio" is higher). This has a
big impact on the time required for computation. In the communications area, a
number of things are done which I won't go into. The bottom line, however, is
that none of this really helps much on small problems that don't use much
memory and don't have much data to communicate. PMEMD was written to make the
big stuff run faster.
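
To make this concrete, here is roughly what I would use for a timing run,
starting from your own mdin and changing only the items discussed above. This
is just a sketch - keep whatever other options your system actually needs
(amber7_compat, the SPC water naming, and so on), and compare against the mdin
files in the amber test/benchmark directories:

# short benchmark; const. press. md; shake
 &cntrl
  imin = 0, ntx = 5, irest = 1,
  nstlim = 500, dt = 0.001,
  cut = 8,
  nscm = 1000,
  ntpr = 100, ntwr = 500, ntwx = 0,
  ntf = 2, ntb = 2,
  ntc = 2, tol = 0.00001,
  ntt = 1, temp0 = 200,
  ntp = 1, pres0 = 1, taup = 1,
 &end

Note that there is no &ewald section at all, so skinnb stays at the pmemd
default of 1.0 angstrom; nscm could also be dropped entirely for a plain box of
water, and the reduced output frequencies just keep i/o out of the timing.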

The interconnect:

This is always the killer when you go past 2 processors on linux clusters.
For 2 processors, the shared memory interconnect is typically reasonably
fast. On real supercomputers, the interconnect between groups of processors
with shared memory is really fast, and pmemd will scale nicely out to
somewhere between 64-128 processors. PMEMD was specifically designed to do
this, with the early work done mainly on an IBM SP3, a real supercomputer.
I have always been a little embarrassed about the pmemd performance at 2
processors as compared to 1 processor, but this is "by design". Parallel
pmemd does all kinds of things, and sets up all kinds of data structures,
with the goal of scaling at better than 50% over the range of 16-128
processors. All this introduces overhead that hurts on only two processors
(ie., on 2 processors, simpler code would be faster). However, I did not
really want to support a 1 processor implementation, a 2-4 processor
implementation, and an 8+ processor implementation. So in my hands for
factor ix on the xeon, pmemd scales with only about 80% efficiency going
from 1 to 2 processors, but then scales with 87% efficiency going from 2 to
4 processors, and compared to other software just does not lose much as you
add processors.
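
To put rough numbers on that: 80% efficiency going from 1 to 2 processors is a
speedup of about 2 x 0.80 = 1.6x, and 87% efficiency going from 2 to 4 is
another factor of 2 x 0.87 = 1.74x, or roughly 1.6 x 1.74 = 2.8x overall on 4
processors - so the 2381 msec factor ix step time drops to something in the
neighborhood of 850 msec.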

This nice scaling, however, presupposes a good interconnect. What is a good
interconnect? Well, it costs money. Basically, gigabit ethernet at its
best is not very good. However, if you are going to use gigabit ethernet,
and you expect decent scaling, you MUST:
1) buy decent network interface cards (NICs). Typically, you want to buy
server as opposed to client NICs, and they will cost you something like 3-4
times as much. Intel builds some good ones, but you have to buy the expensive
ones, not the workstation ones.
2) buy a decent gigabit ethernet hub or switch. Carlos Simmerling's group
recommends some Hewlett-Packard gear.
3) have good cables - Category 5e or 6, I would expect.
4) have good motherboards on your systems, with a fast PCI architecture.
5) if there is a difference in PCI speeds on different slots, your NIC must be
in the right slot.
6) the gigabit ethernet interconnect should be dedicated; ie., it should not
be what you also use to talk to the internet or other machines outside your
cluster.
7) ideally, the interconnect for gigabit ethernet should be dedicated to one
problem at a time; ie., if you do a run, it is not a run where somebody else
has a subset of the processors and is hammering the interconnect too. Two
clusters of 8 nodes with two separate interconnects will perform better
running 2 problems than one cluster of 16 nodes with one interconnect
running 2 problems. If somebody is running Gaussian, or sander, or even
pmemd on the same cluster, you will notice. I typically post benchmarks for
small systems without additional load on them. For real supercomputers, the
norm is to benchmark with other things going on; they can withstand this a
lot better, but are also impacted.
8) mpich probably needs some tweaking for optimal performance, and how you
build it may matter (see the sketch below this list).
9) heaven only knows what the issues are with the linux networking, os, and
driver software.
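
On point 8, as an example of the sort of thing I mean (and this is only a
sketch from memory - the details vary with the mpich version, so check the
mpich documentation rather than trusting me on the exact option names): people
typically build mpich with the ch_p4 device and shared-memory communication
within a dual-processor node, something like

 ./configure --with-device=ch_p4 --with-comm=shared

and then experiment at run time with the p4 socket buffer size (the
P4_SOCKBUFSIZE environment variable), since the default buffer sizes may not be
ideal for gigabit ethernet.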

Sound scary? Good. It is not as simple as falling off a log, and the folks
that get good scaling from these systems are either very very lucky or have
someone dedicated to the task who knows what they are doing.

How bad can it be if you don't do this stuff right? Really really bad. I run
two Micron PCs for my development/test setup - both nice little Athlons, but
cheap, with cheap 100 mbit/sec NICs, a cheap hub, and cable all over the place.
When I run two processor tests on these systems, they are really really slow;
in fact it is generally faster to run a problem on one processor than on two.
Am I surprised? Not at all. Am I dismayed? Well, not really, because I am
just testing for software correctness when I run these systems; I did not pay
for performance, and I sure as heck did not get it.

The guys that build and run supercomputers have a reason to exist. There are
in this world problems that don't decompose particularly well (like MD), and
the only way you can get these problems to run fast is to have really really
fast communications between processors. Folks that maintain that things like
grid computing and piles of cheap PCs can do everything that supercomputers can
do are peddling snake oil to the unwary. They can do only SOME of the things
that supercomputers can do. I have rewritten code, making it 100x faster on
cheap PCs, but those problems were decomposable (ie., doing step n+1 did not
rely on knowing the results of step n).

Hyperthreading:

The cynical part of me would write this "HYPErthreading". I am sure there are
some advantages, but the advantages are likely small in our problem space, and
the potential downside is big. I don't have access to hyperthreading systems
yet, so I don't know all the nuances. However, from a general perspective, as
Ross said, running on more processors costs more in communications overhead,
and that is just the beginning of the potential problems.

Generally, something like hyperthreading should work well when machines are
doing multiple tasks, blocking relatively randomly on i/o to disk or what have
you. We instead have multiple machines all doing basically the same thing and
blocking in synchrony. We CAN get computation / i/o overlap by being clever,
but hyperthreading probably does not help much (with asynchronous i/o, we post
the i/o, do other things, and then wait for the i/o to complete when we can't
do anything else - this works just fine on a single processor through operating
system thread scheduling, and may be a little cheaper with hyperthreading).
However, any attempt at increasing the virtual processor count tends to just
increase overhead without much benefit. Thus, say you were to do an mpirun
-np 8 on 4 real processors. You end up with less cache, memory, and i/o
bandwidth available per processor to handle an increased i/o load. You win to
the extent that the asynchrony helps you, and I would expect that it does not
help that much. You lose to the extent that sharing cache, memory, and i/o
bandwidth hurts you.

As I say, I don't know all the details, but I don't support or recommend doing
this just yet. If someone gets consistently good results for reasonable
problems, please let us all know what the tricks are. I would expect at most a
small gain from doing an mpirun -np 4 on 4 real processors with hyperthreading,
associated with more efficient thread scheduling, and I have heard horror
stories about vastly inferior performance due to thread prioritization issues
on the virtual processors.
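
One practical note, and this is an assumption on my part rather than anything I
have tested with pmemd: if you do run on hyperthreaded boxes, at least make
sure you know how many physical processors you really have before you pick the
-np value. On a reasonably recent linux kernel, /proc/cpuinfo lists every
virtual processor, so something like

 grep -c '^processor' /proc/cpuinfo

counts virtual processors, and the "physical id" and "siblings" fields show how
they map onto real chips; match -np to the physical count, not the virtual one.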

Well, sorry for going on and on about this stuff, but these are all hot
button issues for me. Linux clusters may be cheap, but to at least some
extent you get what you paid for.

Regards - Bob

----- Original Message -----
From: "Lubos Vrbka"
To: <amber@scripps.edu>
Sent: Wednesday, March 31, 2004 4:55 AM
Subject: AMBER: pmemd speedup and interactions


> hi,
>
> i just forgot to add my input file, in case it would be important...
>
> # 1 ns const. press. md; shake
> &cntrl
> amber7_compat = 0,
> imin = 0,
> nscm = 1,
>
> ntx = 5, irest = 1,
>
> ntpr = 500, ntwr = 500, ntwx = 1000,
> iwrap = 0,
>
> ntf = 2, ntb = 2,
> cut = 6,
>
> nstlim = 1000000, dt = 0.001,
>
> ntr = 0, ibelly = 0,
>
> ntt = 1, temp0 = 200,
> ntp = 1, pres0 = 1, taup = 1,
> ntc = 2, tol = 0.00001,
>
> jfastw = 1, watnam = 'SPC',
> owtnm = 'O', hwtnm1 = 'H1', hwtnm2 = 'H2',
>
> nmropt = 0
> &end
> &ewald
> skinnb = 0.4
> &end
>
> btw, other things i can see with the binaries on some pentium3 or athlon
> clusters. socket (p4) binaries do not have 100% load on processors, but
> less...
> this could be due to slow network connecting the nodes (but sometimes i
> cannot see this behaviour for 3.0.1 binary but it appears for 3.1 for
> the same cluster...
> or is something from my input forcing the processes to communicate more
> than they should and thus decreasing the load on processors?
>
> thanks,
>
> --
> Lubos