amber-developers: Pmemd vs sander

From: "Robert Duke"
Subject: Re: amber-developers: Pmemd vs sander
Date: Sun, 16 Nov 2003 11:56:57 -0500

Yong -
Well, we all have a heck of a time distinguishing between roundoff error and
something real.

In sander 7 mode, there are some real differences between pmemd and sander 7
in the calculation of virials, but I don't believe there are other
significant differences. In amber7 mode, pmemd retains the amber 6
molecular virials calculation algorithm, which treats crosslinked molecules
differently (neither approach being necessarily more correct - it is a
result of changes in 7 associated with Tom D.'s extra points code). So are
you comparing sander 7 to pmemd/amber7 mode? If so, there will definitely
be differences, rather quickly. Look at the differences between sander 6
and sander 7, by the way.

If not, then lets talk about levels of difference. I typically run
somewhere around 20 different tests (my test machine is not up at the
momemt), and typically expect identical results, within reason, for pmemd
vs. the appropriate sander for somewhere between 30 and 100 steps (a lot
less than 10 ps, or 10000 steps). I say within reason because for some
models you may have an energy value that is calculating out internally very
near a point where a visible digit changes, so you get a 121.4561499999999
vs. a 121.456150000000 for example if you look at the number in more
precision. What happens in such a test is that given value will be
different, perhaps in step 1, and then match in step 10. I chose these
bounds (30-100 steps) from observations of what happens with sander running
on different architectures and with different numbers of processors (ie.,
you start seeing differences after that even with the same piece of code).

In regard to mpi, what I observe for pmemd may have a significantly
observable, but not significant, difference from sander 6/7 in runs past
around 300 steps (still a heck of a lot less than 10000 steps). This is due
to dynamic loadbalancing in pmemd. Under mpi, on systems with any other
load whatsoever (heck, just OS scheduling), you are virtually guaranteed
that two identically set up runs will not produce identical results for more
than about 300 steps in pmemd. This is purely due to the impact of rounding
errors when the workload is split between processors differently, and pmemd
will for all intents and purposes never split the workload the same way past
the first list build, because the workload split is dependent on relative
timings, and even 2 otherwise idle machines will have ever so slightly
different throughputs from one second to the next.

Also, in pmemd, depending on the version you are using, the axis
reorientation algorithm can produce different results really quickly in
unstable systems. In pmemd 3.00-3.03, axis reorientation was done under mpi
but not in uniprocessor code. In 3.13, I changed the defaults to do axis
reorientation only under mpi if the box had an aspect ratio of 1.5 or
higher. Axis reorientation has a subtle effect on gridding and/or fft calcs
in pme, and in fact could even affect the nfft values that are auto-computed
(due to the eveness requirement in the first dimension of rcfft's). I think
there are even more significant impacts in shake. I am a little unsure
about the shake code in general, after observing how different the results
can be if one simply reorients the box. I only see this in systems very far
from equilibrium, though (vacuum bubbles, etc.).

SO, could there be bugs in pmemd that affect energy output past 300 steps?
You bet. Could there be bugs in sander 6/7 that affect energy output past
300 steps? You bet. I worry about all this stuff a lot with mpi, where
complete coverage of the atoms (ie., being sure that all the calcs for each
atom occur once and only once) is an issue. I worry about listbuilds
picking up all atoms. I don't completely understand some of the sander 6
listbuilding code, but I seem to recollect that sander 6 may actually miss
putting some atoms at the edge of the skin on the list that I get (this may
only be for nonorthogonal code). I have more faith in my code here than the
sander 6 code. There could be memory overruns with subtle affects. When I
first decided to start hacking on sander 6, I did it primarily due to a bug
in sander 6 that I never did figure out. I was doing long runs on an sgi
dual processor workstation, and I would go for 250-500 psec and crash with
an obscure error in the pme fft code. Not reproducible at all. Some sort
of memory stomp, but given the "lets hack a huge piece of memory into lots
of little pieces" memory allocation scheme, I never was successful in
finding it. So I just started rewriting the code and the problem went away.
Here I definitely have much more faith in my code, because using dynamic
memory allocation in f90 is more reliable, and can be debugged more
effectively. My point on both of these things is that I don't trust sander
6 or 7, and with good reasons. The way we do software engineering, we could
have subtle bugs, and we would never know it. I am including myself in
this. I think I have tried to develop pmemd to a higher level of quality,
but I am still very dissatisfied with it, because I have been in a rush, it
derives from poorly documented and structured code, and the rewards
associated with doing a really excellent job of the engineering are not
there.

I guess the bottom line on all this is, if you see a difference, is it a
statistically significant difference in a large sample of runs, and is it
"large enough" to indicate a possible bug? So I think you have to do dozens
of runs for each system and then do statistics. I suppose that on a
uniprocessor, this should not be the case; the math should be reproducible.
Still, how do you know that the differences are significant, and not just
attributable to rounding errors, different implementations of sqrt(), sin(),
exp() or what have you? Sander 6 uses 4 byte constants in its wrapping
algorithms; sander 7 uses 8 byte constants. Pmemd is sander 6 in this
regard; you see differences real quick if wrapping. Is it significant? IF
there is a bug that produces a difference, why do you think it is in pmemd?
(;-)). I can give you two answers on this one myself, in two different
directions. 1) The bugs are more apt to be in pmemd, because it does more
complicated things to get better scaling. The code complexity associated
with all this workload redistribution and spatial optimization is
significant. 2) The bugs are less apt to be in pmemd, because it is written
in f90 in modules, with all the engineering reliability benefits that
pertain - like interface checking and dynamic memory allocation.

I think several doctoral degrees could be done on all this stuff, and the
field would benefit. I don't think any of us really know what we are doing
in this area, and that most folks have way too much confidence in their own
code, as well as the code of others. I know that is pessimistic. I have a
lot of interest in the whole area of error analysis, starting with the force
fields themselves down through the algorithms that generate the energetics
from the forcefields. I am also very interested in driving the code base
toward faster, more reliable algorithms that we can have more confidence in.
In the meantime, in looking at the differences between pmemd and sander 6/7,
ask yourself "is it significant?". And in doing this, look at the
differences between sander 6, 7, pmemd, gromacs, namd, commercial product x,
... If there are problems, I do want to know about them and fix them.

Regards - Bob

----- Original Message -----
From: "Yong Duan"
To: <amber-developers@scripps.edu>
Sent: Sunday, November 16, 2003 12:39 AM
Subject: amber-developers: Pmemd vs sander

>
> This is probably better addressed to bob, but I think may have general
> insterets.
>
> Although the first step energies of sander and pmemd agree to the last
> digit, after a while (within 10ps) the energies of the two may start to
> show difference withi a notable change in the energies by pmemd. The
> pmemd energies stabilize at a new level which is not too different from
> the one by sander, but nevertheless noticeable (about the level of
> fluctuation). Has anybody else noticed it?
>
> Secondly, for some reason, I keep getting "FLUSH--FAILED:: Resource
> temporarily unavailable" from sander on Xeon/mpich/ifc/static
> combination.
>
> yong
>