Date: Tue, 12 Aug 2003 07:24:58 -0700 (PDT)
From: Michael Crowley
Subject: Re: AMBER: machine dependent?

Dear Shuang,
Although you may be finding a hidden problem in Amber 7, I agree
with Dr. Case that one should not see SHAKE failures in a
production run, that is, a run that is meant to be a valid
simulation and not an equilibration. As soon as you see
a SHAKE failure, all confidence in the validity of the
simulation is probably gone.

I would like to explain a SHAKE failure and ask for patience
if it is obvious, in the hope that it will help one understand the
subsequent explanation of why different machines would
give different answers.
SHAKE failures a mystery to a large number of researchers and
I hope the following will help with understanding what is
going on how one should respond to the event.

======WARNING====== Very long verbose message ......

SHAKE is a method to solve a set of nonlinear simultaneous
equations that describe the desired bond lengths between a group of
atoms, for instance the three bonds in TIP3 water (OH,OH,HH).
It is a little more complicated than that but not much.
When motion is propagated in the integration step of the dynamics,
the atoms move away from their desired bond distances. The new
coordinates are plugged into the SHAKE equations, and these
equations are solved for a new set of coordinates that do
have the desired bond lengths (and also follow some other constraints).
The solution is found using the assumption that the system has not
strayed too far from the desired distances in one step, and
the equations can be solved in the linear region of the equations.
The general method involves many atoms and is solved by an
iterative method which only works within the valid linear region.
Outside that region, there is either no solution of multiple solutions.
Thus if your atoms have moved so far that they are outside
the region where the SHAKE algorithm is valid, then SHAKE will
"fail" when it does not converge to a solution.

SHAKE failures usually occur when one starts a simulation with
a very strained structure, one that has atoms too close or bonds
of angles stretched or compressed in a way that produces very large
(abnormally large) forces. The propagation of the classical motion will
send atoms out of the region where SHAKE is valid, and one will
see SHAKE failure. Sometimes the simulation can recover if the
resultant kinetic energy is removed (vlimit) and the bond
forces pull the system back to the region where SHAKE can find
a solution. Most of the time, this condition is avoided by
minimizing the system before starting any dynamics.

Most people see SHAKE failure as a BAD thing, and use it as a signal
that there is something wrong with their structure or system.
Certainly, if a failure has occurred in the middle of a long
simulation, the system will definately have taken a step
away from the trajectory that would have been followed if
the constraints were enforced. One probably cannot trust the
data after that event. In other words, one is not really
simulating the system that one intends (at least one of the
rigid bonds is not rigid for that step).

The messaage from me about SHAKE failure is to concentrate on your
system to see what is causing the SHAKE failures since your
trajectory is probably meaningless with them.

If you are interested in a system with behavior that exceeds the
SHAKE limitations, then you must turn SHAKE off and use a much smaller
time step.

Finally, I suggest that the reason that you might see different
behavior on different machines is that the propagation of numerical
error in an interative process will differ depending on how
the instructions are executed on different machines. With optimized
code, the CPUs on different machines are doing things in
different orders, combining operations in different ways that
are algebreicaly correct, but when done many times over will
accumulate roundoff error in different ways. There are ways to
track down what may be happening differently on different machines,
like turning off optimization to see if that is the source of
the divergence, but I think that your time is better spent
finding the source of you SHAKE failures.

Apologies for the long message, I hope it helps, and that it
is not misleading. Comments +- criticisms welcome.
Sincerely,
Mike