Date: Sun, 4 Aug 2002 11:34:20 -0700
From: David Case
Subject: problems with parallel sander on linux clusters?
Some people on the list have noted that amber6 sander seems more stable in
parallel runs on Linux clusters than amber7 sander, and/or that
communications would crash at odd times.
We have not been able to track this down, or to reproduce the problem
reliably. I asked Carlos Simmerling if they had seen problems with sander7;
his response is below.
So, a few suggestions:
1. follow as closely as you can the setup described on the Simmerling group
web page (see the link on "running amber on linux clusters" on the Amber
web page).
2. if that still fails, consider trying LAM instead of MPICH -- see the
recent posting with results that seemed to show LAM was more reliable.
3. Obviously, if anyone figures out what is causing this, we would love to
know. The MPI calls in sander all look kosher to me, but there could
certainly be a buffer overrun or wrong parameter somewhere that only
triggers a problem when something else happens as well....
4. Run relatively short jobs (50 hours?) and restart. Use ntwr to save
restart files every 10000 steps or so; that should enable you to limp along.
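
As a rough illustration of suggestion 4, here is a minimal Python sketch that
builds a chain of short sander runs, each one starting from the restart file
written by the previous one. The file names (md.in, prmtop, inpcrd, mdNNN.*)
and the helper segment_commands are made up for the example; only the sander
flags -O, -i, -o, -p, -c, -r and -x are real. It assumes md.in is already set
up as a continuation input with ntwr set as described above, and it only
prints the serial command lines; how you launch them under MPI depends on
your cluster.

```python
def segment_commands(n_segments, prmtop="prmtop", start_coords="inpcrd",
                     mdin="md.in"):
    """Return one sander command (as an argument list) per short segment.

    All file names here are hypothetical placeholders. Each segment reads
    the restart file written by the previous one, so a crash costs at most
    one segment (less if ntwr saved an intermediate restart file).
    """
    commands = []
    coords = start_coords
    for i in range(1, n_segments + 1):
        restart = f"md{i:03d}.rst"
        commands.append([
            "sander", "-O",          # -O: overwrite existing output files
            "-i", mdin,              # control input (contains ntwr, nstlim, ...)
            "-o", f"md{i:03d}.out",  # text output for this segment
            "-p", prmtop,            # topology file
            "-c", coords,            # starting coordinates for this segment
            "-r", restart,           # restart file written by this segment
            "-x", f"md{i:03d}.crd",  # trajectory for this segment
        ])
        coords = restart             # next segment continues from this restart
    return commands


if __name__ == "__main__":
    for cmd in segment_commands(3):
        print(" ".join(cmd))
```

If the very first segment starts from minimized coordinates without
velocities, it would need its own input file; the sketch glosses over that.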
...regards...dac
----- Forwarded message from Carlos Simmerling -----
From: "Carlos Simmerling"
Subject: Fw: problems with parallel amber on Linux clusters
It looks like nobody here has run into any problem with sander7.
We have run as long as 30 ns on 8 CPUs, about 300 hours
wallclock, without a crash. I'm not sure what could be wrong
with the other clusters...
Carlos
----- End forwarded message -----