Stage 2: Fitting Parameters
Stage 1 was kind of a lengthy one, especially compared to this. With the
results from Stage 1 unpacked
(or equivalent results of your own ready), plus the original topologies for
Glycerol and MPD, the
following input will accomplish the parameter fitting for three dihedrals and
bond angles, with optimization of the angle equilibrium values:
&files
-parm <your Amber home
directory>/dat/leap/parm/gaff.dat
-d carbofluid.dat
-o fit.out
&end
¶m
System Glycerol_vac.top Glycerol/coords.cdf
Glycerol/energies.dat
System MPD_vac.top
MPD/coords.cdf MPD/energies.dat
ParmOutput frcmod
eunits hartree,
accrep report.m
verbose 1,
% Angle fitting input
fita c3 oh ho
fita c3 c3 oh
fita c3 c3 c3
FitAnglEq 1,
arst 0.0002,
arstcpl 1.0,
% Torsion fitting input
fith c3 c3 c3 c3
fith hc c3 c3 oh
fith hc c3 c3 c3
fith oh c3 c3 oh
fith h1 c3 c3 oh
fith X c3 oh X
fith c3 c3 oh ho
hrst 0.0002,
&end
The data is presented in the System entries in our input and
the contents are obvious. The output format is specified as
"frcmod" meaning that a frcmod file will be written to record the
optimized parameters. The name of this frcmod file is given as
carbofluid.dat up in the &files namelist. It's
important to point out that mdgx needs to know not just topologies
for every system in the fitting set but also where all of the parameters
came from in order to optimize anything. This is to make sure that it
knows unambiguously which parameters are unique. If the initial guess force
field involved a frcmod file of its own, this can be entered in
the &files namelist with the -fmod keyword, any
number of times.
That will run in a few seconds and use a
linear least-squares
fit to produce the best parameters for the figure of merit: match the
relative quantum energies of each conformation subject to penalties on how big
the dihedral amplitudes are allowed to get or how far from their original
values in the guess (GAFF) force field the bond angles are allowed to go. What
I just said is a fairly loaded statement: I'll explain the input and restraints
in the next paragraph. But, the part about these being the best possible
parameters with respect to the figure of merit holds: as long as you can
state what you want in a way that fits in a set of linear equations, which we
have, linear least squares is the unbeatable method of choice. There are well
known problems with least squares fits, but in my years doing this I've learned
enough of how these problems manifest themselves and I've tried to provide
enough tools in mdgx to keep them in check.
The most important controls are contained in the arst and
hrst inputs: these provide the stiffness of weights on all bond
angle and torsion terms, holding the fitted values of angles to the GAFF
presets and torsions to zero, respectively. These are pretty good values to
use, too: it's about what I roll with. As was the case in charge fitting, the
size of the data set is irrelevant: these restraints will scale with not
just the data set size but the number of instances of each parameter in the
problem so that users can maintain a consistent restraints protocol even as
their data set grows. For bond angles specifically, there is this
"arstcpl" keyword: it sets the "exchange rate" between shifting
the angle equilibrium and shifting the harmonic angle stiffness as it solves
the optimization problem. I do recommend fitting angle equilibria (activated
with FitAnglEq = 1 ), and just roll with
arstcpl = 1.0 .
The first thing to look at is the fita and fith
inputs, which of course specify the angle and torsion parameters to fit. These
are given by simply listing the atom types: in the case of torsions, there may
be more than one term in the cosine series, and thus multiple optimizable
parameters. All such parameters will be activated by naming the four atom
types, and one cannot select optimization in particular terms of a series.
Even with the ability to optimize particular torsions at will, there's still a
question of which torsions, because it's not always obvious what the
model contains. One trick for dealing with this is to run the fitting once
with torsions 1 in the input--that will activate all torsions for
optimization, and prompt analysis and outputs that can be used to track down
the parameters controlling the motions that we sampled in Stage 1. The
following paragraph will explain the nature of this information in section 7
of the output. (In fact, I listed all torsions in these systems with the
fith commands, and as we'll see it was a safe thing to do, but
it's good to keep a leash on the thing.)
I would recommend you not just take the
resulting frcmod and run with it: spend a
moment to look over the output, and to plot the
fitting results in
Matlab or
Octave. The
mdgx program offers a lot of information on how the fitting went
and what each parameter is doing. The simplest metrics are given in section 1:
the fits for glycerol and MPD improved from 2.37 and 2.70 kcal/mol RMS error to
1.86 and 2.18, respectively. This is not that big an improvement--the GAFF
parameters are already doing a pretty good job. It's a bit surprising that the
numbers are not smaller for molecules of this size, but we're dealing with
vicinal and (2,4) hydroxyls: the errors are probably dominated by the
electrostatics and the crudeness of point partial charges. If we look a little
further, section 2 gives us a look at the actual fitted parameter values but
more importantly the sampling along each relevant dimension. The output
shows that all of these fitted parameters are well sampled throughout the
ranges they might occupy in a simulation: this is a good indicator that there
is at least no catastrophic over-fitting in the result. The minor warning in
section 3 is not a big deal‐at least there is data on both sides of the
fitted equilibrium angle value and the program can see the energy getting
better, bottoming out, and then getting worse as the angle stretches. Section
5 indicates that the fitting problem is reasonably well behaved in this case.
Section 6 is not for beginners to fret over--I don't even put a lot of time
into that one but I track a lot of things in case they might show me something
useful. Section 7 is a place to spend more time: here you can see the places
that each parameter occurs, including all instances in every system, and the
sampling in each case. This is where you'll discover things you might not have
known, like "Wow, the torsion spanning types A1 B2 C3 D4 is in
this sugar molecule and also this napthalene derivative I'm fitting? Do I
really want that?" It will also alert you if the parameter samples one subspace
of its overall domain in one system and samples the rest in another, a
situation which could lead to over-fitting even though things looked good up in
section 2. In this case, all parameters are well sampled in all systems
with the exception of one: the X-c3-oh-X torsion, which is
thoroughly sampled in Glycerol but not so well in the one place it appears in
MPD. We can now apply our better judgment: because glycerol has more examples
of this torsion and the chemical context in glycerol is not much different from
MPD, there is not much to worry about in this case. Just to draw your
attention to it, here it is in lines 318-324 of the
fitting output.
[ Parameter ] [ Atom types ]
Sampling Histogram
Residue (Atom Names)
Count MIN +0 MAX
Warnings
--------------------------------------- ----- ------------------
--------
...
DIHE X c3 oh X
GOL (HO1 O1 C1 H11 )
324 XXXXXXXXXXXXXXXXXX
GOL (HO1 O1 C1 H12 )
324 XXXXXXXXXXXXXXXXXX
GOL (HO2 O2 C2 H2 )
324 XXXXXXXXXXXXXXXXXX
GOL (H32 C3 O3 HO3 )
324 XXXXXXXXXXXXXXXXXX
GOL (H31 C3 O3 HO3 )
324 XXXXXXXXXXXXXXXXXX
MPD (H4 C4 O4 HO4 )
324
XXXUoe:XO ::
Summarizing Stage 2: This is pretty simple. Get fitting data, run
fitting program. There's still a bit to go, but it's just to repeat what we've
done with a more educated guess as to the best parameter set. Users familiar
with the ParamFit program may recall an important constant "K"
that needed to be derived for each system. This is all handled internally by
mdgx (see the output section 4), but the principle is similar (the
way mdgx actually handles it is a product of several vigorous
discussions, but this is one place where the "finesse" comes in). I hope the
output is useful to you in your own parameter development--it was a major
source of ideas when we did our IPolQ protein force fields.
Go on to Part 3
|