Amber masthead
Filler image AmberTools21 Amber20 Manuals Tutorials Force Fields Contacts History
Filler image

Useful links:

Amber Home
Download Amber
Installation
Amber Citations
GPU Support
Updates
Mailing Lists
For Educators
File Formats
Extending the Amber Force Field → Parameter Fitting with mdgx

Stage 2: Fitting Parameters

Stage 1 was kind of a lengthy one, especially compared to this. With the results from Stage 1 unpacked (or equivalent results of your own ready), plus the original topologies for Glycerol and MPD, the following input will accomplish the parameter fitting for three dihedrals and bond angles, with optimization of the angle equilibrium values:

&files
  -parm    <your Amber home directory>/dat/leap/parm/gaff.dat
  -d      carbofluid.dat
  -o      fit.out
&end

&param
  System  Glycerol_vac.top  Glycerol/coords.cdf  Glycerol/energies.dat
  System  MPD_vac.top            MPD/coords.cdf      MPD/energies.dat
  ParmOutput frcmod
  eunits    hartree,
  accrep    report.m
  verbose    1,

  % Angle fitting input
  fita       c3 oh ho
  fita       c3 c3 oh
  fita       c3 c3 c3
  FitAnglEq  1,
  arst       0.0002,
  arstcpl    1.0,

  % Torsion fitting input
  fith       c3 c3 c3 c3
  fith       hc c3 c3 oh
  fith       hc c3 c3 c3
  fith       oh c3 c3 oh
  fith       h1 c3 c3 oh
  fith       X  c3 oh X
  fith       c3 c3 oh ho
  hrst       0.0002,
&end

The data is presented in the System entries in our input and the contents are obvious. The output format is specified as "frcmod" meaning that a frcmod file will be written to record the optimized parameters. The name of this frcmod file is given as carbofluid.dat up in the &files namelist. It's important to point out that mdgx needs to know not just topologies for every system in the fitting set but also where all of the parameters came from in order to optimize anything. This is to make sure that it knows unambiguously which parameters are unique. If the initial guess force field involved a frcmod file of its own, this can be entered in the &files namelist with the -fmod keyword, any number of times.

That will run in a few seconds and use a linear least-squares fit to produce the best parameters for the figure of merit: match the relative quantum energies of each conformation subject to penalties on how big the dihedral amplitudes are allowed to get or how far from their original values in the guess (GAFF) force field the bond angles are allowed to go. What I just said is a fairly loaded statement: I'll explain the input and restraints in the next paragraph. But, the part about these being the best possible parameters with respect to the figure of merit holds: as long as you can state what you want in a way that fits in a set of linear equations, which we have, linear least squares is the unbeatable method of choice. There are well known problems with least squares fits, but in my years doing this I've learned enough of how these problems manifest themselves and I've tried to provide enough tools in mdgx to keep them in check.

The most important controls are contained in the arst and hrst inputs: these provide the stiffness of weights on all bond angle and torsion terms, holding the fitted values of angles to the GAFF presets and torsions to zero, respectively. These are pretty good values to use, too: it's about what I roll with. As was the case in charge fitting, the size of the data set is irrelevant: these restraints will scale with not just the data set size but the number of instances of each parameter in the problem so that users can maintain a consistent restraints protocol even as their data set grows. For bond angles specifically, there is this "arstcpl" keyword: it sets the "exchange rate" between shifting the angle equilibrium and shifting the harmonic angle stiffness as it solves the optimization problem. I do recommend fitting angle equilibria (activated with FitAnglEq = 1), and just roll with arstcpl = 1.0.

The first thing to look at is the fita and fith inputs, which of course specify the angle and torsion parameters to fit. These are given by simply listing the atom types: in the case of torsions, there may be more than one term in the cosine series, and thus multiple optimizable parameters. All such parameters will be activated by naming the four atom types, and one cannot select optimization in particular terms of a series. Even with the ability to optimize particular torsions at will, there's still a question of which torsions, because it's not always obvious what the model contains. One trick for dealing with this is to run the fitting once with torsions 1 in the input--that will activate all torsions for optimization, and prompt analysis and outputs that can be used to track down the parameters controlling the motions that we sampled in Stage 1. The following paragraph will explain the nature of this information in section 7 of the output. (In fact, I listed all torsions in these systems with the fith commands, and as we'll see it was a safe thing to do, but it's good to keep a leash on the thing.)

I would recommend you not just take the resulting frcmod and run with it: spend a moment to look over the output, and to plot the fitting results in Matlab or Octave. The mdgx program offers a lot of information on how the fitting went and what each parameter is doing. The simplest metrics are given in section 1: the fits for glycerol and MPD improved from 2.37 and 2.70 kcal/mol RMS error to 1.86 and 2.18, respectively. This is not that big an improvement--the GAFF parameters are already doing a pretty good job. It's a bit surprising that the numbers are not smaller for molecules of this size, but we're dealing with vicinal and (2,4) hydroxyls: the errors are probably dominated by the electrostatics and the crudeness of point partial charges. If we look a little further, section 2 gives us a look at the actual fitted parameter values but more importantly the sampling along each relevant dimension. The output shows that all of these fitted parameters are well sampled throughout the ranges they might occupy in a simulation: this is a good indicator that there is at least no catastrophic over-fitting in the result. The minor warning in section 3 is not a big deal‐at least there is data on both sides of the fitted equilibrium angle value and the program can see the energy getting better, bottoming out, and then getting worse as the angle stretches. Section 5 indicates that the fitting problem is reasonably well behaved in this case. Section 6 is not for beginners to fret over--I don't even put a lot of time into that one but I track a lot of things in case they might show me something useful. Section 7 is a place to spend more time: here you can see the places that each parameter occurs, including all instances in every system, and the sampling in each case. This is where you'll discover things you might not have known, like "Wow, the torsion spanning types A1 B2 C3 D4 is in this sugar molecule and also this napthalene derivative I'm fitting? Do I really want that?" It will also alert you if the parameter samples one subspace of its overall domain in one system and samples the rest in another, a situation which could lead to over-fitting even though things looked good up in section 2. In this case, all parameters are well sampled in all systems with the exception of one: the X-c3-oh-X torsion, which is thoroughly sampled in Glycerol but not so well in the one place it appears in MPD. We can now apply our better judgment: because glycerol has more examples of this torsion and the chemical context in glycerol is not much different from MPD, there is not much to worry about in this case. Just to draw your attention to it, here it is in lines 318-324 of the fitting output.

 [ Parameter ] [ Atom types ]                   Sampling Histogram
  Residue (Atom Names)                   Count   MIN    +0    MAX    Warnings
 --------------------------------------- -----  ------------------   --------
...
 DIHE X  c3 oh X
  GOL (HO1  O1  C1  H11 )                  324  XXXXXXXXXXXXXXXXXX
  GOL (HO1  O1  C1  H12 )                  324  XXXXXXXXXXXXXXXXXX
  GOL (HO2  O2  C2  H2  )                  324  XXXXXXXXXXXXXXXXXX
  GOL (H32  C3  O3  HO3 )                  324  XXXXXXXXXXXXXXXXXX
  GOL (H31  C3  O3  HO3 )                  324  XXXXXXXXXXXXXXXXXX
  MPD (H4   C4  O4  HO4 )                  324       XXXUoe:XO  ::

Summarizing Stage 2: This is pretty simple. Get fitting data, run fitting program. There's still a bit to go, but it's just to repeat what we've done with a more educated guess as to the best parameter set. Users familiar with the ParamFit program may recall an important constant "K" that needed to be derived for each system. This is all handled internally by mdgx (see the output section 4), but the principle is similar (the way mdgx actually handles it is a product of several vigorous discussions, but this is one place where the "finesse" comes in). I hope the output is useful to you in your own parameter development--it was a major source of ideas when we did our IPolQ protein force fields.

Go on to Part 3

"How's that for maxed out?"

Last modified: Feb 7, 2020