Stage 2: Fitting Parameters

Stage 1 was kind of a lengthy one, especially compared to this. With the results from Stage 1 unpacked (or equivalent results of your own ready), plus the original topologies for Glycerol and MPD, the following input will accomplish the parameter fitting for three dihedrals and bond angles, with optimization of the angle equilibrium values:

&files -parm <your Amber home directory>/dat/leap/parm/gaff.dat -d carbofluid.dat -o fit.out &end &param System Glycerol_vac.top Glycerol/coords.cdf Glycerol/energies.dat System MPD_vac.top MPD/coords.cdf MPD/energies.dat ParmOutput frcmod eunits hartree, accrep report.m verbose 1, % Angle fitting input fita c3 oh ho fita c3 c3 oh fita c3 c3 c3 FitAnglEq 1, arst 0.0002, arstcpl 1.0, % Torsion fitting input fith c3 c3 c3 c3 fith hc c3 c3 oh fith hc c3 c3 c3 fith oh c3 c3 oh fith h1 c3 c3 oh fith X c3 oh X fith c3 c3 oh ho hrst 0.0002, &end

The data is presented in the System entries in our input and the contents are obvious. The output format is specified as "frcmod" meaning that a frcmod file will be written to record the optimized parameters. The name of this frcmod file is given as carbofluid.dat up in the &files namelist. It's important to point out that mdgx needs to know not just topologies for every system in the fitting set but also where all of the parameters came from in order to optimize anything. This is to make sure that it knows unambiguously which parameters are unique. If the initial guess force field involved a frcmod file of its own, this can be entered in the &files namelist with the -fmod keyword, any number of times.

That will run in a few seconds and use a linear least-squares fit to produce the best parameters for the figure of merit: match the relative quantum energies of each conformation subject to penalties on how big the dihedral amplitudes are allowed to get or how far from their original values in the guess (GAFF) force field the bond angles are allowed to go. What I just said is a fairly loaded statement: I'll explain the input and restraints in the next paragraph. But, the part about these being the best possible parameters with respect to the figure of merit holds: as long as you can state what you want in a way that fits in a set of linear equations, which we have, linear least squares is the unbeatable method of choice. There are well known problems with least squares fits, but in my years doing this I've learned enough of how these problems manifest themselves and I've tried to provide enough tools in mdgx to keep them in check.

The most important controls are contained in the arst and hrst inputs: these provide the stiffness of weights on all bond angle and torsion terms, holding the fitted values of angles to the GAFF presets and torsions to zero, respectively. These are pretty good values to use, too: it's about what I roll with. As was the case in charge fitting, the size of the data set is irrelevant: these restraints will scale with not just the data set size but the number of instances of each parameter in the problem so that users can maintain a consistent restraints protocol even as their data set grows. For bond angles specifically, there is this "arstcpl" keyword: it sets the "exchange rate" between shifting the angle equilibrium and shifting the harmonic angle stiffness as it solves the optimization problem. I do recommend fitting angle equilibria (activated with FitAnglEq = 1), and just roll with arstcpl = 1.0.

The first thing to look at is the fita and fith inputs, which of course specify the angle and torsion parameters to fit. These are given by simply listing the atom types: in the case of torsions, there may be more than one term in the cosine series, and thus multiple optimizable parameters. All such parameters will be activated by naming the four atom types, and one cannot select optimization in particular terms of a series. Even with the ability to optimize particular torsions at will, there's still a question of which torsions, because it's not always obvious what the model contains. One trick for dealing with this is to run the fitting once with torsions 1 in the input--that will activate all torsions for optimization, and prompt analysis and outputs that can be used to track down the parameters controlling the motions that we sampled in Stage 1. The following paragraph will explain the nature of this information in section 7 of the output. (In fact, I listed all torsions in these systems with the fith commands, and as we'll see it was a safe thing to do, but it's good to keep a leash on the thing.)

I would recommend you not just take the resulting frcmod and run with it: spend a moment to look over the output, and to plot the fitting results in Matlab or Octave. The mdgx program offers a lot of information on how the fitting went and what each parameter is doing. The simplest metrics are given in section 1: the fits for glycerol and MPD improved from 2.37 and 2.70 kcal/mol RMS error to 1.86 and 2.18, respectively. This is not that big an improvement--the GAFF parameters are already doing a pretty good job. It's a bit surprising that the numbers are not smaller for molecules of this size, but we're dealing with vicinal and (2,4) hydroxyls: the errors are probably dominated by the electrostatics and the crudeness of point partial charges. If we look a little further, section 2 gives us a look at the actual fitted parameter values but more importantly the sampling along each relevant dimension. The output shows that all of these fitted parameters are well sampled throughout the ranges they might occupy in a simulation: this is a good indicator that there is at least no catastrophic over-fitting in the result. The minor warning in section 3 is not a big deal‐at least there is data on both sides of the fitted equilibrium angle value and the program can see the energy getting better, bottoming out, and then getting worse as the angle stretches. Section 5 indicates that the fitting problem is reasonably well behaved in this case. Section 6 is not for beginners to fret over--I don't even put a lot of time into that one but I track a lot of things in case they might show me something useful. Section 7 is a place to spend more time: here you can see the places that each parameter occurs, including all instances in every system, and the sampling in each case. This is where you'll discover things you might not have known, like "Wow, the torsion spanning types A1 B2 C3 D4 is in this sugar molecule and also this napthalene derivative I'm fitting? Do I really want that?" It will also alert you if the parameter samples one subspace of its overall domain in one system and samples the rest in another, a situation which could lead to over-fitting even though things looked good up in section 2. In this case, all parameters are well sampled in all systems with the exception of one: the X-c3-oh-X torsion, which is thoroughly sampled in Glycerol but not so well in the one place it appears in MPD. We can now apply our better judgment: because glycerol has more examples of this torsion and the chemical context in glycerol is not much different from MPD, there is not much to worry about in this case. Just to draw your attention to it, here it is in lines 318-324 of the fitting output.

[ Parameter ] [ Atom types ] Sampling Histogram Residue (Atom Names) Count MIN +0 MAX Warnings --------------------------------------- ----- ------------------ -------- ... DIHE X c3 oh X GOL (HO1 O1 C1 H11 ) 324 XXXXXXXXXXXXXXXXXX GOL (HO1 O1 C1 H12 ) 324 XXXXXXXXXXXXXXXXXX GOL (HO2 O2 C2 H2 ) 324 XXXXXXXXXXXXXXXXXX GOL (H32 C3 O3 HO3 ) 324 XXXXXXXXXXXXXXXXXX GOL (H31 C3 O3 HO3 ) 324 XXXXXXXXXXXXXXXXXX MPD (H4 C4 O4 HO4 ) 324 XXXUoe:XO ::

Summarizing Stage 2: This is pretty simple. Get fitting data, run fitting program. There's still a bit to go, but it's just to repeat what we've done with a more educated guess as to the best parameter set. Users familiar with the ParamFit program may recall an important constant "K" that needed to be derived for each system. This is all handled internally by mdgx (see the output section 4), but the principle is similar (the way mdgx actually handles it is a product of several vigorous discussions, but this is one place where the "finesse" comes in). I hope the output is useful to you in your own parameter development--it was a major source of ideas when we did our IPolQ protein force fields.

Go on to Part 3

"How's that for maxed out?"

Last modified: Jul 25, 2023