Stage 1: The Molecule in Many, Many Guises

The mdgx &configs module will be called into action once more, in much the same capacity as at the beginning of the IPolQ tutorial. This time, we will generate many more configurations of two different molecules with the aim of sampling a number of degrees of freedom, decorrelating the motions, and producing inputs ready for quantum calculations. This showcases the power of the &configs module, and in fact it's just warming up. In principle, this is nothing you couldn't accomplish with a shell script writing input files and calling sander, but it would take half an hour to set up and several minutes to run. After doing that myself a few hundred times, I wrote the mdgx &configs module.

The following input will create the conformations we need based on some initial coordinates for Glycerol using the topology with vacuum phase charges as given in the corresponding .prepi file. As was stated in the IPolQ tutorial, we will be doing the quantum calculations in vacuum, so we want to make the conformations and then do the parameter fitting in the context of vacuum phase charges. We will substitute the IPolQ charges later. There's a lot of conformations, but don't despair‐these QM calculations are fast (about 75 seconds each on a 2.6GHz CPU made in 2010). The conformations sample the major heavy atom dihedral in glycerol and also the rotation of the central hydroxyl group: both of these motions are sampled independently thanks to the combine 1 2 command (see the mdgx &configs keywords glossary by running $AMBERHOME/bin/mdgx -CONFIGS). While sampling the dihedrals, a few key angles are pulled towards random values up to 20 degrees away from their expected equilibria. This may seem pretty savage, but the NMR restraints that mdgx is using to accomplish this are harmonic potentials that work against the existing harmonic potentials already present in the topology created with our initial guess for the force field (in this case GAFF). Only a few of the angles in the final structures will depart from their standard configurations by more than 10 degrees. Note that the operations specified by GridSample, RandomPerturb, and similar keywords get labeled by numbers in the order they appear so that the combine (alias LinkOperations) keyword has a handle on them. The user can also use the label attribute available to each operation to give them unique and intuitive names for managing very elaborate manipulations.

&files -p Glycerol_vac.top -c Glycerol.crd -o GenConformers.out &end &configs GridSample :1@O1 :1@C1 :1@C2 :1@O2 { -180.0 180.0 } Krst 64.0 GridSample :1@C1 :1@C2 :1@O2 :1@HO2 { -180.0 180.0 } Krst 64.0 RandomSample :1@C2 :1@O2 :1@HO2 { 90.0 130.0 } Krst 256.0 RandomSample :1@C1 :1@C2 :1@O2 { 90.0 130.0 } Krst 256.0 RandomSample :1@C1 :1@C2 :1@C3 { 90.0 130.0 } Krst 256.0 combine 1 2 count 324 verbose 1 % Controls on the quantum mechanical operations qmlev 'MP2', basis 'cc-pvDZ', % Output controls: TAKE NOTE, this will generate a lot of files so % do this in a clean directory that you won't ls very often. outbase 'Conf', 'Conf' write 'pdb', 'orca' outsuff 'pdb', 'orca' &end

The point of generating so many conformations is not to make this tutorial excessively long: rather, it is to emphasize that one must aggressively sample the conformations of a molecule to ensure that a set of conjured parameters is correctly calibrated. A cross-section of the conformations this job file will create is show in the figure below.

If you have six processors to devote to the problem, a simple shell script will take care of them all while you go to lunch. For harder systems and more accurate calculations, it will be necessary to call upon cluster resources and batch submissions, but this is the nature of the problem. Hopefully, mdgx can take care of the work of creating the quantum input files. As the shell script below shows, however, the complete quantum outputs should be saved in order to take advantage of another new feature of mdgx.

#!/bin/bash NJOB=0 for I in `seq 1 324` ; do <ORCA executable> Conf${I}.orca > Conf${I}.oout & let "NJOB+=1" if [ ${NJOB} -eq 6 ] ; then NJOB=0 wait fi done

For those that do not want to crunch the numbers on their own, here are the glycerol conformations and here are their ORCA outputs. Beware: lots of files! Your results may differ slightly as the initial coordinates I've provided weren't the same ones I took in my own pass through the tutorial, but there should be no trouble getting a good spread of conformations.

I've said that parameters do not exist in isolation, but there's more: most parameters in Amber are shared between more than one molecule because they are bound to atom types with broader application. One cannot simply retune parameters for one case and expect them to still serve their original purpose. However, mdgx makes it very simple to optimize parameters for multiple systems at once. The &param module we are about to employ was written with the expectation of an ever-expanding data set, and it has been used to fit nearly a thousand parameters for several hundred combinations of amino acids simultaneously, involving more than a quarter million data points.

We'll also fit parameters for the larger molecule methyl(2,4)-pentanediol (MPD) in this exercise. As shown in the figure below, MPD is has more hydrophobic character than glycerol, but the GAFF atom typing is the same for both molecules and so they share many bonded parameters. I made some IPolQ charges for that molecule as well: here are seed coordinates, its topology with vacuum phase charges, and the .prepi file that produced it. We can use those files, along with a sensible adaptation of the conformational search job file for glycerol, to sample conformations of MPD. Here is the input file, here is a tarball of the conformations, and here is a tarball of the ORCA outputs. These calculations each took about 7.5 minutes on our 2.6GHz CPU, about six times the cost of the glycerol conformations but still pretty short.

With all of the calculations done, we now have to extract the energies. One could, of course, just grep for the energies inside each of the ORCA outputs and write them out next to the Amber coordinates of the conformations the correspond to. However, this leaves you vulnerable in a couple of ways: first, in order to script the process successfully, every calculation needed to work, and in a lot of cases, particularly when trying to get parameters for drug-sized molecules, the calculations can fail. Worse, there can be more insidious failures, warnings thrown deep inside the quantum outputs or matters as simple as a strained configuration obtaining a new chemical configuration that puts it on a whole new energy surface. A robust script to read the energies therefore needs to be fault tolerant as well as programmed to check for problems. This is why mdgx has its own solution: the &speval module. Given a topology (so that it knows the number of atoms) and a series of quantum outputs, mdgx is programmed to check for errors as it extracts coordinates and single point energies from quantum outputs. (We'll be doing more work under the hood as time goes on to improve the error checking, but if all goes well you won't notice anything.) The code you need to run it for the Glycerol case looks like this:

&files -p Glycerol_vac.top -o concat.out -d energies.dat -x coords.cdf &end &speval data Conf*.oout &end

This may look distressingly simple: all you need is a topology. A report on the process will be piped concat.out, a file equivalent to sander's mdout, while energies go into the "dump" file and coordinates get written to what coords.cdf, a trajectory like you would get in sander's default mdcrd. Inside the &speval module the only item that appears is a request to find files of the name format Conf*.oout. The trick here is to run this inside the directory where you've made those conformations; otherwise just extend that name format with the relative or absolute path and you're good to go. If you have multiple directories full of files, add more data entries. This operation elegantly reduces the hundreds of files created by conformational search and ORCA evaluations to just two files: coords.cdf, and energies.dat. Repeat for MPD. This tarball contains the distilled results for glycerol and MPD in separate directories.

To summarize Stage 1: We created a large number of conformations of two molecules, glycerol and MPD, which share a lot of chemical features and thus Amber (GAFF) atom types. We computed the MP2 single point energies of each of these conformations and compiled the coordinates into a compact, NetCDF format trajectory (for precision).

In practice: There are a lot of ways to improvise here, so long as you roll out of Stage 1 with a list of structures and the energies they correspond to according to your benchmark method (here, MP2 / cc-pvDZ). The next stage, and the mdgx &param module, are agnostic to the origins of this information. In this tutorial I have tried to impart what I feel are "best practices" by doing all of this in the context of the gas phase charge set. However, if you don't want to fit your own charges and have some that are descriptive of the solution phase instead, it's not a terrible thing to just roll with those.

For completeness: It is important to understand that the conformations mdgx has just produced are optimized with respect to the guess force field at this stage. The partial charges will not be changing due to anything that happens in this tutorial, and in practice it is sensible to make charges and then take them as given in order to make bonded parameters. But, in this example, we are performing MP2 single point calculations on GAFF-optimized structures. This business of getting the QM energies of MM-optimized structures doesn't sit well with some people, and I have every sympathy: it's not ideal in my mind, either. However, as I showed in the 2014 paper it is probably the lesser of two evils. Those who disagree with my approach prefer to take MM-optimized structures and make their relative MM energies agree with the QM single point energies of similar QM optimized structures: these QM-optimized structures will typically have identical coordinates in the handful of parameters that are being fitted, but the other degrees of freedom have been optimized with respect to the QM and MM force fields, respectively. If one wishes to pursue the other route, it is still possible with mdgx: all you need is a set of coordinates and energies, the code doesn't care how you got them.

Go on to Part 2

"How's that for maxed out?"

Last modified: Apr 29, 2025