pKa Calculations using Thermodynamic Integration

(Note: These tutorials are meant to provide illustrative examples of how to use the AMBER software suite to carry out simulations that can be run on a simple workstation in a reasonable period of time. They do not necessarily provide the optimal choice of parameters or methods for the particular application area.)
Copyright Ross Walker 2006

pKa Calculations using Thermodynamic Integration - SECTION 2

pKa Calculations Using Thermodynamic Integration

By Ross Walker & Mike Crowley

2) Run thermodynamic integration on the model compound to find delta-G(model)

We will now run the thermodynamic integration calculation on the model compound. In sander, thermodynamic integration is done in a parallel fashion using an internal code called multi-sander. This is a wrapper around the parallel version of sander that runs multiple copies of sander within a single parallel run. For reason you must have the parallel version of sander (sander.MPI) installed and tested on your system. You must also know how to startup any required MPI demons on your system and successfully execute a parallel sander run on two cpus.

We will run a total of 5 values of lambda for 3ns each on the model system. This is the same as they did in the paper. We will use igb=5, shake all hydrogens (ntc=2, ntf=2), use a 1fs time step (dt=0.001). In the paper they used the Berendsen thermostat for temperature control. This, however, is no longer recommended. The preferred method of temperature control is now to use Langevin dynamics (ntt=3). Since this is only a small model system we can likely get away without having to minimise our system before doing MD.

We will run 5 simulations here. The first will not be a restart (irest=0, ntx=1). The other 4 will be restarts. The difference in each case will be the value of clambda. This is explained in more detail in the "Thermodynamic Integration" chapter of the Amber Reference Manual. The first simulation (irest=0) will use clambda=0.0. The other 4 will use clambda=0.11270, 0.5000, 0.88729 and 1.0000 respectively. Below is an example mdin file for the second step. You should setup the 5 mdin files (remembering that step1 has ntx=1 and irest=0). The ntave=1000 causes sander to write a running average every 1000 steps. This makes it easier to check for equilibration.

model_step2.mdin

   model ASH to ASP for pKa calculation
 &cntrl
   nstlim =3000000, nscm=2000,
   ntx=5, irest=1, ntpr=1000,
   tempi=300.0, temp0=300.0, ntt=3, gamma_ln=5.0,
   ntb=0, igb=5, cut=999.0,
   dt=0.001, 
   ntc=2, ntf=2, 
   ntwr = 10000, ntwx=1000, ntave=1000
   icfe=1, clambda=0.11270,
 /

We will also need a group file for each of the 5 runs. This file specifies the command line options for each of the sander threads that will be run. Here is the example for step 2:

model_step2.group

-O -i model_step2.mdin -p asph_model.prmtop -c asph_model_step1.rst -o asph_model_step2.out -inf asph.mdinfo -x asph_model_step2.mdcrd -r asph_model_step2.rst
-O -i model_step2.mdin -p asp_model.prmtop -c asp_model_step1.rst -o asp_model_step2.out -inf asp.mdinfo -x asp_model_step2.mdcrd -r asp_model_step2.rst

Once you have created all 5 files you can run all 5 jobs in a single run script. E.g.:

run_model.x

mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step1.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step2.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step3.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step4.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step5.group

The -np specifies 2 cpus. This can in principle be any even number and it will split the processors between the groups. E.g. if you have a 4 processor machine or cluster you could do -np 4 to use all 4 processors.

This calculation takes approximately 4400 s (1.22 hours) to run all 5 lambdas on a Pentium-D 3.2GHz Machine.

Here is a tar file containing the output files: model.tar.gz.(12.25 MB)

2.1) Check for convergence

Next we need to check if our initial 3ns runs were sufficient to equilibrate our system. Realistically we should do a more in-depth analysis that what we will do in this tutorial but what we do here should be sufficient for our purposes. The first thing we should do is plot the 1000 step instantaneous average ands cumulative averages for the 5 lambda values. We can write a simple awk script to extract the relevant data from the output files:

dvdl.awk

BEGIN{ isd=0; epsum=0; npts=0 }

/Molecular dynamics:/{
    while( $1 != "nstlim") getline;
    split($3,a,",");
    nstlim = a[1];
}

/DV\/DL,/{ isd=1}

/NSTEP/{ if(isd == 1 ){nstep = $3 ; tim = $6; }}

/EPtot/ {
    if(isd == 1 ) {
	if( nstep >= nstart ) {
	    npts += 1;
	    eptot = $9;
	    epsum = epsum + $9;
	    print tim,eptot,epsum/npts ;
	}
	isd = 0;
    }
}

You should run this awk script on each of the 5 lambda value output files (it doesn't matter if you choose the ASP or ASPH ones as they should be identical) to obtain lists of the 1000 step instantaneous DV/DL values and cumulative average DV/DL values. E.g.:

awk -f dvdl.awk asp_model_step1.out > step1.dat

Firstly you should tail the end of each stepx.dat file check that the two DV/DL values on the last line match. This is the overall average calculated by sander and the full cumulative average calculated by the awk script. If the values don't match then something has gone horribly wrong and you should carefully check your output files for signs of error. E.g.

tail -n 5 step1.dat

2997.000 -58.0952 -57.8712
2998.000 -58.0155 -57.8712
2999.000 -58.2206 -57.8713
3000.000 -58.0181 -57.8714
3000.000 -57.8714 -57.8714

Note, due to rounding differences between different architectures you may have slightly different results to me. If we have equilibrated for long enough, however, the final results we obtain should be the same to within the sampling error.

You can now open xmgrace (or your own favourite graphing program) and check each of the 5 stepx.dat files for convergence. Run xmgrace from the command line and then choose Data->import->ascii select the filename and for the Load as box select NXY. (Click on the images for a larger view)

Lambda = 0.0	Lambda = 0.11270
Lambda = 0.5	Lambda = 0.88729
Lambda = 1.0

As you can from these images the cumulative average (shown in red) for all 5 runs has converged to what could be considered acceptable tolerances. Note the large number of jumps in the instantaneous average values. This is why it is important to run a long time scale simulation with sufficient data collection. Since our runs appear to have converged we can now use them to extract the data necessary to calculate delta-G(model).

2.2) Calculate delta-G(model)

Now we have the output files we can calculate delta-G(model) by using equation 5 from the paper to find dG/dL:

Then we numerically integrate this over the range of lambda from zero to one to get delta-G(model).

The fact that we use the average value of dU/dL (DV/DL) is why we needed to confirm convergence of our data. We now need to select which portion of our data we will extract for use in the above equation. From the graphs above we will discard the first nanosecond of each 3 ns run as being the equilibration phase. Thus we need to process our output files again and only extract the data for the last 2,000,000 steps (1 fs time step) from each file. Our awk script from above can do this for us. We just have to tell it what step to start on:

awk -f dvdl.awk nstart=1000000 asp_model_step1.out > prod1.dat

files: prod1.dat, prod2.dat, prod3.dat, prod4.dat, prod5.dat

You should visualise these files in xmgrace again to verify that the cumulative average looks good.

Calculating deltaG(model) by Gaussian Quadrature

We will now calculate the value of deltaG(model) by estimating the following integral based on the data from our calculations above:

We use the equation shown below to calculate the area under DV/DL. We do this using Gaussian quadrature which is significantly more accurate than using a trapezoid approach:

The following table summarizes the data we require which is the final line of prod2-4.dat which is the cumulative average over each 2,000 step block. You should refer to page 155 of the Amber 9 manual for more information on the origin of the Lambda and W_i values:

Step	1	2	3	4	5
<DV/DL>	-57.8765	-58.3322	-59.6629	-61.7819	-62.2840
Lambda	0.0000	0.1127	0.5000	0.8873	1.0000
W_i	0.00000	0.27777	0.44444	0.27777	0.00000

Since the value of W_i is zero for lambda values of 0.0 and 1.0, these two steps don't contribute to the value of the integral. Thus we could actually have saved our selves simulation time by not calculating the end points. However, it was good practice and it would have made it much easier to debug things if we had problems with the simulation. Thus the value of the integral above is:

deltaA = (-58.3322 * 0.27777) + (-59.6629 * 0.44444) + (-61.7819 * 0.27777) = -59.8807 kcal / Mol

In this case deltaA = deltaG thus our value for the deltaG(model) is -59.8807 kcal/Mol. Unfortunately since this isn't a deltaG value at constant temperature and pressure we can't use it to directly calculate the pKa of our model. However, we can use it combined with the protein simulation to obtain a deltadeltaG from which we can calculate a delta pKa (pKa shift) for the residue in the protein matrix.

We are now ready to go to section 3 where we will build the input files for the AspH and Asp- thioredoxin protein.

CLICK HERE TO GO TO SECTION 3