(Note: These tutorials are meant to provide
illustrative examples of how to use the AMBER software suite to carry out
simulations that can be run on a simple workstation in a reasonable period of
time. They do not necessarily provide the optimal choice of parameters or
methods for the particular application area.)

Copyright Ross Walker 2006

**AMBER ADVANCED WORKSHOP TUTORIAL A6 - SECTION 2**

**pKa Calculations Using Thermodynamic Integration**

**By Ross Walker & Mike Crowley**

**2) Run thermodynamic integration on the model compound to find delta-G(model)**

We will now run the thermodynamic integration calculation on the model compound. In Amber 9 thermodynamic integration is done in parallel using multi-sander, a wrapper around the parallel version of sander that runs multiple copies of sander within a single parallel run. For this reason you must have the parallel version of sander (sander.MPI) installed and tested on your system. You must also know how to start up any required MPI daemons on your system and be able to successfully execute a parallel sander run on two CPUs. If you are doing this tutorial in the workshop the required software should already be set up and in your path, and the lamd MPI daemon should be running (check this with 'ps -aux'; if you don't see lamd running, start it with lamboot).

We will run a total of 5 values of lambda for 3 ns each on the model system, the same protocol as used in the paper. We will use igb=5, SHAKE all hydrogens (ntc=2, ntf=2) and use a 1 fs time step (dt=0.001). In the paper the Berendsen thermostat was used for temperature control. This, however, is no longer recommended; the preferred method of temperature control is now Langevin dynamics (ntt=3). Since this is only a small model system we can likely get away without minimising it before running MD.

We will run 5 simulations here. The first will not be a restart (irest=0, ntx=1); the other 4 will be restarts. The difference in each case will be the value of clambda. This is explained in more detail in the Amber manual. The first simulation (irest=0) will use clambda=0.0; the other 4 will use clambda=0.11270, 0.50000, 0.88729 and 1.00000 respectively. Below is an example mdin file for the second step. You should set up all 5 mdin files (remembering that step 1 has ntx=1 and irest=0). Setting ntave=1000 causes sander to write a running average every 1000 steps, which makes it easier to check for equilibration.

model_step2.mdin:

model ASH to ASP for pKa calculation
 &cntrl
  nstlim=3000000, nscm=2000,
  ntx=5, irest=1, ntpr=1000,
  tempi=300.0, temp0=300.0, ntt=3,
  gamma_ln=5.0, ntb=0, igb=5,
  cut=999.0, dt=0.001, ntc=2, ntf=2,
  ntwr=10000, ntwx=1000, ntave=1000,
  icfe=1, clambda=0.11270,
 /
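Since the five input files differ only in clambda, ntx and irest, they can also be generated programmatically. Below is a minimal Python sketch; the template mirrors the example file above and the clambda list and step-1 settings (ntx=1, irest=0) follow the text.

```python
# Sketch: build all five mdin files from one template.
# Only clambda, ntx and irest vary between steps.

TEMPLATE = """model ASH to ASP for pKa calculation
 &cntrl
  nstlim=3000000, nscm=2000,
  ntx={ntx}, irest={irest}, ntpr=1000,
  tempi=300.0, temp0=300.0, ntt=3,
  gamma_ln=5.0, ntb=0, igb=5,
  cut=999.0, dt=0.001, ntc=2, ntf=2,
  ntwr=10000, ntwx=1000, ntave=1000,
  icfe=1, clambda={clambda:.5f},
 /
"""

LAMBDAS = [0.00000, 0.11270, 0.50000, 0.88729, 1.00000]

def make_mdin(step, clambda):
    """Build the mdin text for one step; step 1 is a fresh start."""
    fresh = (step == 1)
    return TEMPLATE.format(ntx=1 if fresh else 5,
                           irest=0 if fresh else 1,
                           clambda=clambda)

mdins = {f"model_step{i}.mdin": make_mdin(i, c)
         for i, c in enumerate(LAMBDAS, start=1)}
```

Writing each entry of `mdins` to disk gives the five input files.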

We will also need a group file for each of the 5 runs. This file specifies the command line options for each of the sander threads that will be run. Here is the example for step 2:

model_step2.group:

-O -i model_step2.mdin -p asph_model.prmtop -c asph_model_step1.rst -o asph_model_step2.out -inf asph.mdinfo -x asph_model_step2.mdcrd -r asph_model_step2.rst
-O -i model_step2.mdin -p asp_model.prmtop -c asp_model_step1.rst -o asp_model_step2.out -inf asp.mdinfo -x asp_model_step2.mdcrd -r asp_model_step2.rst
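The group files for all 5 steps follow the same pattern, so they too can be generated rather than typed by hand. A sketch in Python; the file naming follows the example above, while the initial-coordinate file name used for step 1 (*_model.inpcrd) is an assumption - substitute whatever name section 1 of this tutorial produced.

```python
# Sketch: build one sander command line (one group-file line) per end state.

def group_line(state, step):
    """Command-line flags for one sander thread (state: 'asph' or 'asp')."""
    crd = (f"{state}_model_step{step - 1}.rst" if step > 1
           else f"{state}_model.inpcrd")  # hypothetical name for step 1
    return (f"-O -i model_step{step}.mdin -p {state}_model.prmtop "
            f"-c {crd} -o {state}_model_step{step}.out "
            f"-inf {state}.mdinfo -x {state}_model_step{step}.mdcrd "
            f"-r {state}_model_step{step}.rst")

# The two lines of model_step2.group:
print(group_line("asph", 2))
print(group_line("asp", 2))
```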

Once you have created all 5 files you can run all 5 jobs in a single run script. E.g.:

run_model.x:

mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step1.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step2.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step3.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step4.group
mpirun -np 2 $AMBERHOME/exe/sander.MPI -ng 2 -groupfile model_step5.group

The -np option specifies 2 CPUs. In principle this can be any even number, and the processors will be split evenly between the two groups. E.g. if you have a 4-processor machine or cluster you could use -np 4 to use all 4 processors.

This calculation takes approximately 4400 s (1.22 hours) to run all 5 lambdas on a Pentium-D 3.2GHz Machine.

Here is a tar file containing the output files: model.tar.gz (12.25 MB)

*2.1) Check for convergence*

Next we need to check whether our initial 3 ns runs were sufficient to equilibrate our system. Realistically we should do a more in-depth analysis than the one in this tutorial, but what we do here should be sufficient for our purposes. The first thing we should do is plot the 1000-step instantaneous averages and cumulative averages for the 5 lambda values. We can write a simple awk script to extract the relevant data from the output files:

dvdl.awk:

BEGIN{ isd=0; epsum=0; npts=0 }
/Molecular dynamics:/{ while( $1 != "nstlim" ) getline; split($3,a,","); nstlim = a[1]; }
/DV\/DL,/{ isd=1 }
/NSTEP/{ if( isd == 1 ){ nstep = $3; tim = $6; } }
/EPtot/{ if( isd == 1 ){ if( nstep >= nstart ){ npts += 1; eptot = $9; epsum = epsum + $9; print tim, eptot, epsum/npts; } isd = 0; } }

You should run this awk script on each of the 5 lambda value output files (it doesn't matter if you choose the ASP or ASPH ones as they should be identical) to obtain lists of the 1000 step instantaneous DV/DL values and cumulative average DV/DL values. E.g.:

awk -f dvdl.awk asp_model_step1.out > step1.dat

Firstly you should tail the end of each stepx.dat file and check that the two DV/DL values on the last line match. These are the overall average calculated by sander and the full cumulative average calculated by the awk script. If the values don't match then something has gone horribly wrong and you should carefully check your output files for signs of error. E.g.

tail -n 5 step1.dat

2997.000 -58.0952 -57.8712
2998.000 -58.0155 -57.8712
2999.000 -58.2206 -57.8713
3000.000 -58.0181 -57.8714
3000.000 -57.8714 -57.8714

Note, due to rounding differences between architectures you may get slightly different results from mine. If we have equilibrated for long enough, however, the final results should agree to within the sampling error.

You can now open xmgrace (or your favourite graphing program) and check each of the 5 stepx.dat files for convergence. Run xmgrace from the command line, choose Data->Import->ASCII, select the filename and in the 'Load as' box select NXY.

[Plots of the instantaneous and cumulative average DV/DL for Lambda = 0.0, 0.11270, 0.5, 0.88729 and 1.0]

As you can see from these plots, the cumulative average (shown in red) for all 5 runs has converged to what could be considered acceptable tolerances. Note the large number of jumps in the instantaneous average values; this is why it is important to run a long-time-scale simulation with sufficient data collection. Since our runs appear to have converged we can now use them to extract the data necessary to calculate delta-G(model).

*2.2) Calculate delta-G(model)*

Now that we have the output files we can calculate delta-G(model) using equation 5 from the paper, which gives dG/dlambda as the ensemble average of dV/dlambda at fixed lambda:

    dG/dlambda = <dV/dlambda>_lambda

We then numerically integrate this over the range of lambda from zero to one to get delta-G(model).

The fact that we use the average value of dU/dL (DV/DL) is why we needed to confirm convergence of our data. We now need to select which portion of our data we will extract for use in the above equation. From the graphs above we will discard the first nanosecond of each 3 ns run as being the equilibration phase. Thus we need to process our output files again and only extract the data for the last 2,000,000 steps (1 fs time step) from each file. Our awk script from above can do this for us. We just have to tell it what step to start on:

awk -f dvdl.awk nstart=1000000 asp_model_step1.out > prod1.dat

files: prod1.dat, prod2.dat, prod3.dat, prod4.dat, prod5.dat
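What the nstart variable does can be sketched in a few lines of Python: average DV/DL only over points whose step number is at or beyond nstart (here 1,000,000 steps = 1 ns into the 3 ns run). The data points below are made up for illustration.

```python
# Sketch: reproduce the awk script's production-phase average.

def production_average(points, nstart):
    """points: (step, dvdl) pairs; average those with step >= nstart."""
    vals = [v for t, v in points if t >= nstart]
    return sum(vals) / len(vals)

data = [(500000, -57.0),    # equilibration phase, discarded
        (1000000, -58.0),
        (2000000, -58.4),
        (3000000, -58.6)]
print(production_average(data, nstart=1000000))  # averages the last 3 points
```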

You should visualise these files in xmgrace again to verify that the cumulative average looks good.

*Calculating deltaG(model) by Gaussian Quadrature*

We will now calculate the value of deltaG(model) by estimating the following integral using the data from our calculations above:

    deltaG(model) = Integral from 0 to 1 of <dV/dlambda>_lambda dlambda

We approximate the area under the DV/DL curve using Gaussian quadrature, which is significantly more accurate than a trapezoid approach:

    deltaG(model) ~= Sum over i of W_{i} * <DV/DL> at lambda_i

The following table summarizes the data we require: the final line of each of prod1.dat to prod5.dat, which is the cumulative average over the final 2,000,000 steps of each run. You should refer to page 155 of the Amber 9 manual for more information on the origin of the Lambda and W_{i} values:

Step    |        1 |        2 |        3 |        4 |        5 |
<DV/DL> | -57.8765 | -58.3322 | -59.6629 | -61.7819 | -62.2840 |
Lambda  |   0.0000 |   0.1127 |   0.5000 |   0.8873 |   1.0000 |
W_{i}   |  0.00000 |  0.27777 |  0.44444 |  0.27777 |  0.00000 |

Since the value of W_{i} is zero for lambda values of 0.0 and 1.0, these two steps don't contribute to the value of the integral. Thus we could actually have saved ourselves simulation time by not calculating the end points. However, it was good practice, and it would have made debugging much easier had we run into problems with the simulation. Thus the value of the integral above is:

deltaA = (-58.3322 * 0.27777) + (-59.6629 * 0.44444) + (-61.7819 * 0.27777) = -59.8807 kcal/mol
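The quadrature sum above, written out in Python as a check. The lambda weights are those from the Amber 9 manual and the <DV/DL> values are the converged averages from the table.

```python
# Gaussian quadrature estimate of deltaG(model) from the table data.

weights = [0.00000, 0.27777, 0.44444, 0.27777, 0.00000]
dvdl    = [-57.8765, -58.3322, -59.6629, -61.7819, -62.2840]

# Weighted sum over the five lambda points; the end points carry zero weight.
delta_g = sum(w * v for w, v in zip(weights, dvdl))
print(round(delta_g, 4))  # → -59.8807
```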

In this case deltaA = deltaG, so our value of deltaG(model) is -59.8807 kcal/mol. Unfortunately, since this is not a deltaG at constant temperature and pressure we cannot use it to calculate the pKa of our model directly. However, we can combine it with the protein simulation to obtain a deltadeltaG, from which we can calculate a delta pKa (pKa shift) for the residue in the protein matrix.

We are now ready to go to section 3 where we will build the input files for the AspH and Asp- thioredoxin protein.
