Case Study - Folding TRP Cage (Advanced analysis and clustering)

(Note: These tutorials are meant to provide illustrative examples of how to use the AMBER software suite to carry out simulations that can be run on a simple workstation in a reasonable period of time. They do not necessarily provide the optimal choice of parameters or methods for the particular application area.)
Copyright Ross Walker 2005

Case Study - Folding TRP Cage (Advanced analysis and clustering) - SECTION 6

By Ross Walker

Stage 6: Analysing the Results

We will start our analysis of the results by plotting temperature, total energy, kinetic energy and potential energy from the output file. Examining these properties will allow us to check that nothing strange happened during the MD. We can check that the temperature rose smoothly and equilibrated at 325 K. Similarly changes in the kinetic energy and potential energy should be smooth and should level off at equilibrated values. Any sudden spikes in the temperature or energy plots are indicative of a problem with our simulation protocol, bad starting structure, too long a time step, inappropriate parameters etc.

In order to extract the data from the output files we can use the following perl script that automates the process (process_mdout.perl). We can then process the output files in one go by listing them all on the command line as follows:

mkdir analysis
cd analysis
../process_mdout.perl ../heat1.out ../heat2.out ../heat3.out ../heat4.out ../heat5.out ../heat6.out ../heat7.out ../equil1.out ../equil2.out ../equil3.out ../equil4.out ../equil5.out ../equil6.out ../equil7.out ../equil8.out ../equil9.out ../equil10.out

Here is a tar file containing all of the files produced by process_mdout.perl - analysis.tar (2.3 Mb)

We can now use a graphing program, such as xmgr which we used in tutorial 1, to plot the temperature and energy data.

First the temperature:

Temperature

Temperature close up of heating stage

This looks fine, our temperature has equilibrated around 325 K as we hoped. The heating phase also shows a nice gradual increase in temperature, with no sudden jumps so it looks like things are okay here.

Next we look at the energy plots. We are most interested in the potential energy plot since we need to locate the structure with the lowest energy so that we can reproduce the RMSd plots they show in the paper. The structure with the lowest energy we will be labelling our folded state.

Total Energy (Black), Potential Energy (Red), Kinetic Energy (Green)

This looks good, the kinetic energy is stable as we'd expect since the temperature plot, which is directly proportional to the kinetic energy, was stable. The total and potential energy plots show an initial decrease and then stabilise implying that our system has folded up to a state more stable than the starting linear structure. By plotting just the potential energy we can see the decrease more clearly. We can also plot a running average over 10ps as they did in the paper which makes things even clearer. Note, in xmgr you can add a running average by going to Data->Transformations->Running averages.

Potential Energy (Red = running average over 10ps)

This is what they show as Figure 1(A) in the paper. Note our initial energy is higher than that shown in the paper. This is most likely due to differences in the generalised Born term since they used AMBER 6 while we used AMBER 8. AMBER 7 and 8 have slightly different Born radii to AMBER 6. It is also possible that we have not folded to the same structure as they do in the paper. Further analysis will be able to tell us more.

The energy initially increases very sharply, this is due to us heating the system but then it starts to decrease again as our system folds up. Overall we see a decrease of around 40 kcal/mol over the 50 ns. This is in line with what they state in paragraph 4 of the paper.

The next step is to locate the structure with the lowest energy, this we will call our folded state and we will use it to reproduce Figure 1 B from the paper. We can locate the lowest energy by looking at the summary.EPTOT file that process_mdout.perl produced. But first we need to strip off the first 50ps of the data as this represents the heating phase. Here is the file with this section removed. (summary_EPTOT_stripped.dat.gz). The following awk script will quickly find the lowest value in the file for us. It will print a value every time it finds a new minimum value, the last value printed will be the lowest energy found.

>gunzip summary_EPTOT_stripped.dat.gz
>cat summary_EPTOT_stripped.dat | awk '{if($2<min) {min=$2;print $1" "min}}'

51.000   -453.0616
52.000   -458.9413
53.000   -466.4404
57.000   -471.1558
61.000   -478.0281
71.000   -479.0216
72.000   -482.4780
74.000   -488.0607
122.000   -494.2667
188.000   -506.0259
224.000   -511.8178
510.000   -519.0695
1278.000   -521.5923
1292.000   -526.4179
3903.000   -528.1458
14896.000   -534.8027
16909.000   -537.3546
17745.000   -539.2624
17759.000   -539.3218
19120.000   -539.6930
19719.000   -539.9273
21316.000   -545.5322
26834.000   -545.9479
42505.000   -547.4913

NOTE: your answers may be different here if you did your own simulation as variations in machine precision mean you will have explored a slightly different region of phase space.

So, the lowest energy was -547.4913 kcal/mol and it occurred 42.505 ns into our simulation. We can find out where this is in our output files by using grep:

>grep 42505.000 *.out

equil9.out: NSTEP =  1227500   TIME(PS) =   42505.000  TEMP(K) =   367.02  PRESS =     0.0

So it occurred on step 1227500 of our 9th equilibration step. We wrote to the nc file every 500 steps (ntwx=500) during the equilibration so this should represent frame 2455 (1227500/500) of equil9.nc. Lets extract that structure using cpptraj and take a look at it. Here is the cpptraj script (extract_frame9_2455.trajin):

trajin equil9.nc 2455 2455
trajout lowest_energy_struct.pdb pdb

> $AMBERHOME/bin/cpptraj TC5b.prmtop < extract_frame9_2455.trajin > extract_frame9_2455.out

Input Files: extract_frame9_2455.trajin, equil9.nc, TC5b.prmtop
Output Files: extract_frame9_2455.out, lowest_energy_struct.pdb.2455

Check the output file to ensure there were no errors.

Now lets take a look at the pdb of this structure in VMD.

vmd lowest_energy_struct.pdb.2455

This looks pretty good, the system is well folded with the tryptophan in a cage structure vaguely similar to that shown in Figure 2 of the paper. At this point we could compare this structure to the crystal structure but we have to remember that when the paper was published the experimental structure had not been determined so we shall leave this step until last. First off we will try to generate figure 1B which shows the backbone RMSd between the trajectory structures and this lowest energy structure. Here is cpptraj script for this:

trajin equil1.nc
trajin equil2.nc
trajin equil3.nc
trajin equil4.nc
trajin equil5.nc
trajin equil6.nc
trajin equil7.nc
trajin equil8.nc
trajin equil9.nc
trajin equil10.nc
reference lowest_energy_struct.pdb.2455
rms reference out rmsd_to_lowest_energy_struct.dat @N,CA,C time 1.0

> $AMBERHOME/bin/cpptraj TC5b.prmtop < measure_rmsd.trajin > measure_rmsd.out

Input Files: measure_rmsd.trajin, TC5b.prmtop, lowest_energy_struct.pdb.2455, all nc files from table above.
Output Files: measure_rmsd.out, rmsd_to_lowest_energy_struct.dat

Check the output from cpptraj for errors. The only errors you should see are ones telling you that frame 5001 of each of the trajectories is corrupt. This is not really an error since there were only 5000 frames in each nc file anyway so we can safely ignore it.

Here is the plot of the rmsd:

Our rmsd plot shows significantly more fluctuation than that shown in the paper although we do see the formation of a plateau around 16 to 38 ns. However, this does not show the stability over 50 ns that they suggest in paragraph 5 of the paper. This might suggest many things. Chances are our trajectory has not folded up to the native structure but has instead folded to an alternative structure. This structure is fairly stable but obviously must unfold again after 38 ns. Indeed, if you watch equil8.nc, equil9.nc and equil10.nc in vmd you will see the non-alpha helix part of the structure keep unfolding and refolding as the system tries to locate a more stable structure. The difference we see from the paper may be due to the implicit solvent model we chose or due to some other minor difference in the parameters. What this data tells us is that most likely we haven't run our simulation for long enough since the end point of the trajectory shows that the system wants to unfold and fold up differently. The starting structure we have used may be slightly different than that used in the paper and so while their's folded quickly ours gets itself trapped in a high energy minimum.

It is also possible that the slightly different Born radii we have used (they used the AMBER 6 Born radii remember) may have delayed the folding process. There are a number of ways we could extend this simulation further and try and fold up correctly. We could simply run a longer trajectory and wait for the system to leave the high energy basin it is trapped in. This may take a long time. Alternatively we could try a simulated annealing approach where we heat the system up and then let it cool again. We could also try a replica exchange simulation where we have many replicas and exchange properties between them. This avoids the problems of getting trapped in high energy minima. For the record I have obtained from the authors the structure that was published in the paper and can confirm that it gives a lower energy than our lowest energy structure. As such we have just been unlucky that our simulation wasn't long enough for the system to fold completely. Feel free to try extending it yourself if you wish or try some of the alternative approaches I mention above. There is a lesson to be learnt here. You can never really run a simulation for "too" long. Just because the system hasn't folded in 50 ns doesn't mean it won't fold in the next 50 ns so be aware of this when you are considering what your results imply.

Even though our system does not seem to have reached a completely stable structure it has at least folded up to a structure that shows stability over more than 20ns. Since we don't have the time to keep trying different simulation protocols or extend our simulation we shall go ahead with what we have and try several different types of analysis that we can do on our 50 ns of trajectory data.

Lets start by comparing our lowest energy structure to the NMR models that are now available (1L2Y.pdb). Like all NMR structures this pdb consists of an ensemble of possible structures, in this case 38. Ideally we should average these structures and compare the average structure to our theoretical structure. However, in the interests of speed I will skip this step and we can just compare the first structure in the file to our lowest energy structure. We can align the two structures and measure the RMSD using VMD. This is acceptable because, as the image below shows, residues 3 to 19 of the NMR structures are well conserved and it is just the ends of the chain that differ largely between the 38 structures.

Hence we will do fits just to the structurally well conserved residues.

Note: I am using VMD v1.8.3 here - If you are using a different version your menus may be slightly different. Note you need v1.8.2 or later for the RMSD fitting to work.

Here is a pdb containing just the first NMR structure (nmr_struc_1.pdb).

Step 1 - load our theoretical structure

Step 2 - load the first NMR structure (Make sure you select New Molecule)

At the moment the structures won't line up correctly. This is because the coordinate frames are different.

In order to compare the structures we need to RMSD fit and align them. Fortunately VMD contains a tool that makes this very easy. You can access it by clicking Extensions->Analysis->RMSD Calculator:

We can now do the alignment. We will align the backbone's only so make sure the Backbone only box is highlighted. Also select residues 3 to 19 by typing residue 2 to 18 in the text box (REMEMBER VMD NUMBERS FROM ZERO). Then click Align to align the structures and then click RMSD to measure the rmsd between the two structures.

So we have a backbone RMSD of 3.72 angstroms. This is a far cry from the 1.4 angstroms they quote in the paper which means we have almost certainly not folded to the same structure. Here is the alignment as shown in VMD (after I have changed the graphical representation options to make it easier to compare. The NMR structure is shown in blue):

Indeed, the comparison looks very bad. The Tryptophan looks to be in a very different position at almost 90 degrees to the experimental position. However, this isn't the whole story. Here we fitted to residues 3 to 19 (2 to 18 in VMD speak.) What happens if we repeat the process and fit just the residues involved in the alpha helix, 3 to 9?

Open the RMSD fitting tool again and this time enter "residue 2 to 8" in the text box, then hit align and then RMSD:

We get a backbone RMSD for the alpha helix of just 0.7 angstroms. Indeed it fits very well. Things start to go wrong around residue 9 where the backbone twists the wrong way. After this the structures are very different. There is one thing you should note, however, take a close look at the ends of the chain in each of the two structures:


Theoretical Structure	NMR Structure

Clearly there is a difference in the hydrogen bonding at the ends of the protein chain. Our structure has at least two, maybe 3 hydrogen bonds that are acting to zip the ends of the structure closed. In the NMR structure on the other hand the hydrogen bonding is very different. So this may be one explanation as to why the structures are so different, perhaps these hydrogen bonds are what are keeping us trapped in the high energy minimum. Lets take a look then at how we can use AMBER's cpptraj to look at hydrogen bonds over the course of our trajectory. This may also shed some light on why we get an initial structure formed early on in our trajectory that stays stable for almost 15 ns and yet has no real secondary structure (no alpha helix). If you watch the trajectory you should be able to see this structure and also observe how around 15 to 16 ns it suddenly snaps open and allows the system to refold with the alpha helix in place. We will also look at how we can use the MMTSB toolset to do some cluster analysis on our trajectory. This will allow us to assign the structures over time to various families so that we can comment on their lifetimes.

6.1) Further Trajectory Analysis using Cpptraj

6.1.1) Converting NC's to Binpos
The first thing we will do is convert our compressed text format nc files into binpos format. This is a binary version of a trajectory file. We could do the subsequent steps easily enough with nc files but using binpos is significantly faster. At the same time we will concatenate our 10 nc files into a single binpos file. Here is the cpptraj script to do it:

nc_to_binpos.cpptraj

trajin equil1.nc
trajin equil2.nc
trajin equil3.nc
trajin equil4.nc
trajin equil5.nc
trajin equil6.nc
trajin equil7.nc
trajin equil8.nc
trajin equil9.nc
trajin equil10.nc

trajout equil_1-10.binpos binpos

>$AMBERHOME/bin/cpptraj TC5b.prmtop <nc_to_binpos.cpptraj >nc_to_binpos.out

If all worked you should now have a 174 mb equil_1-10.binpos file in your working directory.

6.1.2) Calculating an Average Structure
Next we shall calculate an average structure over our entire trajectory (excluding the heating part).

average_crd.cpptraj

trajin equil_1-10.binpos

average equil_average.pdb pdb

>$AMBERHOME/bin/cpptraj TC5b.prmtop <average_crd.cpptraj

This will create the file equil_average.pdb

This is horrendous... What does this tell us?

Remember an average is just that, an average of the coordinates visited during the trajectory. The system is rotating and translating so you cannot hope to learn anything from this average. It is not a physically meaningful structure. Remember this when you are averaging coordinates.

However, if we are cleverer about calculating our average we can obtain more meaningful results. First of all we can do a mass weighted backbone RMS fit of every frame to the first frame in the trajectory, this will remove the rotation and translation aspect. Secondly we can look at a specific part of a trajectory where the average structure may have more meaning. Lets try this by looking at the section between 28 and 35 ns. This is frames 28000 to 35000 of the trajectory file.

average_28-35_crd.cpptraj

trajin equil_1-10.binpos 28000 35000
rms first mass @C,CA,N
average equil_average_28-35.pdb pdb

>$AMBERHOME/bin/cpptraj TC5b.prmtop <average_28-35_crd.cpptraj

This will create the file equil_average_28-35.pdb

This is much better and much more revealing. Try opening it yourself in VMD and exploring the structure. Load the TC5b.prmtop file first before loading the pdb file in this way the bonding will be sensible. Notice how some of the residues are very small, notably some of the hydrogen bonds lengths are tiny. This suggests that these residues are very dynamic and rotate a lot during this section of the trajectory. If you rotate something around a bond your average coordinate will be a point on that centre of rotation. However, note that the Tryptophan looks almost perfect. This must hardly move over this section of the trajectory. Similarly the backbone is well formed here which indicates that the folded part of the structure stays well defined between 28 and 35 ns. This is in line with what the RMSD plot suggested.

6.1.3) Hydrogen Bonding Analysis
Next we shall look at how we can use cpptraj to monitor the presence of hydrogen bonds during our trajectory.

Here is the cpptraj script that we will use:

analyse_hbond.cpptraj

trajin equil_1-10.binpos
hbond :* avgout avg-hbt.dat printatomnum

The second line is the one that tells cpptraj the analysis we want:

hbond :* avgout avg-hbt.dat printatomnum

hbond means "do H bond analysis."

:* means "consider all residues in the analysis"

avgout avg-hbt.dat "tells cpptraj to print out the solute-solute hydrogen bond averages to file."

printatomnum "tells cpptraj to include atom numbers in the output."

>$AMBERHOME/bin/cpptraj TC5b.prmtop <analyse_hbond.cpptraj >analyse_hbond.out

This will create the file analyse_hbond.out and ave-hbt.out.

Before you look at the actual result always check to make sure your selections actually worked and gave you the number of atoms you were expecting.

If we are certain things ran as we expected we can take a look at the results. These are printed at the very end of the output file. In our case, the first 10 lines read:

#Acceptor                  DonorH            Donor   Frames         Frac      AvgDist       AvgAng
PRO_12@O_197         ARG_16@H_228     ARG_16@N_227    12143       0.2429       2.8708     159.4340
TYR_3@O_56            LEU_7@H_118      LEU_7@N_117    12115       0.2423       2.8869     159.8708
ASP_9@OD2_167     ARG_16@HH22_248   ARG_16@NH2_246    11670       0.2334       2.8165     155.2230
PRO_12@O_197         GLY_15@H_221     GLY_15@N_220    11563       0.2313       2.8890     153.5216
ASP_9@OD2_167     ARG_16@HH12_245   ARG_16@NH1_243    10416       0.2083       2.8227     153.4352
ARG_16@O_250        TRP_6@HE1_104    TRP_6@NE1_103     9557       0.1911       2.8488     158.4464
TRP_6@O_116          GLY_10@H_171     GLY_10@N_170     9340       0.1868       2.8801     155.7751
ASP_9@OD1_166     ARG_16@HH22_248   ARG_16@NH2_246     8261       0.1652       2.8113     155.4659
GLN_5@O_92            ASP_9@H_159      ASP_9@N_158     8187       0.1637       2.8864     157.7047
ASP_9@OD1_166     ARG_16@HH12_245   ARG_16@NH1_243     7139       0.1428       2.8223     153.2402

This table should be reasonably self explanatory. You can visualise the hydrogen bonds in VMD. Fire up VMD and load the TC5b.prmtop and equil_1-10.binpos. It will load quicker if you turn off the display of the molecule (double click the D by it's filename in "VMD Main") and also minimise VMD.

Then go to "Graphical Representations" and select the following:

Change to the first structure in the trajectory. Then go to the OpenGL display, press "2" to change to "bond label" mode. Then click the two pairs of H-O atoms so you get their distances labelled.

Now play through the trajectory and watch what happens. You may want to change the step size to 10 so that it steps through the trajectory at a reasonable speed. Here is an MPEG movie of the trajectory (TRPcage_h_bonds.mpg 14.5 MB). Notice how these two hydrogen bonds quickly form and remain formed for the first 15 ns of the simulation. These two bonds act to stop the alpha helix forming. Eventually they break and the alpha helix forms very rapidly. These two hydrogen bonds never reform after this point. The first major secondary structure motif to form is the alpha helix, this is in line with what they describe in the paper only they see helix formation much earlier than we do. In our simulation the formation of the two hydrogen bonds discussed above is sufficient to keep us trapped in a high energy minimum for many nano seconds, so delaying the overall folding.

6.1.4) Dihedral Analysis using Cpptraj
If you compare our theoretical folded structure with the NMR one an obvious discrepancy is in the position of the Tryptophan. In particular in the position of the tryptophan ring with respect to the protein backbone. Although we created our starting structure in what I believe is an identical method to how they did in the paper it is always possible that we did not create the same starting structure. While the final folded structure should be independent of the starting structure the timescale on which the folding occurs may not be independent. Unfortunately we don't have time in this tutorial to try running several simulations from different starting structures to test this. However, one thing we can do is take a look at the various backbone related dihedral angles involving the Tryptophan. From this information we may be able to make some predictions about whether the folding pathway is likely to be sensitive to the initial positioning of the Tryptophan.

We will use cpptraj to monitor the following 4 angles as a function time:

Here is a cpptraj script that will do this for us:

measure_trp_angles.cpptraj

trajin equil_1-10.binpos
dihedral trpphi :5@C :6@N :6@CA :6@C out trp_phi.dat
dihedral trppsi :6@N :6@CA :6@C :7@N out trp_psi.dat
dihedral trpchi1 :6@N :6@CA :6@CB :6@CG out trp_chi1.dat
dihedral trpchi2 :6@CA :6@CB :6@CG :6@CD2 out trp_chi2.dat

>$AMBERHOME/bin/cpptraj TC5b.prmtop <measure_trp_angles.cpptraj

This should produce the following four output files: trp_phi.dat, trp_psi.dat, trp_chi1.dat, trp_chi2.dat

You may need to go through the files and manually adjust any sections where the angle has jumped from -180 to +180 so that it looks clean when plotted, I won't go into details on how to do this here but suffice to say you can do it in xmgrace with some code like:

dihedral cleanup was done using transformations: evaluate expression:
y = (y < -100) ? y=y+360 : y

Here is a plot of the four angles:

This plot is very informative. Notice how the phi angle stays more or less constant during the entire simulation. Secondly notice how psi starts out around 160 degrees and then drops sharply to around -50 degress after about 12.5 ns. This corresponds to the point where the alpha helix starts to form. Hence we may get very different behaviour if we start our simulation with a structure where the values of the phi and psi angles are reversed. Perhaps this is what they had in the paper? If you have the inclination, and the computer time available to you, you could try this and see what results you obtain.

The chi1 and chi2 angles are also very informative. Here is the RMSD plot again, compare it to the chi1 and chi2 plots above:

Notice how the chi1 and chi2 angle changes appear to be correlated with the changes in RMSD. This suggests that changes in the Tryptophan structure are central to the stability of each of the structures sampled by the TRPcage peptide. This correlation could be investigated further by calculating a cross correlation between the RMSD and chi1 and chi2 angles. Perhaps you could try this out if you are interested. Numerical recipes contains example code for calculating correlation functions.

6.2) Cluster Analysis using the MMTSB Toolset

The final type of analysis we will do in this tutorial is called cluster analysis. Here we will use an algorithm that goes through the trajectory and locates clusters of structures that are the same. From this it creates centroids describing each cluster and then gives an RMSD for each structure in the trajectory with respect to each identified cluster.

In order to complete this section of the tutorial you will require the MMTSB toolset. This is not part of AMBER but can be downloaded free of charge from http://feig.bch.msu.edu/mmtsb/Main_Page. If you are running this tutorial at an AMBER workshop then the MMTSB toolset should already be installed on your machine. You can check by seeing if the environment variable MMTSBDIR is set. E.g.

>echo $MMTSBDIR

If it isn't set then you probably need to install the MMTSB toolset. Refer to the documentation on the website. Essentially you should untar the toolset to somewhere like /usr/local/mmtsb, set MMTSBDIR to this directory and then add $MMTSBDIR/bin and $MMTSBDIR/perl to your path. Once that is done we can continue.

WARNING: Do not try this clustering over a remote filesystem or you will be waiting a very long time!!! If your current working directory is on an NFS share I highly suggest that you copy all of the files you'll need to a local scratch directory and then copy the results back when you are done. Also, avoid doing an 'ls' in the directory containing the pdb files or you'll end up waiting forever for the list to scroll past.

Step 1 - Prepare the directories

The clustering process creates a large number of files, namely a pdb file for every frame in the trajectory. As such we don't want to 'pollute' our working directory with lots of files so we shall create some sub directories to do the processing in:

>mkdir clustering
>cd clustering
>mkdir PDBfit

Step 2 - Use cpptraj to extract RMSD'd PDB's from the trajectory

extract_pdb.cpptraj

trajin ../equil_1-10.binpos

#-- put all the pdb frames in a subdirectory
trajout PDBfit/trp.pdb pdb

#orient all frames best fit to backbone of helix in NMR structure
rms first mass :2-20@CA,C,N

>$AMBERHOME/bin/cpptraj ../TC5b.prmtop <extract_pdb.cpptraj

This will give you 50,000 pdb files in the PDBfit directory. Each one representing a single snapshot from the trajectory that has been RMSD fitted to residues 2 to 20 of the first structure of the NMR data.

Step 3 - Adjust numbering of PDB files

This step is not strictly necessary but it will make it easier to locate specific pdb files later on.

Cpptraj does not write leading zeros on the pdb names which means when you sort the directory it is listed as:

1
10
100
...

Rather than numerically. The following c shell script will rename the files with leading zeros:

fix_numbering_pdb.csh

#! /bin/csh
set ff = *.pdb.1
set fnam = $ff:r
set numfil = `ls -1 *.pdb.???? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.????)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.0$fnum
     echo $fnam $fr.0$fnum
  end
endif
set numfil = `ls -1 *.pdb.??? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.???)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.00$fnum
     echo $fnam $fr.00$fnum
  end
endif
set numfil = `ls -1 *.pdb.?? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.??)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.000$fnum
     echo $fnam $fr.000$fnum
  end
endif
set numfil = `ls -1 *.pdb.? | wc -l`
if( $numfil != 0)then
  foreach fnam (*.pdb.?)
     set fr=$fnam:r
     set fnum=$fnam:e
     mv $fnam $fr.0000$fnum
     echo $fnam $fr.0000$fnum
  end
endif

chmod +x fix_numbering_pdb.csh
./fix_numbering_pdb.csh

Step 4 - Do the clustering

To do the clustering we will use the mmtsb tool kclust. This tool requires a file with the list of structures to cluster. In this case we will call the file clustfils. We will fill it with all of the structures from our trajectory. If you only wanted to do the cluster analysis on a portion of the trajectory you could of course only include the structures you are interested in within this file. Lets create the file:

>cd PDBfit
>ls -1 . > ../clustfils

(ls -1 means single column)

>vi ../clustfils

Remove first line since kclust can only handle < 50000 structures. We will fit structures 2 to 50,000. We need to remember that we did this as it means our pdbs actually start counting from two. This means that the ID's in the centroid files will be off by one. You will see what I mean later. Here is the clustfils file.

Now we can run the clustering:

>kclust -mode rmsd -centroid -cdist -heavy -lsqfit \
-radius 6 -maxerr 1 -iterate \
../clustfils > ../Centroid_6
>cd ..

The clustering command should take around 3 to 5 minutes to run. However, it is very very memory intensive and may take considerably longer on your machine if you do not have much memory available. In fact, if you run out of memory and do not have enough swap space, the analysis will not work at all. It seems to take a bit less than 1 Gb. On a P4/3GHz/512 Gb RAM, the machine pages nearly unceasingly, the entire run taking 10 minutes. If your machine is paging (watch a monitor like xosview), be sure to do NOTHING else on the machine while the job runs, even moving the mouse around. This will save the machine from having to page your application in and out of swap at the same time as running and paging your clustering job.

The main output from this analysis so far is the file Centroid_6 (Centroid_6.gz 478kB, don't forget to unzip this file if you download it.) which contains:
- the centroids in pdb format
- the members of each centroid
- the rmsd distance of each member from its centroid.

Take a quick look at the Centroid_6 file in a text editor so you can see what the format looks like. Note the offset in the ID and pdb numbering:
97 trp.pdb.00098 5.221610
ID 97 is actually pdb file number 98. Be careful of this.

Step 5 - Process the Clusters

The centroids themselves are not physically meaningful since they are effectively a mathematical construct based on the member of a cluster. However, the structure that has the lowest RMSD to each centroid is meaningful and much easier to understand. In this way we can look at the structure that is closest to the centroid that represents each of the clusters found.

Here is an awk script we will use to process the Centroid_6 file:

extract_centroids.awk

BEGIN{b0=2;}

{centind=index($1,"#Centroid");
   c=$2;
   getline;centind=index($0,"#Centroid");
   FIL0 = sprintf("centroid%2.2d.member.dat",c)
   while(centind != 1){
     print $1,$3 > FIL0 ;
     getline;centind=index($0,"#Centroid");
   }

   numcent=0;    print $2,$1,NR-b0; 
   c=$2;
   getline; endrec = index("End",$0);
   while( endrec != 1 ){
     FIL = sprintf("centroid%2.2d.pdb",c);
     print > FIL;
     getline; endrec = index($0,"#End");
   }
   b0=NR+2;
 }

>awk -f extract_centroids.awk Centroid_6 | tee Centroid_6.stats

Note: The tee command allows us to pipe the output to a file Centroid_6.stats and to standard out (the screen) at the same time.

This command produces a pdb file for each centroid, a member file for each centroid and a stats file with the number of members in each centroid. You can download all these files as a tar archive here: centroid_data.tar.gz (326 kB)

At this point you should see that we have a total of 6 clusters.

1 #Centroid 11732
2 #Centroid 6179
3 #Centroid 26772
4 #Centroid 2735
5 #Centroid 98
6 #Centroid 2483

Two of them (1 & 3) are highly populated, three are marginally populated (2,4 & 6) and one is probably insignificant (5).

Individual pdb (centroid??.pdb) and member files (centroid??.member.dat) for each centroid are produced. View the pdb files in VMD but keep in mind that these are much like an average structure and do not represent a physically meaningful configuration, they are only the centroid of the cluster.

You can find the structure that is nearest the cluster centre (centroid), so you can see a more physically relevant structure of the centroid. Sort the member file numerically on the second column (rms distance from the centroid).

>sort -n -k2 centroid03.member.dat | head -5

21340 1.331332
18240 1.346612
18275 1.352564
34161 1.362107
17425 1.370057

So structure ID 21340 has the lowest RMSD to centroid3. So we will copy this to make our bestmember for this centroid. Remember though that our ID and pdb file are offset by one since we deleted the first pdb file from the data set. Hence structure ID 21340 refers to pdb file trp.pdb.21341:

>cp PDBfit/trp.pdb.21341 centroid03_bestmember.pdb

Now view this best member pdb analyzing for features of interest such as the results of the H-bond analysis.

Repeat above for the other clusters. Here is a tar file containing the 6 best members: best_members.tar.gz

Step 6 - Plot the RMS distances from the centroids

The member files for each of the centroids can be read into xmgrace or xmgr and plotted. Some work is required to cleanup the plot to make it easy to read. This can be done by loading each member data set in the following order:

(the order of most populated):
centroid03.member.dat
centroid01.member.dat
centroid02.member.dat
centroid04.member.dat
centroid06.member.dat
centroid05.member.dat

In xmgrace you can then do (xmgr menus may differ slightly):

Bring up Set Appearance window from xmgrace Plot menu
Select all sets
set Line properties Type to None
set Symbol properties to Type circle and adjust Size to 16
click Apply (at bottom)

Select each set, and change the Symbol colors as follows (press Apply
after selecting each color):

G0.S0 black (leave it alone, it is already black)
G0.S1 red
G0.S2 blue
G0.S3 green
G0.S4 yellow
G0.S5 orange

You should end up with a plot like this:

Some significant patterns appear in this clustering analysis. Note that cluster 6 is only at the very start of the simulation and can be ignored from now on. Note the progression of one cluster to another, where blue is the only transition structure to black except a short transition via green. The initial meta-stable cluster is the red cluster and the transition to the dominant cluster is through the clusters green, yellow, and blue.

From here one can do many other clusterings by varying the clustering parameters and the range of structures over which the clustering is done. Once a region of the trajectory is selected as interesting, clustering can be done on just that region by creating a list of pdb files (clustfils) contained in that region.

Secondly, one can change the radius for clustering, smaller radii produce more (smaller) clusters, often unmanageable and insignificantly different from each other. However, without some trial and error, one does not know what a productive and informative value of the radius should be.

Thirdly, one can explore the clustering of pdb files that have been aligned in different ways. Our alignment was to residues 2 to 20 of the first structure. One could try aligning to another part of the structure, or another structure such as the lowest energy or even one of the closest members of the first clustering analysis.

One more analysis of the clusters produced above is to process the clusters through cpptraj to find out other information about the clusters. The simplest analysis is rmsd to some structure such as the centroid of the cluster, but we probably will not gain any new information since the clustering produced that information anyway.

The best approach would be to create trajectories containing only the members of each cluster and doing analysis on each trajectory such as rms, and hbond. I leave this step to the more enthusiastic tutee.

CLICK HERE TO GO TO SECTION 7