The premise of bootstrapping error analysis is fairly straightforward. For a time series containing N points, choose a set of N points at random, allowing duplication. Compute the average from this ``fake'' data set. Repeat this procedure a number of times and compute the standard deviation of the averages of the ``fake'' data sets. This standard deviation is an estimate for the statistical uncertainty of the average computed using the real data. What this technique really measures is the heterogeneity of the data set, relative to the number of points present. For a large enough number of points, the average computed using the faked data will be very close to the average computed with the real data, with the result that the standard deviation will be low. If you have relatively few points, the deviation will be high. The technique is quite robust and easy to implement; accounting for time correlations in the data takes a little extra care, as discussed below. Numerical Recipes has a good discussion of the basic logic of this technique. For a more detailed discussion, see ``An Introduction to the Bootstrap'', by Efron and Tibshirani (Chapman and Hall/CRC, 1994). Please note: bootstrapping can only characterize the data you have. If your data is missing contributions from important regions of phase space, bootstrapping will not help you figure this out.
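To make the basic procedure concrete, here is a minimal Python sketch of bootstrapping the uncertainty in the average of a single time series. It is an illustration only, not part of the wham code; the function name and the use of NumPy are choices made for this example.

```python
import numpy as np

def bootstrap_average_error(data, num_trials=1000, rng=None):
    """Estimate the statistical uncertainty of the average of `data` by
    resampling N points with replacement and looking at the spread of the
    resulting fake averages."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    n = len(data)
    # Average of each "fake" data set, drawn with duplication allowed
    fake_averages = [rng.choice(data, size=n, replace=True).mean()
                     for _ in range(num_trials)]
    # The standard deviation of the fake averages estimates the uncertainty
    # in the average computed from the real data
    return np.std(fake_averages)
```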
In principle, the standard bootstrap technique could be applied directly to WHAM calculations. One could generate a fake data set for each time series, perform WHAM iterations, and repeat the calculation many times. However, this would be inefficient, since it would either involve a) generating many time series in the file system, or b) storing the time series in memory. Neither of these strategies is particularly satisfying, the former because it involves generating a large number of files and the latter because it would consume very large amounts of memory. My implementation of WHAM is very memory efficient because not only does it not store the time series, it doesn't even store the whole histogram of that time series, but rather just the nonzero portion.
However, there is a more efficient alternative. The principle behind bootstrapping is that you are trying to establish the variance of averages calculated with N points sampled from the true distribution function, using your current N points of data as an estimate of that true distribution. The histogram of each time series is precisely that: an estimate of the probability distribution. So, all we have to do is pick random numbers from the distribution defined by that histogram. Once again, Numerical Recipes shows us how to do it: we compute the normalized cumulant function $C(\xi) = \sum_{\xi' \le \xi} P(\xi')$, generate a random number $r$ uniformly distributed between 0 and 1, and solve $C(\xi) = r$ for $\xi$. Thus, a single Monte Carlo trial is computed in the following manner: for each time series used in the calculation, generate a fake histogram by drawing N points from the cumulant of its real histogram; perform WHAM iterations on the set of fake histograms to produce a new estimate of the probability distribution; and accumulate the probability and the squared probability of each bin.
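The resampling step described above can be sketched in a few lines of Python; this fragment is an illustration rather than the code's actual implementation, and the function and argument names are invented for the example.

```python
import numpy as np

def sample_from_histogram(bin_values, counts, num_points, rng=None):
    """Draw num_points fake data points from the distribution defined by a
    histogram, by inverting its normalized cumulant."""
    rng = np.random.default_rng() if rng is None else rng
    prob = np.asarray(counts, dtype=float)
    prob /= prob.sum()                      # normalize the histogram
    cumulant = np.cumsum(prob)              # C(xi): running sum of P up to bin i
    r = rng.random(num_points)              # uniform random numbers in [0, 1)
    indices = np.searchsorted(cumulant, r)  # solve C(xi) = r for the bin index
    return np.asarray(bin_values)[indices]
```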
After a number of Monte Carlo trials, the standard deviation of the probability in each bin is computed from the stored probabilities and squared probabilities. This standard deviation is the statistical uncertainty in the probability distribution. The uncertainty in the free energy is then estimated as

$$\sigma_{F(\xi)} = k_B T \,\frac{\sigma_{P(\xi)}}{P(\xi)} \qquad\qquad (3)$$

where $F(\xi)$ is the free energy at $\xi$, $P(\xi)$ is the probability, and $\sigma$ connotes the standard deviation. This indirect computation of the
uncertainty in the free energy is convenient because the potential of mean
force is only known up to a constant, and thus the proper alignment of the
PMFs computed from the faked data sets is unknown. The probabilities, by
contrast, have no such ambiguity. This procedure assumes that the
fluctuations in the probabilities are roughly Gaussian, and in principle the
best way to do it would be to store the PMF generated by each Monte Carlo
trial, optimally align them, and then compute the standard deviations
directly. However, I've played with varying the correlation times (more on
this later), and it looks like the Gaussian-like assumption usually holds, at
least for test data.
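Under those assumptions, the accumulation over trials and the conversion from probability uncertainty to free energy uncertainty via equation (3) can be sketched as follows; `run_wham_trial` is a hypothetical stand-in for one complete fake-data WHAM calculation, not a function in the actual code.

```python
import numpy as np

def bootstrap_free_energy_error(run_wham_trial, num_bins, num_trials, kT):
    """Accumulate P and P^2 over Monte Carlo trials, then convert the
    standard deviation of P into a free energy uncertainty."""
    sum_p = np.zeros(num_bins)
    sum_p2 = np.zeros(num_bins)
    for _ in range(num_trials):
        p = run_wham_trial()                  # probability for each bin, one trial
        sum_p += p
        sum_p2 += p * p
    mean_p = sum_p / num_trials
    var_p = sum_p2 / num_trials - mean_p ** 2   # <P^2> - <P>^2
    sigma_p = np.sqrt(np.maximum(var_p, 0.0))
    # Equation (3): sigma_F = kT * sigma_P / P
    return kT * sigma_p / mean_p
```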
The situation is slightly more complicated when one attempts to apply the bootstrap procedure in two dimensions, because the cumulant is not uniquely defined. My approach is to flatten the two-dimensional histogram into a one-dimensional distribution and take the cumulant of that. As long as I maintain a unique mapping between the 1D cumulant and the 2D histogram, there is no difficulty (I think...). The rest of the procedure is the same as in the 1D case.
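A minimal sketch of that flattening, assuming the 2D histogram is stored as a NumPy array (the function name is invented for the example):

```python
import numpy as np

def sample_from_2d_histogram(counts_2d, num_points, rng=None):
    """Draw fake points from a 2D histogram by flattening it to 1D, inverting
    the 1D cumulant, and mapping the chosen bins back to 2D indices."""
    rng = np.random.default_rng() if rng is None else rng
    flat = counts_2d.ravel().astype(float)
    flat /= flat.sum()
    cumulant = np.cumsum(flat)
    r = rng.random(num_points)
    flat_indices = np.searchsorted(cumulant, r)
    # unravel_index preserves the unique mapping back to (row, column) bins
    return np.unravel_index(flat_indices, counts_2d.shape)
```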
There is one major caveat throughout all of this analysis: thus far, we have assumed that the correlation time of each time series is shorter than the interval between snapshots. To put it another way, we've assumed that all of the data points are statistically independent. However, this is unlikely to be the case in a typical molecular dynamics setting; the sample size used in the Monte Carlo bootstrapping procedure is then too large, which in turn causes the bootstrapping procedure to underestimate the statistical uncertainty.
My code deals with this by allowing you to set the correlation time for each time series used in the analysis, in effect reducing the number of points used in generating the fake data sets (see the file format section). For instance, if a time series had 1000 points, and you determined by other means that the correlation time was 10x the time interval for the time series, then you would set ``correl time'' to 10, and each fake data set would have 100 points instead of 1000. If the value is unset or is greater than the number of data points, then the full number of data points is used. Please note that the actual time values in the time series are not used in any way in this analysis; for purposes of specifying the correlation time, the interval between consecutive points is always considered to be 1.
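The bookkeeping amounts to something like the following sketch (illustrative names, not the code's actual variables):

```python
def effective_sample_size(num_points, correl_time=None):
    """Number of points drawn per fake data set, with the correlation time
    given in units of the interval between consecutive points."""
    if correl_time is None or correl_time > num_points:
        return num_points
    # e.g. 1000 points with a correlation time of 10 -> 100 points per fake set
    return max(1, int(num_points / correl_time))
```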
The question of how to determine the correlation time is in some sense beyond
the scope of this document. In principle, one could simply compute the
autocorrelation function for each time series; if the autocorrelation is well approximated by a single exponential, then 2x the decay time (the time it takes the autocorrelation to drop to $1/e$ of its initial value) would be a good choice. If it's multiexponential, then you'd use the longest time constant. However, be careful: you really want to use the longest correlation time sampled in the trajectory, and the reaction coordinate may fluctuate rapidly while still being coupled to slower modes.
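For what it's worth, here is a rough Python sketch of that estimate, assuming a reasonably long time series with a roughly single-exponential autocorrelation; it is an illustration only, not part of wham itself.

```python
import numpy as np

def estimate_correl_time(series):
    """Estimate a correlation time (in units of the snapshot interval) as
    twice the lag at which the normalized autocorrelation first drops below
    1/e.  Only sensible if the decay is roughly single-exponential."""
    x = np.asarray(series, dtype=float)
    x -= x.mean()
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]   # lags 0, 1, 2, ...
    acf /= acf[0]                                        # normalize so acf[0] == 1
    below = np.nonzero(acf < 1.0 / np.e)[0]
    decay_time = below[0] if len(below) else len(acf)    # first lag below 1/e
    return 2 * decay_time
```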
It is important to note that the present version of the code uses the correlation times only for the error analysis and not for the actual PMF calculation. This isn't likely to be an issue, as the raw PMFs aren't that sensitive to the correlation times unless they vary by factors of 10 or more.