SANE - Structure Assisted NOE Evaluation
SANE is a perl program which generates restraints from crosspeaks in NOESY
spectra. It works with crosspeak lists from both Felix and NMRView and
is able to analyse 2D, 3D and 4D NOESY spectra. To reduce the chemical
shift ambiguity it uses existing assignments, the average distance in an
ensemble of structures, the secondary structure and relative NOE contributions.
Any combination of these filters can be used.
The program has been described in: "SANE (Structure Assisted NOE Evaluation):
an automated model-based approach for NOE assignment." BM Duggan,
GB Legge, HJ Dyson and PE Wright, J Biomol NMR, 2001, 19(4),
321-9.
Running SANE
To run SANE make sure your PATH variable includes the directory containing
the SANE perl script and type sane sane.par, where sane.par
is a parameter file containing a variety of information such as chemical
shift tolerances, cut-offs for the filtering routines and the names and
locations of files. Some example parameter files are provided with the
code (an NMRView 3D parameter file and Felix 2D, 3D and 4D parameter files).
The parameters named in this document are those used by SANE and defined
in the parameter file. The files required by SANE are:
-
A crosspeak list, defined by xpk_file. It
can be a points format list from Felix, or the
standard ppm format file from NMRView.
-
An assignment list, defined by ass_file. For
Felix data create a file with one line per assignment
with each line containing residue number, residue name, atom name and chemical
shift. For NMRView data the standard ppm.out file is used.
-
A MAP file, defined by map_file.
This file contains a list of the names used for the crosspeaks and the
names of their equivalent atoms in the PDB files. This file allows any
sort of nomenclature to be used. This file is also used to generate Amber
format restraints from UPL files.
NOTE Your life will be much simpler if you decide on a nomenclature
to distinguish stereospecifically assigned non-degenerate geminal protons
from non-stereospecifically assigned non-degenerate geminal protons and
degenerate geminal protons. I used HB2/3 for stereospecifically assigned
non-degenerate beta protons, HB+/- for non-stereospecifically assigned
non-degenerate beta protons and QB for degenerate protons and extended
this system for other pro-chiral centres.
-
A list of standard chemical shifts,
defined by standard_shifts. This is a list
of the chemical shifts reported in the BioMagRes Bank and is provided with
the code. If there are unassigned resonances then the mean chemical shift
and the standard deviation specified in this file can be used to include
the unassigned resonances as possibilities. To use this option the flag
using_unassigned must be set.
-
An ensemble of PDB structures, defined by ensemble.
An ensemble is not strictly required. If there are no structures listed
then distance and contribution filtering will not be performed. SANE expects
the proton atom names in the PDB files to start with an H, e.g. HB2. If
they start with a number, e.g. 2HB, then SANE will not produce correct
results. Nomenclature can be converted using the script dyana2leap.
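For Felix data, the assignment list described above is simple enough to check before a run. The following Python sketch is not part of SANE (which is written in Perl); it reads lines of residue number, residue name, atom name and chemical shift. The whitespace separation, the function name and the sample values are assumptions made for illustration only.

```python
def parse_assignments(text):
    """Parse lines of: residue number, residue name, atom name, chemical shift."""
    assignments = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 fields, got {len(fields)}"
            )
        residue_number, residue_name, atom_name, shift = fields
        assignments.append(
            (int(residue_number), residue_name, atom_name, float(shift))
        )
    return assignments


# Hypothetical example data; -999.99 marks an unassigned resonance.
example = """
12 ALA HN   8.21
12 ALA HA   4.35
13 SER HB+ -999.99
"""
for entry in parse_assignments(example):
    print(entry)
```

A line with the wrong number of fields raises an error that names the offending line, which is usually quicker than tracing a problem back from the OUT file.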
The nature of the data is specified by data_type
being either "Felix" or "NMRView". If using Felix data then SANE also requires
a volume file, defined by vol_file. If using
NMRView data then SANE requires a sequence file specified by seq_file.
To account for folded peaks SANE requires information about the spectrum.
For NMRView data the user must specify the upper and lower chemical shifts
of each dimension, which can be obtained from the Attributes window. For
Felix data the user must specify the frequency, spectral_width,
reference_point and reference_ppm for each dimension. This information is
also used for the conversion of points to ppm necessary with Felix data.
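The conversion of points to ppm can be sketched as follows. This is an illustration in Python, not SANE's own code, and it rests on several assumptions that the text above does not state: the frequency is in MHz, the spectral width is in Hz, ppm decreases as the point number increases, and the number of points in the dimension (called size here, a hypothetical name) is known.

```python
def point_to_ppm(point, size, frequency, spectral_width,
                 reference_point, reference_ppm):
    """Convert a point position to ppm for one dimension.

    Assumes frequency in MHz, spectral_width in Hz and a ppm scale
    that decreases with increasing point number.
    """
    hz_per_point = spectral_width / size
    offset_hz = (reference_point - point) * hz_per_point
    return reference_ppm + offset_hz / frequency


# Hypothetical proton dimension: 1024 points, 600 MHz, 6000 Hz wide,
# referenced so that point 512 lies at 4.75 ppm.
for point in (0, 512, 1024):
    print(point, point_to_ppm(point, 1024, 600.0, 6000.0, 512, 4.75))
```

If the converted shifts come out mirrored about the reference, the sign convention assumed here is the first thing to check.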
Output
SANE always creates three output files; an OUT file, a UPL file and an
ambig MAP file. It can optionally create a new XPK file and an XPK file
containing only the crosspeaks for which restraints were not written.
-
The OUT file (out_file)
contains information about every peak examined. The peak number, chemical
shifts, volume, distance bin and assignments are printed first. After each
filtering step it prints out the possible assignments and the difference
between their chemical shifts and the peak position. After the distance
filtering step it also prints the shortest and average distance in the
ensemble for each possibility. After contribution filtering SANE also prints
the relative contribution of each possibility. At the end of each peak
entry SANE prints out the restraint it will write to the UPL file and any
messages about the filtering steps.
-
The UPL file (upl_file)
contains the generated restraints. Each restraint has a comment containing
the peak number, the experiment number and the nature of the restraint.
The restraint can be Preassigned (the crosspeak was completely assigned
before starting SANE), Unique (SANE found a unique assignment), Single
Ambig (one side of the restraint is ambiguous) or Double Ambig (both sides
of the restraint are ambiguous). If the restraint is Preassigned or Unique
then the shortest and average distances in the ensemble are also printed
in the comment. Restraints will only be labelled Preassigned if using_assignments
is set to "true". The UPL file, with the ambig MAP file, is used for the
generation of Amber restraints. If the ambiguous restraints are removed
from the UPL file then it can also be used for the creation of DYANA restraints.
-
The ambig MAP file (ambig_map)
defines the groups of atoms used in the ambiguous restraints. Unique
labels for each ambiguous restraint are created using the expt_flag,
the dimension number and the crosspeak number.
-
A new XPK file, containing the new assignments found by SANE, is created
if new_xpk is defined. If this variable is
not defined then the new XPK file will not be written. The program will
require less memory and may run quicker if this variable is left undefined.
Assignments are updated for every dimension in which there is only one
assignment in the final list of possibilities. This means that dimensions
can be assigned even though a unique possibility was not found.
-
An unassigned XPK file, containing only the crosspeaks for which restraints
were not written is created if unass_xpk is
defined. This allows you to run SANE again with different parameters on
troublesome peaks. Like the new XPK file, the unassigned file will not
be written if unass_xpk is not defined and
this may allow the program to run faster. Assignments are also updated
in this file.
Procedure
SANE follows much the same procedure one would use manually. The following
filters can be used and at the moment they will be applied in the order
in which they are described below.
-
Chemical shift is used to determine all possible assignments for each dimension.
A +/- tolerance is specified for each dimension.
If using_unassigned is set to "true", and
your assignment list has some atoms with chemical shifts set to -999.99
or less, then the mean shift of that atom in the file standard_shifts
will be used as the chemical shift and the tolerance will be set to the
standard deviation in standard_shifts multiplied
by number_BMRB_stddevs. Once a list of possible
assignments for each dimension of the crosspeak has been determined all
combinations of those assignments are checked for consistency with the
type of experiment, specified by expt_flag.
-
Distance filtering is applied to each possibility if the using_distances
flag is set. If the average distance for a particular possibility, calculated
over the ensemble of structures, is greater than the user specified cutoff
(distance_cutoff) then that possibility is
discarded. Typical values for distance_cutoff
range from 25 to 5 angstroms.
-
Existing assignments are used to reduce the number of possibilities if
the using_assignments flag is set. If the
existing assignments are not amongst the possibilities then a warning is
printed and the assignments are ignored. The only reasons for getting this
warning are: your assignment list does not include the crosspeak assignment
you made, your chemical shift tolerances are too tight, or you did distance
filtering and your distance cutoff is too tight for the structures you're
using.
-
Secondary structure is used to eliminate possibilities if the using_secondary
flag is defined. The locations of the secondary structure elements are
defined like this,
$sse{'bstrand1'} = "10-13";
$sse{'bstrand2'} = "49-52";
$sse{'ahelix1'} = "3-16";
$sse{'ahelix2'} = "47-54";
and a list of NOEs that one should not observe in strands, helices and
throughout all regions of the protein is defined like this
$exclude{'strand'} = "aN2,aN3,aN4,NN2,ab3";
$exclude{'helix'} = "aN5+,Na2+,NN5+";
$exclude{'all'} = "ab1,ba1,aa1,Nb1,bN3";
The 'aN2' in the exclude strand list indicates that possibilities that
involve a Halpha proton and a HN proton two residues towards the carboxyl
terminus, and both protons are in the same strand, will be discarded. The
'aN5+' in the exclude helix list indicates that possibilities involving
a Halpha proton and a HN proton separated by 5 or more residues, and in
the same helix, will be discarded. NOEs with a trailing '+' are expanded
out to the length of the longest secondary structure element (strand or
helix as appropriate). A '+' cannot be used in the exclude all list. The
expanded list of NOEs to discard is written out at the start of the OUT
file.
This filter is probably most useful at the
start of structure calculations when attempting to obtain a global fold.
It should probably not be used towards the end of the refinement.
-
Relative NOE contributions are calculated if the using_contributions
flag is defined. For each possibility the shortest distance in the ensemble
is used to calculate the relative contribution to the observed NOESY crosspeak.
The contributions are then ranked from largest to smallest and summed until
a user specified cutoff (contribution_cutoff)
is exceeded. Possibilities occurring after the cutoff has been exceeded
are discarded. Typical values for contribution_cutoff
range from 1.00 to 0.80.
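The ranking and summing used by the contribution filter can be sketched in Python as below. This is not SANE's own code. The sketch assumes that the relative contribution of each possibility is proportional to the inverse sixth power of its shortest distance, which is the usual NOE distance dependence but is not stated above; the possibility labels and distances are hypothetical.

```python
def contribution_filter(shortest_distances, contribution_cutoff):
    """Keep the largest contributors until their sum exceeds the cutoff.

    shortest_distances maps a possibility label to the shortest
    distance (in angstroms) found for it in the ensemble.
    """
    # Assumed r^-6 weighting, normalised so the contributions sum to 1.
    weights = {name: d ** -6 for name, d in shortest_distances.items()}
    total = sum(weights.values())

    kept, running_sum = [], 0.0
    for name in sorted(weights, key=weights.get, reverse=True):
        kept.append(name)
        running_sum += weights[name] / total
        if running_sum > contribution_cutoff:
            break  # everything ranked after this point is discarded
    return kept


# Hypothetical possibilities for one crosspeak.
distances = {"A": 2.5, "B": 3.0, "C": 6.0}
print(contribution_filter(distances, 0.90))
```

With these numbers the two shortest possibilities already account for more than 90% of the summed contribution, so the third is discarded.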
The entire list of possibilities is written out after the chemical shift
filtering. After each filtering step a message and the reduced list
of possibilities is printed if one or more possibilities have been eliminated.
If a filtering step does not eliminate any possibilities then you won't
see any output from that step. After all the filtering has been performed, if
more than one possibility remains then an ambiguous restraint is written. If
there is only one possibility then a unique restraint is written, and if
there are no possibilities then a message to that effect is printed.
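The expansion of a trailing '+' in the secondary structure filter can be illustrated with a short Python sketch (again, not SANE's own Perl). It takes element definitions and exclude strings in the form shown earlier and expands each '+' entry up to the length of the longest element. Whether SANE stops at the length or at the length minus one is not stated above, so the inclusive upper limit here is an assumption, as are the function names.

```python
import re


def element_length(residue_range):
    """Length in residues of a range written as 'first-last'."""
    first, last = (int(value) for value in residue_range.split("-"))
    return last - first + 1


def expand_excludes(codes, longest):
    """Expand entries such as 'aN5+' into aN5, aN6, ... up to 'longest'."""
    expanded = []
    for code in codes.split(","):
        match = re.fullmatch(r"([A-Za-z]+)(\d+)(\+?)", code)
        if match is None:
            raise ValueError(f"unrecognised NOE code: {code}")
        atoms, separation = match.group(1), int(match.group(2))
        last = longest if match.group(3) else separation
        expanded.extend(f"{atoms}{n}" for n in range(separation, last + 1))
    return expanded


sse = {"ahelix1": "3-16", "ahelix2": "47-54"}
longest_helix = max(
    element_length(r) for name, r in sse.items() if name.startswith("ahelix")
)
print(expand_excludes("aN5+,Na2+,NN5+", longest_helix))
```

Comparing the printed list with the expanded list at the start of the OUT file is a quick way to confirm that the element definitions were read as intended.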
SANE assumes that the spectrum is aliased rather than folded. It will
account for aliased chemical shifts using the referencing parameters, in
the case of Felix data, or the chemical shifts of the edges of the spectrum,
in the case of NMRView data. At the moment there is no way to cope with
data that has been folded rather than aliased.
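For NMRView data, where the upper and lower edges of the spectrum are known, the position at which an aliased resonance appears can be sketched in Python as below. This is an illustration rather than SANE's own code. It assumes simple aliasing, in which a resonance outside the window reappears shifted by a whole number of spectral widths; the function name and the example window are hypothetical.

```python
def aliased_position(shift, lower, upper):
    """Return where a resonance at 'shift' appears in the window.

    Resonances inside [lower, upper) are returned unchanged; those
    outside are moved by whole multiples of the spectral width.
    """
    width = upper - lower
    return lower + (shift - lower) % width


# Hypothetical 15N dimension running from 105 to 130 ppm.
for shift in (118.0, 132.5, 102.0):
    print(shift, "->", aliased_position(shift, 105.0, 130.0))
```

Folded data would instead be reflected about the edge of the spectrum, which this wrap-around arithmetic does not reproduce.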
Defining your spectrum
For SANE to properly analyse your spectrum it needs to know what type of
experiment it is, the nuclei involved and how the processed data is arranged.
expt_flag defines the nature of the experiment as follows:
-
0-9 indicates a 2D NOESY,
-
10-19 an 15N edited 3D NOESY,
-
20-29 a 13C edited 3D NOESY,
-
30-39 an 15N edited 3D HSQC-NOESY-HSQC,
-
40-49 a 4D NOESY,
-
50-59 a select filter experiment (still to be fully implemented) and
-
60-69 a 3D aromatic NOESY experiment.
Shared time CN NOESY experiments can be specified by values of expt_flag
between 10 and 29. expt_flag is printed in
the comment section of each restraint in combination with the crosspeak
number. This allows each restraint to be traced back to its own spectrum
and crosspeak. The range of values allowed for each type of experiment
enables restraints from several different experiments of the same type,
e.g. different mixing times, to be used without confusion.
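The ranges of expt_flag listed above can be written as a small lookup table. The Python sketch below is only an illustration of the numbering scheme, not SANE's own code, and the function name is hypothetical.

```python
EXPERIMENT_RANGES = [
    (0, 9, "2D NOESY"),
    (10, 19, "15N edited 3D NOESY"),
    (20, 29, "13C edited 3D NOESY"),
    (30, 39, "15N edited 3D HSQC-NOESY-HSQC"),
    (40, 49, "4D NOESY"),
    (50, 59, "select filter experiment"),
    (60, 69, "3D aromatic NOESY"),
]


def experiment_type(expt_flag):
    """Return the experiment type for a given expt_flag value."""
    for low, high, description in EXPERIMENT_RANGES:
        if low <= expt_flag <= high:
            return description
    raise ValueError(f"expt_flag {expt_flag} is outside the defined ranges")


# Two 15N NOESY spectra with different mixing times could use 10 and 11.
for flag in (10, 11, 45):
    print(flag, experiment_type(flag))
```

Giving each spectrum its own value within a range keeps the experiment type the same while still letting a restraint be traced back to one spectrum.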
SANE needs to know which dimensions in the transformed matrix correspond
to the protons and which to the heteronuclei. It also needs to know what
the heteronucleus is. This information is specified with the parameters
protonA_dim, heteroA_dim, protonB_dim, heteroB_dim, heteroA and heteroB.
For example, an 15N NOESY transformed in the usual manner could be
specified by setting
heteroA = "N"
protonA_dim = 1
heteroA_dim = 3
protonB_dim = 2
For Shared time CN NOESY spectra the heteronuclear dimension must be specified
with "CN".
SANE also requires that you specify which dimension of the matrix is
the directly detected dimension. This is done with the parameter detect_dim.
SANE will not fold assignments in the directly detected dimension. This
leaves fewer possibilities for the program to consider and allows it to
run quicker, but if you folded the directly detected dimension using a
spectrometer without digital filters then those folded resonances
cannot be assigned correctly.
Bells and Whistles
volume to distance conversion
SANE uses bins to convert volumes to distances. It offers two different
methods. The first method (bin_volumes) requires the user to define a list
of boundaries (Bound1, Bound2, ...) and the distance bins (Bin1, Bin2, ...)
they correspond to. The volumes are then converted directly to bins.
In the second method (bin_distances) the volumes are converted to distances
and then a user defined list of boundaries (Bound1, Bound2, ...) is used to
sort the volumes into distance bins (Bin1, Bin2, ...). The second method
requires the gradient and intercept from a calibration. Both methods
divide the initial volume by a parameter, scaling_constant, and can use up
to ten different bins.
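The first method (bin_volumes) can be sketched in Python as below. This is not SANE's own code, and the way boundaries are compared is an assumption: the sketch takes the boundaries in decreasing order of scaled volume and places a peak in the first bin whose boundary it reaches, so that larger volumes give shorter distances. The numerical values are hypothetical.

```python
def bin_volume(volume, scaling_constant, bounds, bins):
    """Return the distance bin for a peak volume, or None if none applies.

    bounds and bins play the roles of Bound1, Bound2, ... and
    Bin1, Bin2, ... and are assumed to be the same length.
    """
    scaled = volume / scaling_constant
    for bound, distance in zip(bounds, bins):
        if scaled >= bound:
            return distance
    return None


# Hypothetical calibration: scaled-volume boundaries and distance bins.
bounds = [10.0, 3.0, 1.0, 0.0]
bins = [2.5, 3.5, 5.0, 6.0]
for volume in (1500.0, 450.0, 20.0):
    print(volume, "->", bin_volume(volume, 100.0, bounds, bins))
```

Printing the bins for a handful of well-understood peaks, such as sequential HN-HN contacts in a helix, is a reasonable sanity check on whichever boundaries are chosen.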
water
Peaks at the same chemical shift as the water can be ignored by specifying
the water_shift and the water_tolerance.
Peaks that fall within this range will be ignored.
ignore and adjust lists
SANE allows the user to define lists of peaks which can be ignored
(ignore_list), or their distance bins adjusted
to the next lower bin (adjust_list). To ignore
or adjust a peak the peak number is included in the appropriate list in
the parameter file. Including a peak number in the adjust list more than
once will cause it to be adjusted more than once. If an attempt is made to
adjust a peak to a bin longer than the longest bin, then the message,
"Attempted to adjust to lower than smallest bin", is printed and the restraint
is not written.
negative peaks
Setting the flag using_negative_volumes to "yes" causes SANE to treat
negative volumes as positive volumes. If using_negative_volumes
is not set then the usual analysis will be done for negative peaks but
a restraint will not be written and the message, "Crosspeak has negative
volume", will be printed.
number of ambiguities
The parameter accepted_possibilities controls
how many possibilities will be accepted when writing an ambiguous restraint.
Setting accepted_possibilities to 0 will include
all possibilities in the ambiguous restraint. A value of 1 will write only
unique restraints, a value of 2 will write unique restraints and ambiguous
restraints involving up to two possibilities, a value of 3 will write unique
restraints and ambiguous restraints involving up to three possibilities,
and so on.
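The effect of accepted_possibilities can be summarised in a few lines of Python. This is a sketch of the rule as described above rather than SANE's own code; the function name and return values are hypothetical.

```python
def restraint_kind(n_possibilities, accepted_possibilities):
    """Decide what is written for a peak after all filtering.

    Returns "unique", "ambiguous", or None when no restraint is written.
    """
    if n_possibilities == 0:
        return None
    if n_possibilities == 1:
        return "unique"
    # 0 means there is no limit on the number of possibilities.
    if accepted_possibilities == 0 or n_possibilities <= accepted_possibilities:
        return "ambiguous"
    return None


for accepted in (0, 1, 3):
    print(accepted, [restraint_kind(n, accepted) for n in range(0, 6)])
```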
unassigned resonances
Unassigned resonances may be included as possible assignments. This
is done by setting the flag using_unassigned
to true and including the names of the unassigned resonances in the assignment
list with their chemical shifts set to a value less than -999. If
this is done then SANE will use the BioMagRes Bank shifts, specified in
standard_shifts,
to determine the chemical shift range for the unassigned resonance. The
mean BMRB shift is used with the standard deviation as the tolerance. The
number of standard deviations can be specified with number_BMRB_stdevs,
which has a default value of 2. Any peak falling within the specified number
of standard deviations from the mean chemical shift of the unassigned resonance
will include that unassigned resonance as a possible assignment. Leaving
the unassigned resonances out of the assignment list will prevent SANE
from including unassigned resonances as possibilities.
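The matching rule for assigned and unassigned resonances can be sketched in Python as below. This is an illustration, not SANE's own code. The function name is hypothetical and the mean and standard deviation used in the example are made-up numbers, not values taken from the standard shifts file.

```python
UNASSIGNED_BELOW = -999.0  # shifts less than this mark unassigned resonances


def shift_matches(peak_shift, assigned_shift, tolerance,
                  bmrb_mean, bmrb_stddev, n_stddevs=2.0):
    """Test whether a peak position is compatible with a resonance."""
    if assigned_shift < UNASSIGNED_BELOW:
        # Unassigned: centre on the BMRB mean, widen by n standard deviations.
        centre, half_width = bmrb_mean, n_stddevs * bmrb_stddev
    else:
        # Assigned: use the measured shift and the user tolerance.
        centre, half_width = assigned_shift, tolerance
    return abs(peak_shift - centre) <= half_width


# Hypothetical HA resonance with an illustrative mean of 4.26 and
# standard deviation of 0.44 ppm.
print(shift_matches(4.40, -999.99, 0.02, 4.26, 0.44))  # unassigned resonance
print(shift_matches(4.40, 4.35, 0.02, 4.26, 0.44))     # assigned resonance
```

Because the window for an unassigned resonance is usually far wider than the user tolerance, each unassigned atom in the list can add many extra possibilities per peak.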
shortest average distance eliminated by contribution filter
When using both distance and contribution filtering it is possible
for the possibility with the shortest average distance to be eliminated
by the contribution filter. In such cases a warning message, "Possibility
with lowest mean distance eliminated by filters", is printed.
Elimination of the possibility with the shortest average distance can
occur if there is a large variation, throughout the ensemble, of the distances
associated with possibilities. Long range restraints tend to have larger
variations in their distances than short range restraints, so the contribution
filter tends to favour them. This is due to the use of the minimum distance,
rather than the average distance, when calculating the contribution filter.
In the future, NOE contributions may be calculated using the average distance,
rather than the minimum, or the option of choosing a method may be added.
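A small numerical example shows how this can happen. The Python sketch below is not SANE's own code; it assumes contributions proportional to the inverse sixth power of the shortest distance, and the two possibilities and their distances across a four-member ensemble are hypothetical.

```python
from statistics import mean


def contribution_survivors(ensemble_distances, cutoff):
    """Apply the contribution filter using the shortest distance."""
    weights = {name: min(d) ** -6 for name, d in ensemble_distances.items()}
    total = sum(weights.values())
    kept, running_sum = [], 0.0
    for name in sorted(weights, key=weights.get, reverse=True):
        kept.append(name)
        running_sum += weights[name] / total
        if running_sum > cutoff:
            break
    return kept


def lowest_mean_eliminated(ensemble_distances, cutoff):
    """True if the possibility with the lowest mean distance is discarded."""
    lowest = min(ensemble_distances, key=lambda n: mean(ensemble_distances[n]))
    return lowest not in contribution_survivors(ensemble_distances, cutoff)


# X is consistently close; Y is close in only one structure.
ensemble = {
    "X": [4.0, 4.1, 3.9, 4.0],   # mean 4.0, shortest 3.9
    "Y": [3.0, 6.5, 6.0, 6.5],   # mean 5.5, shortest 3.0
}
print(contribution_survivors(ensemble, 0.80))
print(lowest_mean_eliminated(ensemble, 0.80))
```

Here Y is short in a single structure, which is enough for it to dominate the contributions, so X is discarded even though it has the lower average distance.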
Future developments
Suggestions for improvements and new features, as well as bug reports,
are always welcome. At the moment I hope to (eventually) add the following:
-
The ability to cope with select filter data.
-
Closer integration with AMBER to allow the iteration of restraint generation
and molecular dynamics to become more automated.
-
It is still unclear whether contribution filtering works best using the
shortest distance in the ensemble or the average distance in the ensemble.
It may be that at different times in the structure determination process
one method is better than the other. The introduction of a flag allowing
the user to choose one method or the other may be useful.
Some other useful scripts
There are a few scripts for converting files and examining structures which
may be useful. These are a mixture of perl, awk and sed.
-
upl4dyana - converts the UPL file produced by SANE to a format accepted
by DYANA (and DIANA). It adjusts the field spacing, removes ambiguous
restraints, assumes that arginine, lysine, aspartic acid and glutamic acid
all have charged sidechains, arbitrarily assigns geminal methylenes and
adjusts pseudoatom nomenclature.
-
splitpdb - splits the combined PDB file produced by DYANA into separate
files.
-
pdb2ensemble - combines a list of PDB files into one file.
-
noecount - counts the number of intra-residue, sequential, medium (2,3,4),
long range (5 or more) and ambiguous restraints in a UPL file. The script
prints to the screen a list of the total number of each type of restraint
and the number for each residue. It also creates a new upl file, with the
suffix .new_upl, with duplicate restraints removed and the restraints sorted
by residue.
-
listviols - creates a list of distance or torsion restraint violations
from Amber .o files. Violating restraints are sorted according to the number
of structures in which they occur and then by the average size of the violation.
In the case of distance violations the peak number is listed as well.
-
rmsdbyresid - calculates rmsd per residue. Written by Ishwar Radhakrishnan.
-
sviol - makes a list of the energies and violations of a group of Amber
structures. Written by Randall Ketchem.
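The categories counted by noecount can be expressed as a simple rule on the residue separation. The Python sketch below only illustrates that classification for unambiguous restraints; it is not the noecount script itself, and the function names and example residue pairs are hypothetical.

```python
from collections import Counter


def restraint_range(residue_i, residue_j):
    """Classify a restraint by the separation of its two residues."""
    separation = abs(residue_i - residue_j)
    if separation == 0:
        return "intra-residue"
    if separation == 1:
        return "sequential"
    if separation <= 4:
        return "medium"
    return "long range"


def count_ranges(residue_pairs):
    """Count the restraints in each category."""
    return Counter(restraint_range(i, j) for i, j in residue_pairs)


# Hypothetical residue pairs taken from unambiguous restraints.
pairs = [(10, 10), (10, 11), (12, 15), (3, 47), (47, 3)]
print(count_ranges(pairs))
```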
Last updated 2001 August 16 by Brendan Duggan