PSS - Protein Structural Statistics - main program |
PSS - Protein Structural Statistics - main program
pss [options]
PSS reads an ensemble of PDB files of protein structures, performs a multiple sequence alignment, and computes structural statistics for each position of the alignment. Different optional functionalities are proposed: structure superposition, Cartesian coordinate statistics, dihedral angle calculation and statistics, and a cluster analysis based on dihedral angles. An HTML report containing a summary of results and figures can also be generated.
--help
(-h
)
Print help message
--verbose
(-v
)
Verbose mode
--project <name>
Name (default: project) of the directory where results are stored
--path <path>
Path (default: .) where the project directory is created
--pdb <path>
Path containing the PDB structures (default: .)
--structlist <structlist>
File containing a list of PDB codes or basenames, chain identifiers, and optional family names
--poslist <poslist>
File containing a list of positions and dihedrals to include in clustering
--align <method>
Alignment method among pdbnum
, clustalo
(default), seq-modeller
, and seqstr-modeller
--alignfile <file>
External alignment file in PIR format
--firstalignpos <pos>
First alignment position (default: 1)
--consensusfile <file>
External consensus sequence file in PIR format
--superp <method>
Structure superposition method among theseus
(default) and modeller
--superpref <ref>
Reference structure for superposition (for example 1GZ8:A
)
--superp3d
3D visualization of superposed structures in HTML report
--cartstat
Structural statistics in Cartesian coordinates
--dihestat
Structural statistics in dihedral coordinates
--nosymmetry
Don't symmetrize ASP, GLU, PHE, TYR sidechain dihedrals
--clust
Cluster analysis
--radius <radius>
Radius parameter in degrees for clustering (default: 60)
--linkage <method>
Clustering linkage method among s
(single), m
(maximum, default), and a
(average)
--report
Write HTML report
--html <path>
Path in HTML links (default: .)
--override
Force new execution instead of reusing existing calculations
PSS is a protein structural statistics program for Unix-like environments. It has a command-line interface, and is implemented in Perl as two separate files. The PSS.pl script is the main program, while the PSS.pm library provides object-oriented classes and methods to describe atoms, residues, molecules, sequences, and alignments. A separate documentation for PSS.pm is available in doc/PSS.pm.html.
Input data for PSS consists in a set of PDB files of protein structures.
Only standard PDB ATOM records are expected to be correctly treated by
PSS. Output consists in various files, stored in a dedicated project
directory, whose name can be changed with the --project <name>
option. The --path <path>
option (default: .) allows to
change the location of the project directory. Intermediate results
are written in text files, easily accessible and parsable. External
program input and output files are also kept. An HTML report can be
produced, facilitating the access to results and their analysis, through
hyperlinks, 3D visualization of superposed structures, figures, and tables.
The PSS.pl script is divided into subroutines, whose execution is optional and depends on requested options. First and mandatory steps consists in reading structures and/or sequences, and determining a multiple sequence alignment as well as a consensus sequence. Other optional subroutines include structure superposition, Cartesian statistics, dihedral calculation and statistics, clustering, and the production of figures and of an HTML report.
Iterative use of the program, where a first round of calculations is
completed by another run with additional analyses, is possible and
facilitated. PSS indeed detects when results are already present and
reuse them instead of repeating a calculation. Delete the folders or use
the --override
option to force a new calculation.
A structure list file can be provided as argument (--structlist
<structlist>
) to have more control on the PDB files read by PSS. The
structlist file contains on each line a PDB code or file basename, a
PDB chain identifier, and an optional user-defined family name, these
fields being separated by spaces (for example '1GZ8 A close'). If a PDB
code is detected, it is translated to uppercase letters. An empty chain
identifier can be entered as '_'. The first chain identifier of a PDB
file can be automatically detected with 'first', this is the default if
no chain identifier is entered. At last, all chain identifiers present
in a PDB file can be automatically considered with '*'.
When a structlist file has not been provided, PDB files are fetched
automatically from the execution directory or from another specified
directory (with the --pdb <path>
option). The first chain of
these PDB files is considered.
PDB entries requested in the structlist file are first looked for in the PDB directory. If not present, attempt is made to download the entries that have the format of a PDB code from the PDB server with Wget.
When non-explicit chain identifiers were provided in the structlist file, such as 'first' or '*', they need to be explicited by extracting the information from the PDB files.
Depending on other options, if atomic coordinates will be needed for analyses, sequences and structures are read and stored in PSS::Molecule objects, otherwise only sequences are read and stored in PSS::Sequence objects.
A position list file can be provided as argument (--poslist
<poslist>
) to restrict the list of amino acid positions and
structural variables of interest for certain analyses. The format of
this file is two columns separated by spaces.
The first column of the poslist file determines which positions are included in the cluster analysis and in dihedral distribution plots. The second column contains definitions of dihedral angles to be included in clustering at the positions specified in the first column.
The format of the first column consists in position ranges. Ranges can be a single position ('<pos>'), a range of positions ('<first>:<last>'), or all positions ('all'). The reference for position numbering is the multiple sequence alignment.
The format of the second column consists in a list of dihedral angles. Allowed values are 'B' for backbone (phi and psi), 'S' for sidechain (all chi angles), 'A' for all (all angles including omega), or a comma-separated list (no space) of angles among 'phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', or 'chi4'. The second column is optional (default is 'B').
Obtaining a multiple alignment of input protein sequences is an essential preliminary step in PSS workflow. PSS is indeed meant to analyze ensembles of proteins with similar but different sequences, and it is necessary to establish a correspondence between amino acid positions of the different sequences before structural comparison can be performed.
In PSS, we propose four different approaches to multiple sequence
alignment, available as arguments to the --align <method>
option:
pdbnum
, clustalo
, seq-modeller
, and seqstr-modeller
.
The pdbnum
method relies on PDB numbering, the clustalo
method (default) uses Clustal-Omega, while the seq-modeller
and
seqstr-modeller
methods employ the SALIGN module of MODELLER to
determine the multiple sequence alignment. SALIGN is a general alignment
tool making use of several features of protein sequences and structures.
A fifth possibility is to provide the multiple sequence alignment as an
external file, with the --alignfile <file>
option.
The multiple alignment produced is written in PIR, PAP, and FASTA formats in files $path/$project/align/align.pir, $path/$project/align/align.pap, and $path/$project/align/align.fasta.
The first alignment position is position 1 by default but this can be
changed with the --firstalignpos <pos>
option.
The first and simplest method (--align pdbnum
) consists in deriving
the multiple sequence alignment simply from PDB numbering. It is
assumed in this case that residues in the ensemble of PDB structures
are numbered in a consistent way, and the sequence alignment is
straightforwardly determined by residue numbers. It is the method of
choice for ensembles of structures corresponding to the same sequence
with only small variations like different extremities, point mutations
or engineered positions. Note that this method does not work if
insertion codes are present.
The second method (--align clustalo
) obtains a multiple sequence
alignment with Clustal-Omega, using default options.
The third method (--align seq-modeller
) obtains a multiple sequence
alignment with SALIGN, using only sequence information (feature 1 of
SALIGN). Local alignments are performed, using the 'tree' alignment
method, a BLOSUM type similarity matrix, affine 1D gap penalties of
-450 for opening and -50 for extension, and a maximum of 30 overhanging
residues at the sequence termini not penalized for gaps if not aligned.
The fourth method (--align seqstr-modeller
) obtains a multiple
sequence/structure alignment with SALIGN, using sequence and structure
informations. Three successive cycles are performed with respective
values for feature_weights of (1,0,0,0,1,0), (1,0.5,1,1,1,0), and
(1,1,1,1,1,0). The same options as in the seq-modeller
method are
used. When feature 2 is included, affine 3D gap penalties of 0 for
opening and 3 for extension are used.
The multiple sequence alignment is either read from an external file in
PIR format provided with the --alignfile <file>
option or from
the previously generated $path/$project/align/align.pir file. The
alignment is then stored in a PSS::Alignment object.
A consensus sequence is either read from an external file provided
with the --consensusfile <file>
option or determined from the
multiple sequence alignment. In the latter case, a consensus residue
is defined as the most frequent residue type (gaps excepted) at a
particular position in the multiple alignment. The consensus sequence
is used to define a reference amino acid type in structural statistics
involving sidechains. The consensus sequence is written alone in PIR
and FASTA formats in files $path/$project/align/consensus.pir and
$path/$project/align/consensus.fasta, combined with the multiple
alignment in files $path/$project/align/align_consensus.pir,
$path/$project/align/align_consensus.pap, and
$path/$project/align/align_consensus.fasta, and is stored internally
in a PSS::Sequence object. Amino acid composition and residue
identifiers by alignment position are also provided in files
$path/$project/align/composition.dat and
$path/$project/align/positions.dat.
When the --superpref <ref>
option was not given, the reference
structure for superposition is set to the first entry of the structure
list. When a reference structure was provided without chain identifier,
the first chain is selected.
Structure superposition is performed if the --superp <method>
option has been given. Available methods include theseus
(default)
and modeller
.
The reference structure for superposition can be chosen with the
--superpref <ref>
option (for example '1GZ8:A'). If a chain
identifier is not given as <ref>, for example '1GZ8', the first chain
of the PDB file is detected. Empty chain identifiers are entered as
'1GZ8:_'. If no reference structure is given, the first entry of the
structure list is chosen.
Superposed structures are stored in $path/$project/superp/pdb/$pdb$chain.pdb files.
The first method (--superp theseus
) performs structure superposition
with the Theseus program. Theseus reads the previously determined
sequence alignment and finds the optimal solution to the superposition
problem using the method of maximum likelihood. Default options are
used.
The second method (--superp modeller
) performs structure
superposition with the selection.superpose()
method of MODELLER.
MODELLER reads the previously determined sequence alignment and uses
an iterative least-square fitting algorithm. Residues that have a RMSD
greater than 1.0 angstrom with respect to the reference are excluded of
the selection in the next iteration, until there is no change in the
number of equivalent positions. Only backbone atoms are considered in
the superposition.
PyMOL and VMD scripts to load superposed structures are available at $path/$project/superp/superp.pml and $path/$project/superp/superp.vmd. They can be loaded with the following command lines:
pymol $path/$project/superp/superp.pml vmd -e $path/$project/superp/superp.vmd
Structures are loaded by these scripts from $path/$project/superp/pdb/$pdb$chain.pdb. If the $path variable is relative (default is .), PyMOL and VMD need to be run from the appropriate directory.
If the --superp3d
option was given, a JSmol script is prepared
(see HTML report).
If the --cartstat
option is given, structural statistics are
computed from Cartesian coordinates. It is assumed that structures have
previously been superposed, either with the --superp <ref>
option
or with an external program.
For each position of the sequence alignment, atomic coordinates and B-factors are read from the ensemble of structures. Averages and standard deviations of coordinates, as well as average B-factors are then calculated and averaged separately over backbone and sidechain atoms. Only sidechains of the same type as the consensus residue at a given position are taken into account.
A PDB file containing average coordinates, standard deviation of the position vector (RMSF) in the occupancy column, and average B-factors, is written in file $path/$project/cartesian/average.pdb.
A data file containing the alignment position, consensus residue name, average RMSF for backbone atoms, average RMSF for sidechain atoms, average B-factor for backbone atoms, and average B-factor for sidechain atoms in space delimited columns is written in file $path/$project/cartesian/cartstat.dat. In this file, B-factors are converted to RMSF values for comparison.
If the --dihestat
option is given, structural statistics are
computed from dihedral angles of original structures. Superposition is
indeed not necessary to compare internal coordinates.
For each position of the sequence alignment, dihedral angles
(phi, psi, omega, chi1, chi2, chi3, and chi4) are calculated
by analytical geometry from atomic coordinates and stored in
$path/$project/dihedrals/dihe_$pos.dat files. Dihedral angle
definitions follow IUPAC recommendations. These files contain the
structure name, family name in square brackets, residue number in
the original structure, residue name in the original structure, and
phi, psi, omega, chi1, chi2, chi3, and chi4 angle values, in space
delimited columns. Undefined dihedral angle values are marked as '-'.
A symmetrization of sidechain dihedral chi2 for residues ASP, PHE,
TYR, and of chi3 for GLU, is done to ensure that chemically equivalent
conformations (for example a switch between OD1 and OD2 atoms in ASP)
are seen as identical. This operation can be disabled with the
--nosymmetry
option.
Dihedral angle statistics are then calculated and stored in file $path/$project/dihedrals/dihestat.dat. This file contains the alignment position, consensus residue name, circular averages for phi, psi, omega, and chis angles, then circular standard deviations for the same angles, in space delimited columns. Sidechain dihedral angles are not taken into account in statistics if they belong to a residue of a different type as the consensus residue. If multiple families have been defined, statistics are also calculated for each of them separately and stored in $path/$project/dihedrals/dihestat.$family.dat files.
If the --clust
option is given, structures are subjected to
a cluster analysis, based on the Algorithm::Cluster Perl module, an
interface to the C Clustering Library.
Structural variables taken into account in clustering can be controlled with the poslist file (format is explained above). The first column of this file is used to select alignment positions and the second column defines dihedral angles taken into account at these positions.
Pairwise circular distances are computed from selected dihedral angles at each position. A global distance matrix is then calculated as the square root of the sum of squared circular distances divided by the number of selected angles and stored in the $path/$project/clustering/distmat.dat file. Sidechain dihedral angles are not taken into account in the distance matrix if they belong to a residue of a different type as the consensus residue.
Hierarchical clustering is then performed with the treecluster()
method of Algorithm::Cluster. Variants of hierarchical clustering
available in Algorithm::Cluster are single-, maximum-, and
average-linkage methods, and can be selected with the
--linkage <method>
option (default is 'm' for maximum-linkage). They differ
by the definition of inter-cluster distance. The tree obtained is
stored in the $path/$project/clustering/tree.dat file. Clusters are
then obtained from the tree by cutting it with the cut()
method
of Algorithm::Cluster::Tree, according to the radius parameter selected
with the --radius <radius>
option (default is 60 degrees). The
radius parameter corresponds to the minimum inter-cluster distance
allowed. The list of clusters obtained along with structure members
is stored in the $path/$project/clustering/clusters.dat file.
Additionally, a list of variables taken into account at each position
is stored in $path/$project/clustering/featurelist.dat and a list of
structures taken into account at each position and globally is stored in
$path/$project/clustering/includelist.dat.
In some situations, the number of structures included is zero and clustering cannot be performed. It means that some of the requested variables were defined in none of the structures. The recommended procedure is then to inspect the $path/$project/clustering/includelist.dat file, find the positions responsible for the exclusion of structures, prepare a poslist file without these positions, and run again the cluster analysis.
If the --report
option is given, an HTML report containing a
summary of results and figures obtained for the project is created and
stored in $path/$project/index.html.
A reminder of arguments passed to PSS is first provided, followed by a link to the $path/$project/command file, containing the full command line executed. Then a table of the structures studied with chains and optional family names is included, followed by a link to the $path/$project/structlist.dat file. The multiple alignment along with the consensus sequence are then displayed, and links to $path/$project/align/align.fasta, $path/$project/align/consensus.fasta, $path/$project/align/composition.dat, and $path/$project/align/positions.dat files are provided.
If superposition of structures has been performed, direct links to superposed PDB files are provided, as well as links to PyMOL and VMD scripts to load superposed structures at once with these programs.
If the --superp3d
option was given, a JSmol applet is included
in the HTML report. This allows an interactive 3D visualization of the
superposed structures. The rendering style can be modified and the
individual structures can be displayed or hidden with checkboxes.
If the Cartesian analysis has been performed, plots of root mean square fluctuations of Cartesian coordinates and average B-factors, for backbone or sidechain atoms, as a function of the alignment position are included. These plots are superposable to ease comparison, each plot can be hidden/displayed by clicking on the corresponding link above the plotting area. In addition, a sortable and scrollable table containing the data is provided below the plots. Sorting is obtained by clicking on column headers and is achieved with the SortTable JavaScript library (http://www.kryogenix.org/code/browser/sorttable/). Note that basic combined sorting is possible, for example a sort by amino acid type following a sort by RMSF will preserve the RMSF order for each amino acid type. Reinitialization is obtained by a position sort. Links to the raw data $path/$project/cartesian/cartstat.dat file and to the average structure $path/$project/cartesian/average.pdb file are also provided.
If the dihedrals analysis has been performed, plots of phi, psi, omega, chi1, chi2, chi3, and chi4 dihedral angle standard deviations as a function of the alignment position are included. As for the Cartesian analysis, plots are superposable, tables are sortable and scrollable, and a link to the raw data $path/$project/dihedrals/dihestat.dat file is provided.
Additionally, Ramachandran (phi/psi), chi1/chi2, and individual dihedral angle distributions are plotted for each alignment position defined in the poslist file. If different families were defined in the structlist file, different colors are used to distinguish points belonging to each family. On the left of plots, the 'raw data' link gives access to the $path/$project/dihedrals/dihe_$pos.dat files, the 'globalstat' link allows the visualization of dihedral angle circular averages and standard deviations on top of the distributions, and the 'familystat' link does the same for circular averages and standard deviations by family. Sidechain dihedral angles are not represented in distribution plots if they belong to a residue of a different type as the consensus residue. Note that displaying the distribution plots of many positions can be an heavy task for the browser.
If clustering has been performed, a reminder of dihedral angle variables taken into account is first presented in a table, followed by a link to the $path/$project/poslist.dat file. The clustering results are then presented in a table of cluster members sorted by cluster number. Then a link to the $path/$project/clustering/structlist.dat file, a structlist-formatted file with cluster numbers in the family column, is provided to facilitate the setup of another run of PSS with a definition of families corresponding to identified clusters. At last, a dendrogram plot is included to visualize the clustering tree obtained, with the chosen cutting radius as a vertical line.
A testcase can be found in the test directory of PSS distribution.
Command line to perform the full testcase:
pss --pdb pdb --structlist structlist.dat --poslist poslist.dat --align pdbnum --firstalignpos -2 --superp theseus --superpref 1GZ8:A --superp3d --cartstat --dihestat --clust --radius 60 --report
7 structures listed in the structlist.dat file are read from the pdb folder
The multiple sequence alignment is derived from PDB numbering
Structures are superposed on the A chain of the 1GZ8 structure with Theseus
Structural statistics in Cartesian coordinates are calculated
Structural statistics in dihedral coordinates are calculated
A cluster analysis is performed based on positions and dihedrals defined in the poslist.dat file
A report in HTML format is produced at project/index.html including 3D visualization of superposed structures (dihedral distribution plots are limited to positions defined in poslist.dat)
Run the Install interactive installation script. It will ask for the location of external programs required to perform certain tasks and prepare a pss shell script adapted to your system, which you can put in your PATH to be able to launch PSS from anywhere.
./Install
Clustal-Omega is used for sequence alignment. Theseus is used for structure superposition. MODELLER version ≥ 9 can also be used for sequence alignment and structure superposition. It is also possible to provide the sequence alignment as an external file, and structures superposed with another program can be given as input to PSS. The Algorithm::Cluster Perl module is used for clustering, Gnuplot version ≥ 4.4, Awk, and Grep are used to generate figures, and Wget is used to download structures from the PDB. Clustal-Omega, Theseus, Gnuplot, Awk, Grep, and Wget are included in most GNU/Linux distributions.
MODELLER can be obtained free of charge after registration from http://salilab.org/modeller/download_installation.html and installed following instructions. The generic UNIX installation procedure involves unpacking the compressed tarball and running the Install script. Note the full path of the MODELLER executable.
Algorithm::Cluster can be obtained from CPAN (http://search.cpan.org/dist/Algorithm-Cluster) and installed with the following procedure:
perl Makefile.PL (for system-wide installation) or perl Makefile.PL prefix=/some/other/directory (for user installation)
make make test make install
Note the full path of the Algorithm::Cluster library (for example /some/other/directory/lib/perl5).
We appreciate if users report bugs, feature requests, improvements, or patches to <thomas.gaillard@polytechnique.edu>.
The project homepage is http://thomasgaillard.fr/pss.
Please cite the following article in a publication using PSS:
Protein structural statistics with PSS
Thomas Gaillard, Benjamin B.L. Schwarz, Yassmine Chebaro, Roland H. Stote, and Annick Dejaegere
J. Chem. Inf. Model. 2013, 53, 2471-2482.
DOI:10.1021/ci400233j
1.2
Thomas Gaillard <thomas.gaillard@polytechnique.edu>, with contributions from Benjamin Schwarz, Roland H. Stote, and Annick Dejaegere
Copyright (c) 2006-2016 Thomas Gaillard, Benjamin Schwarz, Roland H. Stote, Annick Dejaegere
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 or any later version.
PSS - Protein Structural Statistics - main program |