PSS - Protein Structural Statistics - main program


NAME

PSS - Protein Structural Statistics - main program


SYNOPSIS

    pss [options]

PSS reads an ensemble of PDB files of protein structures, performs a multiple sequence alignment, and computes structural statistics for each position of the alignment. Different optional functionalities are proposed: structure superposition, Cartesian coordinate statistics, dihedral angle calculation and statistics, and a cluster analysis based on dihedral angles. An HTML report containing a summary of results and figures can also be generated.


OPTIONS


DESCRIPTION

PSS is a protein structural statistics program for Unix-like environments. It has a command-line interface, and is implemented in Perl as two separate files. The PSS.pl script is the main program, while the PSS.pm library provides object-oriented classes and methods to describe atoms, residues, molecules, sequences, and alignments. A separate documentation for PSS.pm is available in doc/PSS.pm.html.

Input data for PSS consists in a set of PDB files of protein structures. Only standard PDB ATOM records are expected to be correctly treated by PSS. Output consists in various files, stored in a dedicated project directory, whose name can be changed with the --project <name> option. The --path <path> option (default: .) allows to change the location of the project directory. Intermediate results are written in text files, easily accessible and parsable. External program input and output files are also kept. An HTML report can be produced, facilitating the access to results and their analysis, through hyperlinks, 3D visualization of superposed structures, figures, and tables.

The PSS.pl script is divided into subroutines, whose execution is optional and depends on requested options. First and mandatory steps consists in reading structures and/or sequences, and determining a multiple sequence alignment as well as a consensus sequence. Other optional subroutines include structure superposition, Cartesian statistics, dihedral calculation and statistics, clustering, and the production of figures and of an HTML report.

Iterative use of the program, where a first round of calculations is completed by another run with additional analyses, is possible and facilitated. PSS indeed detects when results are already present and reuse them instead of repeating a calculation. Delete the folders or use the --override option to force a new calculation.

read_structlist - Read structure list

A structure list file can be provided as argument (--structlist <structlist>) to have more control on the PDB files read by PSS. The structlist file contains on each line a PDB code or file basename, a PDB chain identifier, and an optional user-defined family name, these fields being separated by spaces (for example '1GZ8 A close'). If a PDB code is detected, it is translated to uppercase letters. An empty chain identifier can be entered as '_'. The first chain identifier of a PDB file can be automatically detected with 'first', this is the default if no chain identifier is entered. At last, all chain identifiers present in a PDB file can be automatically considered with '*'.

guess_structlist - Guess structure list

When a structlist file has not been provided, PDB files are fetched automatically from the execution directory or from another specified directory (with the --pdb <path> option). The first chain of these PDB files is considered.

get_structures - Get structures

PDB entries requested in the structlist file are first looked for in the PDB directory. If not present, attempt is made to download the entries that have the format of a PDB code from the PDB server with Wget.

complete_structlist - Complete structure list

When non-explicit chain identifiers were provided in the structlist file, such as 'first' or '*', they need to be explicited by extracting the information from the PDB files.

read_{structures|sequences} - Read structures and/or sequences

Depending on other options, if atomic coordinates will be needed for analyses, sequences and structures are read and stored in PSS::Molecule objects, otherwise only sequences are read and stored in PSS::Sequence objects.

read_poslist - Read position list

A position list file can be provided as argument (--poslist <poslist>) to restrict the list of amino acid positions and structural variables of interest for certain analyses. The format of this file is two columns separated by spaces.

The first column of the poslist file determines which positions are included in the cluster analysis and in dihedral distribution plots. The second column contains definitions of dihedral angles to be included in clustering at the positions specified in the first column.

The format of the first column consists in position ranges. Ranges can be a single position ('<pos>'), a range of positions ('<first>:<last>'), or all positions ('all'). The reference for position numbering is the multiple sequence alignment.

The format of the second column consists in a list of dihedral angles. Allowed values are 'B' for backbone (phi and psi), 'S' for sidechain (all chi angles), 'A' for all (all angles including omega), or a comma-separated list (no space) of angles among 'phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', or 'chi4'. The second column is optional (default is 'B').

align_{pdbnum|clustalo|seq-modeller|seqstr-modeller} - Multiple sequence alignment

Obtaining a multiple alignment of input protein sequences is an essential preliminary step in PSS workflow. PSS is indeed meant to analyze ensembles of proteins with similar but different sequences, and it is necessary to establish a correspondence between amino acid positions of the different sequences before structural comparison can be performed.

In PSS, we propose four different approaches to multiple sequence alignment, available as arguments to the --align <method> option: pdbnum, clustalo, seq-modeller, and seqstr-modeller. The pdbnum method relies on PDB numbering, the clustalo method (default) uses Clustal-Omega, while the seq-modeller and seqstr-modeller methods employ the SALIGN module of MODELLER to determine the multiple sequence alignment. SALIGN is a general alignment tool making use of several features of protein sequences and structures.

A fifth possibility is to provide the multiple sequence alignment as an external file, with the --alignfile <file> option.

The multiple alignment produced is written in PIR, PAP, and FASTA formats in files $path/$project/align/align.pir, $path/$project/align/align.pap, and $path/$project/align/align.fasta.

The first alignment position is position 1 by default but this can be changed with the --firstalignpos <pos> option.

align_pdbnum - Multiple sequence alignment from PDB numbering

The first and simplest method (--align pdbnum) consists in deriving the multiple sequence alignment simply from PDB numbering. It is assumed in this case that residues in the ensemble of PDB structures are numbered in a consistent way, and the sequence alignment is straightforwardly determined by residue numbers. It is the method of choice for ensembles of structures corresponding to the same sequence with only small variations like different extremities, point mutations or engineered positions. Note that this method does not work if insertion codes are present.

align_clustalo - Multiple sequence alignment with Clustal-Omega (default)

The second method (--align clustalo) obtains a multiple sequence alignment with Clustal-Omega, using default options.

align_seqmodeller - Multiple sequence alignment with MODELLER

The third method (--align seq-modeller) obtains a multiple sequence alignment with SALIGN, using only sequence information (feature 1 of SALIGN). Local alignments are performed, using the 'tree' alignment method, a BLOSUM type similarity matrix, affine 1D gap penalties of -450 for opening and -50 for extension, and a maximum of 30 overhanging residues at the sequence termini not penalized for gaps if not aligned.

align_seqstrmodeller - Multiple sequence/structure alignment with MODELLER

The fourth method (--align seqstr-modeller) obtains a multiple sequence/structure alignment with SALIGN, using sequence and structure informations. Three successive cycles are performed with respective values for feature_weights of (1,0,0,0,1,0), (1,0.5,1,1,1,0), and (1,1,1,1,1,0). The same options as in the seq-modeller method are used. When feature 2 is included, affine 3D gap penalties of 0 for opening and 3 for extension are used.

read_alignment - Read multiple sequence alignment

The multiple sequence alignment is either read from an external file in PIR format provided with the --alignfile <file> option or from the previously generated $path/$project/align/align.pir file. The alignment is then stored in a PSS::Alignment object.

consensus - Consensus sequence determination

A consensus sequence is either read from an external file provided with the --consensusfile <file> option or determined from the multiple sequence alignment. In the latter case, a consensus residue is defined as the most frequent residue type (gaps excepted) at a particular position in the multiple alignment. The consensus sequence is used to define a reference amino acid type in structural statistics involving sidechains. The consensus sequence is written alone in PIR and FASTA formats in files $path/$project/align/consensus.pir and $path/$project/align/consensus.fasta, combined with the multiple alignment in files $path/$project/align/align_consensus.pir, $path/$project/align/align_consensus.pap, and $path/$project/align/align_consensus.fasta, and is stored internally in a PSS::Sequence object. Amino acid composition and residue identifiers by alignment position are also provided in files $path/$project/align/composition.dat and $path/$project/align/positions.dat.

complete_superposeref - Complete the reference structure for superposition

When the --superpref <ref> option was not given, the reference structure for superposition is set to the first entry of the structure list. When a reference structure was provided without chain identifier, the first chain is selected.

superpose_{theseus|modeller} - Structure superposition

Structure superposition is performed if the --superp <method> option has been given. Available methods include theseus (default) and modeller.

The reference structure for superposition can be chosen with the --superpref <ref> option (for example '1GZ8:A'). If a chain identifier is not given as <ref>, for example '1GZ8', the first chain of the PDB file is detected. Empty chain identifiers are entered as '1GZ8:_'. If no reference structure is given, the first entry of the structure list is chosen.

Superposed structures are stored in $path/$project/superp/pdb/$pdb$chain.pdb files.

superpose_theseus - Structure superposition with Theseus (default)

The first method (--superp theseus) performs structure superposition with the Theseus program. Theseus reads the previously determined sequence alignment and finds the optimal solution to the superposition problem using the method of maximum likelihood. Default options are used.

superpose_modeller - Structure superposition with MODELLER

The second method (--superp modeller) performs structure superposition with the selection.superpose() method of MODELLER. MODELLER reads the previously determined sequence alignment and uses an iterative least-square fitting algorithm. Residues that have a RMSD greater than 1.0 angstrom with respect to the reference are excluded of the selection in the next iteration, until there is no change in the number of equivalent positions. Only backbone atoms are considered in the superposition.

prepare_superp_visu_scripts - Prepare scripts to visualize the superposed structures

PyMOL and VMD scripts to load superposed structures are available at $path/$project/superp/superp.pml and $path/$project/superp/superp.vmd. They can be loaded with the following command lines:

    pymol $path/$project/superp/superp.pml
    vmd -e $path/$project/superp/superp.vmd

Structures are loaded by these scripts from $path/$project/superp/pdb/$pdb$chain.pdb. If the $path variable is relative (default is .), PyMOL and VMD need to be run from the appropriate directory.

If the --superp3d option was given, a JSmol script is prepared (see HTML report).

cartstat - Structural statistics in Cartesian coordinates

If the --cartstat option is given, structural statistics are computed from Cartesian coordinates. It is assumed that structures have previously been superposed, either with the --superp <ref> option or with an external program.

For each position of the sequence alignment, atomic coordinates and B-factors are read from the ensemble of structures. Averages and standard deviations of coordinates, as well as average B-factors are then calculated and averaged separately over backbone and sidechain atoms. Only sidechains of the same type as the consensus residue at a given position are taken into account.

A PDB file containing average coordinates, standard deviation of the position vector (RMSF) in the occupancy column, and average B-factors, is written in file $path/$project/cartesian/average.pdb.

A data file containing the alignment position, consensus residue name, average RMSF for backbone atoms, average RMSF for sidechain atoms, average B-factor for backbone atoms, and average B-factor for sidechain atoms in space delimited columns is written in file $path/$project/cartesian/cartstat.dat. In this file, B-factors are converted to RMSF values for comparison.

dihestat - Structural statistics in dihedral coordinates

If the --dihestat option is given, structural statistics are computed from dihedral angles of original structures. Superposition is indeed not necessary to compare internal coordinates.

For each position of the sequence alignment, dihedral angles (phi, psi, omega, chi1, chi2, chi3, and chi4) are calculated by analytical geometry from atomic coordinates and stored in $path/$project/dihedrals/dihe_$pos.dat files. Dihedral angle definitions follow IUPAC recommendations. These files contain the structure name, family name in square brackets, residue number in the original structure, residue name in the original structure, and phi, psi, omega, chi1, chi2, chi3, and chi4 angle values, in space delimited columns. Undefined dihedral angle values are marked as '-'. A symmetrization of sidechain dihedral chi2 for residues ASP, PHE, TYR, and of chi3 for GLU, is done to ensure that chemically equivalent conformations (for example a switch between OD1 and OD2 atoms in ASP) are seen as identical. This operation can be disabled with the --nosymmetry option.

Dihedral angle statistics are then calculated and stored in file $path/$project/dihedrals/dihestat.dat. This file contains the alignment position, consensus residue name, circular averages for phi, psi, omega, and chis angles, then circular standard deviations for the same angles, in space delimited columns. Sidechain dihedral angles are not taken into account in statistics if they belong to a residue of a different type as the consensus residue. If multiple families have been defined, statistics are also calculated for each of them separately and stored in $path/$project/dihedrals/dihestat.$family.dat files.

clustering - Cluster analysis

If the --clust option is given, structures are subjected to a cluster analysis, based on the Algorithm::Cluster Perl module, an interface to the C Clustering Library.

Structural variables taken into account in clustering can be controlled with the poslist file (format is explained above). The first column of this file is used to select alignment positions and the second column defines dihedral angles taken into account at these positions.

Pairwise circular distances are computed from selected dihedral angles at each position. A global distance matrix is then calculated as the square root of the sum of squared circular distances divided by the number of selected angles and stored in the $path/$project/clustering/distmat.dat file. Sidechain dihedral angles are not taken into account in the distance matrix if they belong to a residue of a different type as the consensus residue.

Hierarchical clustering is then performed with the treecluster() method of Algorithm::Cluster. Variants of hierarchical clustering available in Algorithm::Cluster are single-, maximum-, and average-linkage methods, and can be selected with the --linkage <method> option (default is 'm' for maximum-linkage). They differ by the definition of inter-cluster distance. The tree obtained is stored in the $path/$project/clustering/tree.dat file. Clusters are then obtained from the tree by cutting it with the cut() method of Algorithm::Cluster::Tree, according to the radius parameter selected with the --radius <radius> option (default is 60 degrees). The radius parameter corresponds to the minimum inter-cluster distance allowed. The list of clusters obtained along with structure members is stored in the $path/$project/clustering/clusters.dat file. Additionally, a list of variables taken into account at each position is stored in $path/$project/clustering/featurelist.dat and a list of structures taken into account at each position and globally is stored in $path/$project/clustering/includelist.dat.

In some situations, the number of structures included is zero and clustering cannot be performed. It means that some of the requested variables were defined in none of the structures. The recommended procedure is then to inspect the $path/$project/clustering/includelist.dat file, find the positions responsible for the exclusion of structures, prepare a poslist file without these positions, and run again the cluster analysis.

figures and html - Generate figures and HTML report

If the --report option is given, an HTML report containing a summary of results and figures obtained for the project is created and stored in $path/$project/index.html.

A reminder of arguments passed to PSS is first provided, followed by a link to the $path/$project/command file, containing the full command line executed. Then a table of the structures studied with chains and optional family names is included, followed by a link to the $path/$project/structlist.dat file. The multiple alignment along with the consensus sequence are then displayed, and links to $path/$project/align/align.fasta, $path/$project/align/consensus.fasta, $path/$project/align/composition.dat, and $path/$project/align/positions.dat files are provided.

If superposition of structures has been performed, direct links to superposed PDB files are provided, as well as links to PyMOL and VMD scripts to load superposed structures at once with these programs.

If the --superp3d option was given, a JSmol applet is included in the HTML report. This allows an interactive 3D visualization of the superposed structures. The rendering style can be modified and the individual structures can be displayed or hidden with checkboxes.

If the Cartesian analysis has been performed, plots of root mean square fluctuations of Cartesian coordinates and average B-factors, for backbone or sidechain atoms, as a function of the alignment position are included. These plots are superposable to ease comparison, each plot can be hidden/displayed by clicking on the corresponding link above the plotting area. In addition, a sortable and scrollable table containing the data is provided below the plots. Sorting is obtained by clicking on column headers and is achieved with the SortTable JavaScript library (http://www.kryogenix.org/code/browser/sorttable/). Note that basic combined sorting is possible, for example a sort by amino acid type following a sort by RMSF will preserve the RMSF order for each amino acid type. Reinitialization is obtained by a position sort. Links to the raw data $path/$project/cartesian/cartstat.dat file and to the average structure $path/$project/cartesian/average.pdb file are also provided.

If the dihedrals analysis has been performed, plots of phi, psi, omega, chi1, chi2, chi3, and chi4 dihedral angle standard deviations as a function of the alignment position are included. As for the Cartesian analysis, plots are superposable, tables are sortable and scrollable, and a link to the raw data $path/$project/dihedrals/dihestat.dat file is provided.

Additionally, Ramachandran (phi/psi), chi1/chi2, and individual dihedral angle distributions are plotted for each alignment position defined in the poslist file. If different families were defined in the structlist file, different colors are used to distinguish points belonging to each family. On the left of plots, the 'raw data' link gives access to the $path/$project/dihedrals/dihe_$pos.dat files, the 'globalstat' link allows the visualization of dihedral angle circular averages and standard deviations on top of the distributions, and the 'familystat' link does the same for circular averages and standard deviations by family. Sidechain dihedral angles are not represented in distribution plots if they belong to a residue of a different type as the consensus residue. Note that displaying the distribution plots of many positions can be an heavy task for the browser.

If clustering has been performed, a reminder of dihedral angle variables taken into account is first presented in a table, followed by a link to the $path/$project/poslist.dat file. The clustering results are then presented in a table of cluster members sorted by cluster number. Then a link to the $path/$project/clustering/structlist.dat file, a structlist-formatted file with cluster numbers in the family column, is provided to facilitate the setup of another run of PSS with a definition of families corresponding to identified clusters. At last, a dendrogram plot is included to visualize the clustering tree obtained, with the chosen cutting radius as a vertical line.


TESTCASE

A testcase can be found in the test directory of PSS distribution.

Command line to perform the full testcase:

    pss --pdb pdb --structlist structlist.dat --poslist poslist.dat --align pdbnum --firstalignpos -2 --superp theseus --superpref 1GZ8:A --superp3d --cartstat --dihestat --clust --radius 60 --report


INSTALLATION

Run the Install interactive installation script. It will ask for the location of external programs required to perform certain tasks and prepare a pss shell script adapted to your system, which you can put in your PATH to be able to launch PSS from anywhere.

    ./Install

Clustal-Omega is used for sequence alignment. Theseus is used for structure superposition. MODELLER version ≥ 9 can also be used for sequence alignment and structure superposition. It is also possible to provide the sequence alignment as an external file, and structures superposed with another program can be given as input to PSS. The Algorithm::Cluster Perl module is used for clustering, Gnuplot version ≥ 4.4, Awk, and Grep are used to generate figures, and Wget is used to download structures from the PDB. Clustal-Omega, Theseus, Gnuplot, Awk, Grep, and Wget are included in most GNU/Linux distributions.

MODELLER

MODELLER can be obtained free of charge after registration from http://salilab.org/modeller/download_installation.html and installed following instructions. The generic UNIX installation procedure involves unpacking the compressed tarball and running the Install script. Note the full path of the MODELLER executable.

Algorithm::Cluster

Algorithm::Cluster can be obtained from CPAN (http://search.cpan.org/dist/Algorithm-Cluster) and installed with the following procedure:

    perl Makefile.PL (for system-wide installation)
or
    perl Makefile.PL prefix=/some/other/directory (for user installation)
    make
    make test
    make install

Note the full path of the Algorithm::Cluster library (for example /some/other/directory/lib/perl5).


SUPPORT

We appreciate if users report bugs, feature requests, improvements, or patches to <thomas.gaillard@polytechnique.edu>.


AVAILABILITY

The project homepage is http://thomasgaillard.fr/pss.


CITATION

Please cite the following article in a publication using PSS:

Protein structural statistics with PSS
Thomas Gaillard, Benjamin B.L. Schwarz, Yassmine Chebaro, Roland H. Stote, and Annick Dejaegere
J. Chem. Inf. Model. 2013, 53, 2471-2482.
DOI:10.1021/ci400233j


VERSION

1.2


AUTHOR

Thomas Gaillard <thomas.gaillard@polytechnique.edu>, with contributions from Benjamin Schwarz, Roland H. Stote, and Annick Dejaegere


COPYRIGHT

Copyright (c) 2006-2016 Thomas Gaillard, Benjamin Schwarz, Roland H. Stote, Annick Dejaegere

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 or any later version.

 PSS - Protein Structural Statistics - main program