Evolutionary inaccuracy of pairwise structural alignments (slide)
EVOLUTIONARY INACCURACY OF PAIRWISE STRUCTURAL ALIGNMENTS. Presenter: Nguyen Dinh Chien (阮庭戰). Authors: M. I. Sadowski and W. R. Taylor, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In this study, the authors analyzed the self-consistency of 7 widely used structural alignment methods (SAP, TM-align, Fr-TM-align, MAMMOTH, DALI, CE, and FATCAT) on a diverse, non-redundant set of 1863 domains from the SCOP database. Results: The degree of inconsistency of the alignments at the residue level is high (30%). Consistency varies between methods, with SAP and Fr-TM-align producing more consistent alignments than the rest. The ability of the methods to identify good structural alignments is also assessed using geometric measures.
INTRODUCTION The problem of aligning pairs of protein structures has attracted a significant level of research effort. Kolodny et al., 2005 and Mayr et al., 2007 are important contributions. Kolodny's study tested the ability of methods to find a good solution as judged by geometric criteria, and Mayr's study assessed the agreement of the aligned residues with a set of manually curated 'gold standard' alignments. They used geometric measures to assess the ability of aligners. They proposed that, if A and B are homologous and B and C are homologous, then A and C must also be homologous. In this study, the authors compared the most widely used methods for pairwise structural alignment, considering alignment accuracy relative to other annotation sources: DSSP structural classes and solvent accessibilities. They also used SCOP folds, GO annotations, topological distances, and several geometric scores in relation to external annotations. The different assessment methods highlight different strengths and weaknesses of each method.
Outline: INTRODUCTION, METHODS, RESULTS, DISCUSSION. METHODS: Data set; Structural alignment methods; Inconsistency measure; Calibration of data; Other geometric measures; Residue annotations; Assessment of symmetry
Data set In this study, the authors used a set of 1863 domains derived from the ASTRAL SCOP10 database. SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html ASTRAL: http://astral.berkeley.edu/ The set was restricted to high-quality structures by requiring a SPACI (Summary PDB ASTRAL Check Index, http://astral.berkeley.edu/spaci.html) score > 0.5 and by excluding NMR (Nuclear Magnetic Resonance) structures (http://nmr.cit.nih.gov/xplor-nih/xplorMan/node470.html) and those with missing residues.
Structural alignment methods All-versus-all pairwise structural alignments were generated using 7 methods: SAP, DALILite, MAMMOTH-MULT, FATCAT, CE, TM-align, and Fr-TM-align. These methods were selected because in many cases they are used to compute large sets of alignments for publicly available resources (FATCAT and CE for the PDB; DALILite for the DALI FSSP database) or have been used to draw conclusions about fold space (SAP, TM-align, DALILite and MAMMOTH-MULT). All methods were used with default parameters. The authors used Andreas Prlic's Java implementations of FATCAT and CE.
Inconsistency measure Inconsistency was assessed for all positions in any triplet of aligned structures (within a particular threshold). If a gap was found at a position in any of the three alignment sequences, the position was ignored. For each remaining position, they determined whether the condition E(Ai,Bj) ∧ E(Bj,Ck) ∧ E(Ai,Ck) was true, where the predicate E(Xi,Yj) means that position i in sequence X is aligned to position j in sequence Y. If the condition is false, inconsistency = 1; otherwise, inconsistency = 0. The proportion of inconsistent positions was found for all aligned triples for each method at each threshold and expressed as a percentage. The value computed over all residues is the absolute inconsistency; the value computed over subsets of residues with particular annotations is the relative inconsistency.
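The transitivity check on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each pairwise alignment is represented as a dictionary mapping a residue index in the first structure to its aligned residue index in the second (gapped positions are simply absent), and the function name `triplet_inconsistency` is hypothetical, not taken from the paper.

```python
from typing import Dict

# Assumed representation: residue index in structure X -> aligned index in Y.
# Residues aligned to a gap do not appear as keys.
Alignment = Dict[int, int]


def triplet_inconsistency(ab: Alignment, bc: Alignment, ac: Alignment) -> float:
    """Percentage of scored positions of A that fail the transitivity test."""
    scored = 0
    inconsistent = 0
    for i, j in ab.items():
        # Ignore the position if either of the other two alignments has a gap.
        if j not in bc or i not in ac:
            continue
        scored += 1
        k = bc[j]
        # Consistent only if A_i is aligned to the same C_k via both routes.
        if ac[i] != k:
            inconsistent += 1
    if scored == 0:
        return 0.0
    return 100.0 * inconsistent / scored


# Toy alignments (made-up indices): residue 2 of A reaches C_3 through B,
# but the direct A-C alignment pairs it with C_2.
ab = {0: 0, 1: 1, 2: 2, 3: 3}
bc = {0: 0, 1: 1, 2: 3, 3: 4}
ac = {0: 0, 1: 1, 2: 2, 3: 4}
print(triplet_inconsistency(ab, bc, ac))
```

Restricting the loop to residues that carry a given annotation (for example a DSSP class) would give the per-annotation value in the same way.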
Calibration of data The RMSD (root-mean-square deviation) and coverage values were used to approximate TM-scores for the alignments generated by each method. Approximate TM-score for TM-align: 0.981; real TM-score for the same alignments: 0.985. The approximate TM-scores for the other methods were correlated with TM-align as follows: SAP, 0.739; DALILite, 0.643; FATCAT, 0.774; FATCAT (flexible mode), 0.639; CE, 0.837; Fr-TM-align, 0.923. Next, they compared the fTM score with each method's own summary score to determine which was likely to provide the best ranking.
$$fTM = \frac{C}{L\left(1 + (R/D_0)^2\right)}, \qquad D_0 = 1.24\,\sqrt[3]{L - 15} - 1.8 \quad \text{(Zhang and Skolnick, 2004)}$$
where C is the number of aligned residues, L is the mean length of the structures, and R is the reported RMSD.
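As a rough illustration, the approximate TM-score can be computed as below, assuming the form fTM = C / (L (1 + (R/D0)^2)) with D0 = 1.24 (L - 15)^(1/3) - 1.8 (Zhang and Skolnick, 2004). The function and parameter names are hypothetical, and the guard on short structures is an added assumption to keep D0 real and positive.

```python
def approx_tm_score(n_aligned: int, mean_length: float, rmsd: float) -> float:
    """Approximate TM-score (fTM) from coverage and reported RMSD.

    n_aligned   -- C, the number of aligned residues
    mean_length -- L, the mean length of the two structures
    rmsd        -- R, the RMSD reported by the alignment method
    """
    if mean_length <= 15.0:
        # Assumption: the cube-root term is only taken for L > 15.
        raise ValueError("mean_length must exceed 15 residues")
    d0 = 1.24 * (mean_length - 15.0) ** (1.0 / 3.0) - 1.8
    if d0 <= 0.0:
        raise ValueError("structures too short for a positive D0")
    return n_aligned / (mean_length * (1.0 + (rmsd / d0) ** 2))


# A perfect, full-length superposition scores 1; the score falls as the
# RMSD grows or the coverage drops.
print(approx_tm_score(100, 100.0, 0.0))
print(approx_tm_score(80, 100.0, 2.0))
```

Because only C, L and R are needed, the same function can be applied to the output of any aligner, which is what makes it usable as a common ranking score.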
Other geometric measures To assess the geometric quality of reported alignments, they used the following formulae:
$$SI = \frac{R \cdot \min(L_1, L_2)}{C}$$
$$MI = 1 - \frac{1 + C}{\left(1 + R/W_0\right)\left(1 + \min(L_1, L_2)\right)}$$
$$SAS = \frac{100\,R}{C}$$
R: RMSD; C: alignment coverage; L1 and L2: lengths of the two sequences; W0: weighting parameter (W0 = 1.5, as in Kolodny et al., 2005).
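The three geometric measures can be sketched as follows, assuming the definitions of Kolodny et al., 2005: SI = R min(L1, L2) / C, MI = 1 - (1 + C) / ((1 + R/W0)(1 + min(L1, L2))) and SAS = 100 R / C. The function names are hypothetical, and C is taken here to be the number of aligned residues, which is an assumption about what the slide means by coverage.

```python
def similarity_index(rmsd: float, n_aligned: int, len1: int, len2: int) -> float:
    """SI: RMSD scaled by the shorter length over the aligned count (lower is better)."""
    return rmsd * min(len1, len2) / n_aligned


def match_index(rmsd: float, n_aligned: int, len1: int, len2: int,
                w0: float = 1.5) -> float:
    """MI: 0 for a perfect full-length match, approaching 1 as quality drops."""
    return 1.0 - (1.0 + n_aligned) / ((1.0 + rmsd / w0) * (1.0 + min(len1, len2)))


def structural_alignment_score(rmsd: float, n_aligned: int) -> float:
    """SAS: RMSD per 100 aligned residues (lower is better)."""
    return 100.0 * rmsd / n_aligned


# Made-up example: 80 residues aligned at 2.0 A between domains of 100 and 120 residues.
print(similarity_index(2.0, 80, 100, 120))
print(match_index(2.0, 80, 100, 120))
print(structural_alignment_score(2.0, 80))
```

All three measures reward low RMSD and high coverage together, so they penalise both short, tight alignments and long, loose ones.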
Residue annotations The Catalytic Site Atlas annotations (Porter et al., 2004) and annotations from PDB SITE records were used to produce datasets of functional residues (http://www.ebi.ac.uk/thornton-srv/databases/CSA_NEW/). Secondary structure assignments and accessibility values were taken from DSSP (Define Secondary Structure of Proteins, http://swift.cmbi.ru.nl/gv/dssp/). The consistency of the annotations was assessed separately using a chi-square test. Class I (π-helix) almost always aligns with class H (α-helix). Isolated β-bridges (B) align mostly with strands (class E), and the remaining non-coil classes align significantly together, suggesting that at greater distances these regions are interchangeable.
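The slide does not say how the chi-square test was set up, so the following is only a generic sketch of a Pearson chi-square statistic on a contingency table. It assumes a table whose rows and columns are DSSP classes and whose cells count how often a residue of one class is aligned to a residue of the other. The function name and the toy counts are hypothetical, and every row and column total is assumed to be non-zero.

```python
from typing import List


def chi_square_statistic(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Count expected if the aligned classes were independent.
            expected = row_totals[r] * col_totals[c] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy counts (not from the paper): rows and columns are two DSSP classes,
# for example H and E, with most aligned pairs staying within one class.
toy_counts = [[90.0, 10.0],
              [15.0, 85.0]]
print(chi_square_statistic(toy_counts))
```

A large statistic relative to the chi-square distribution with (r - 1)(c - 1) degrees of freedom indicates that aligned residues share a class more often than chance would predict.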
Residue annotations
Fold    Description                  Mean inconsistency   SD inconsistency   N
b.80    SS R/H Beta Helix            100.00%              0.00%              5
a.118   Alpha/alpha superhelix       86.70%               7.30%              10
a.24    Four helix up/down bundle    80.00%               8.10%              13
b.69    7-bladed beta propeller      58.70%               5.60%              8
a.102   alpha/alpha toroid           57.60%               28.50%             8
b.55    PH domain like-barrel        8.60%                3.30%              10
d.38    Thioesterase                 8.00%                2.60%              7
d.131   DNA clamp                    7.90%                0.40%              5
b.34    SH3-like barrel              6.90%                6.20%              11
d.37    CBS-domain pair              6.50%                1.80%              5
Table S2: Most and least consistent domains. The SCOP folds and their corresponding names are shown for the five most and five least consistently aligned domains at the highest threshold across all methods, along with the number of neighbours (N) at that level in the dataset.
Assessment of symmetry Symmetry values for protein structures were derived using the Fourier transform-based approach described by Taylor et al. (2002). Inconsistency values per domain were the mean for all methods at the highest threshold, which had 803 members; domains with fewer than 5 neighbours for TM-align were culled from the set, leaving 207 domains.
Outline: INTRODUCTION, METHODS, RESULTS, DISCUSSION. RESULTS: Choosing a score for ranking: ROC assessment; Assessment of self-consistency for structural aligners; Determining structural features associated with inconsistencies; Assessment by geometric measures
Choosing a score for ranking: ROC assessment Mean AUC values for ROC curves derived from each possible score for the methods presented. CE, DALI, FATCAT (rigid mode) and Fr-TM-align all perform excellently when scored using the approximate TM-score. MAMMOTH, FATCAT (flexible mode) and SAP all perform less well regardless of score.
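For readers unfamiliar with the AUC, a minimal sketch of how it can be computed for one score is given below. It assumes each alignment pair carries a score and a true/false label; the slide does not state what the positive class was, so treating pairs from the same SCOP fold as positives is an assumption, and the function name and example values are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a randomly chosen positive pair scores higher
    than a randomly chosen negative pair (ties count as one half).
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: True marks a pair assumed to share a SCOP fold.
scores = [0.82, 0.74, 0.41, 0.35, 0.60]
labels = [True, True, False, False, True]
print(roc_auc(scores, labels))
```

The quadratic loop is fine for a small illustration; an all-versus-all comparison of 1863 domains would call for a rank-based formulation instead.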
Assessment of self-consistency for structural aligners Fig 2: Inconsistency of pairwise structural alignments. The proportion of positions failing transitive consistency is shown for all alignment pairs in the relevant fraction of the set. The methods appear in the order FATCAT-flexible, MAMMOTH, CE, FATCAT, TM-align, DALI, Fr-TM-align, SAP from top to bottom on the left-hand edge of the graph.
Determining structural features associated with inconsistencies Fig 3: Improved consistency at residues marked functional. Absolute rates of inconsistency are shown for functional residues (solid lines) and all residues (dashed lines) for the three most consistent methods. These appear in the order DALI, Fr-TM-align, SAP from the top downwards along the left-hand edge.
Determining structural features associated with inconsistencies Fig 4: Relative inconsistencies for DSSP residue classes. Inconsistencies are shown as a percentage of the absolute value for each method. The upper panel shows results for the top 0.01% of alignments, the bottom panel for the top 1.5%.
Determining structural features associated with inconsistencies Fig 5: Relative inconsistency for three methods in relation to solvent accessibility. Solvent accessibility was split into classes in bins of 20%, with 0 being the lowest. Panels are arranged as in Figure 4.
Determining structural features associated with inconsistencies Figure S1: Symmetry and inconsistency. Mean inconsistency (X-axis) for 233 domains with more than 5 neighbours at the highest level of structural similarity is plotted against the power of the Fourier series as a measure of the internal symmetry of the structure (Y-axis, arbitrary units).
Determining structural features associated with inconsistencies Fig 6: Relative inconsistency as a function of gap distance. Panels are arranged as in Figure 4.
Assessment by geometric measures TM-align, FATCAT (flexible), and Fr-TM-align are the three best methods in all cases, regardless of the metric used. SAP and MAMMOTH both rank as worst by all metrics.
Discussion Even for the most consistent methods the level of inconsistency is very high. The most significant contributory factor to inconsistent structural alignments is the treatment of gaps. Another important issue is that optimization of structural similarity is not in all cases the ideal strategy for identifying homology. Flexible alignment is correctly seen as an important innovation in aligning protein structures; however, the results demonstrate that it is not a panacea. The least consistently aligned domains are the repeats such as beta-helices, and the least consistently aligned elements are generally helices. Another possibility for improving the results of large-scale pairwise alignments (e.g. in database search or when using large datasets) is to realign significantly similar structures using a consistency criterion.