Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Scoring functions and their use in the identification of peptides in mass spectrometry
Eugene A Kapp
Bioinformatics, Walter & Eliza Hall Institute of Medical Research
Bioinformatics Summer Course, Dec 2012
BLAST / FASTA
• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments
MS/MS-based ID
• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides
Peptide Information Content
[Figure: a four-residue peptide (side chains R1 to R4) carrying z protons (charge z+) fragments at a backbone amide bond by b ion formation and/or y ion formation, giving for example a b3 ion or a y1 ion. In each case the neutral fragment is pumped away by the vacuum system.]
Proton Mobility
For peptides with non-mobile protons, fragmentation tends to proceed via charge-remote mechanisms. MS/MS spectra will be dominated by a few ions, typically from cleavage on the:
• C-term side of D, E
• N-term side of P
Classification by precursor charge (zpre):
• Mobile: zpre > #Arg + #Lys + #His
• Partially mobile: zpre ≤ #Arg + #Lys + #His and zpre > #Arg
• Non-mobile: zpre ≤ #Arg
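The proton mobility classes can be expressed as a short rule on the residue counts. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the boundary cases (zpre equal to a residue count) fall into the less mobile class.

```python
def classify_proton_mobility(sequence: str, precursor_charge: int) -> str:
    """Classify a peptide as mobile, partially mobile or non-mobile.

    Assumption: when the charge equals a residue count, the peptide is
    assigned to the less mobile class.
    """
    n_arg = sequence.count("R")
    n_lys = sequence.count("K")
    n_his = sequence.count("H")

    if precursor_charge > n_arg + n_lys + n_his:
        return "mobile"
    if precursor_charge > n_arg:
        return "partially mobile"
    return "non-mobile"


if __name__ == "__main__":
    for seq, z in [("LGEDTLISYR", 2), ("HGKDTLISYR", 2), ("LGEDTLISYR", 1)]:
        print(seq, z, classify_proton_mobility(seq, z))
```

A 2+ precursor with a single Arg and no Lys or His is classed as mobile, whereas the same peptide at 1+ is non-mobile.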
Types of MS/MS Scoring Functions
• Non-statistical: correlation, dot-product -> raw score (SEQUEST/Comet)
• Blend: raw score -> E-value (X!Tandem)
• Statistical: peptide or fragment ion frequency statistics (Mascot/Andromeda) OR Bayesian model (xxx)
Others: SpectrumMill, GutenTag, MyriMatch, Digger, ProteinPilot, Sorcerer, pFind, Peaks, ProteinLynxGS, MSGF, Inspect, OMSSA etc...
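X!Tandem MS/MS Scoring
X!Tandem's preliminary score is a dot product of the acquired and model spectra. Because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions.
y/b Score = $\sum_{i=0}^{n} I_i \cdot P_i$
where $I_i$ is the intensity of spectrum peak i and $P_i$ is 1 if that peak is predicted, 0 otherwise.
[Figure: spectrum intensities compared with predicted (1,0) peaks; similar peaks (y/b ions) shown on a 0% to 100% scale.]
Image courtesy of Proteome Software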
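As an illustration of the dot product with a 0/1 model spectrum, the sketch below sums the intensities of observed peaks that fall within a tolerance of any predicted b or y ion. The function name, the peak representation and the default tolerance are assumptions made for this example; they are not taken from X!Tandem itself.

```python
def preliminary_score(observed_peaks, predicted_mzs, tol=0.5):
    """Sum of I_i * P_i over the observed peaks.

    observed_peaks: list of (m/z, intensity) tuples.
    predicted_mzs:  list of predicted b/y fragment m/z values.
    tol:            matching tolerance in m/z units (illustrative default).
    """
    score = 0.0
    for mz, intensity in observed_peaks:
        # P_i is 1 if any predicted ion lies within the tolerance, else 0.
        p = 1 if any(abs(mz - pm) <= tol for pm in predicted_mzs) else 0
        score += intensity * p
    return score


if __name__ == "__main__":
    peaks = [(100.0, 10.0), (200.0, 5.0), (300.0, 2.0)]
    predicted = [100.2, 300.4, 450.0]
    print(preliminary_score(peaks, predicted))
```

Only the peaks at 100.0 and 300.0 match a predicted ion here, so the unmatched peak at 200.0 contributes nothing.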
X!Tandem Hyperscore
X!Tandem modifies the preliminary score by multiplying by N factorial for the number of b and y ions assigned. The use of factorials is based on the hypergeometric distribution.
HyperScore = $\left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$
[Figure: spectrum intensities compared with predicted (1,0) peaks; similar peaks (y/b ions) shown on a 0% to 100% scale.]
Image courtesy of Proteome Software
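A minimal sketch of the hyperscore as stated on this slide, assuming the matched b-ion and y-ion intensities have already been collected into two lists. The function name is hypothetical, and any further scaling that a real search engine may apply to the result is not shown.

```python
from math import factorial


def hyperscore(matched_b_intensities, matched_y_intensities):
    """Dot product of matched intensities multiplied by N_b! and N_y!."""
    dot_product = sum(matched_b_intensities) + sum(matched_y_intensities)
    n_b = len(matched_b_intensities)
    n_y = len(matched_y_intensities)
    return dot_product * factorial(n_b) * factorial(n_y)


if __name__ == "__main__":
    # 2 matched b ions and 3 matched y ions: 21 * 2! * 3!
    print(hyperscore([10, 5], [2, 3, 1]))
```

The factorial terms make the score grow rapidly with the number of assigned ions, so a match supported by many fragments is rewarded far more than one supported by a few intense peaks.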
Histogram of hyperscores
Next, X!Tandem makes a histogram of all the hyperscores for all the peptides in the database that might match this spectrum. For example, in this figure, 52 peptides were found with a hyperscore of 19, and one peptide with a hyperscore of 83. X!Tandem assumes that the peptide with the highest hyperscore is correct, and all others are incorrect.
[Figure: histogram of # results (0 to 60) against hyperscore (0 to 100); the low-scoring bulk is labelled "incorrect IDs".]
Image courtesy of Proteome Software
Log histogram
If the data on the right side of the histogram (colored in the upper figure) are taken and log-transformed, they fall on a straight line. A straight line is the expected result from a statistical argument that assumes the incorrect results are random.
Note: this histogram is calculated independently for each spectrum.
[Figure: upper panel, # results (0 to 60) against hyperscore (0 to 100); lower panel, log(# results) (0 to 4) against hyperscore (20 to 50).]
Image courtesy of Proteome Software
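The straight-line fit described above can be sketched with an ordinary least-squares fit to the log-transformed counts. This is an illustration under stated assumptions, not X!Tandem's implementation: hyperscores are assumed to be integer-binned, the logarithm is assumed to be base 10, and `min_score` (a hypothetical parameter) marks where the "right side" of the histogram begins. The caller is expected to exclude the top-scoring candidate, which is assumed to be correct.

```python
import math
from collections import Counter


def fit_log_histogram(hyperscores, min_score):
    """Fit log10(count) = slope * score + intercept for scores >= min_score."""
    counts = Counter(hyperscores)
    points = [(s, math.log10(c)) for s, c in counts.items() if s >= min_score]
    if len(points) < 2:
        raise ValueError("need at least two histogram bins to fit a line")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic counts that fall exactly on a line in log space.
    scores = [20] * 1000 + [30] * 100 + [40] * 10
    print(fit_log_histogram(scores, min_score=20))
```

With the synthetic data the counts drop tenfold every 10 hyperscore units, so the fitted slope is -0.1 and the intercept is 5.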
Significant scores
X!Tandem has already assumed that the top hyperscore is the only possible correct match. This match is significant if it is greater than the point at which the straight line through the log data intersects the log(# results) = 0 line. Any hyperscores greater than this are unlikely to have arisen by chance.
[Figure: upper panel, # results against hyperscore (0 to 100); lower panel, log(# results) against hyperscore (20 to 50), with the region beyond the intersection labelled "significant".]
Image courtesy of Proteome Software
X!Tandem E-value
The E-value expresses just how unlikely a greater hyperscore is. X!Tandem calculates the E-value by extrapolating the red line of the log histogram. For the example shown, the number of results expected by chance at a hyperscore of 83 is read off where the red line crosses 83. This value is the E-value; its log is -8.2, as shown (figure annotation: E-value=e-8.2).
[Figure: upper panel, # results (0 to 60) against hyperscore (0 to 100); lower panel, log(# results) (-10 to 6) against hyperscore (0 to 100), with the fitted red line extrapolated to a hyperscore of 83.]
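Given a fitted line log(# results) = slope * hyperscore + intercept, both the significance threshold and the extrapolated E-value follow from simple arithmetic. The sketch below uses hypothetical function names and made-up line parameters; it does not reproduce the numbers in the figure, and the base of the logarithm is left as a parameter because the slide does not state it.

```python
def log_evalue(score, slope, intercept):
    """Extrapolate the fitted line to the given hyperscore."""
    return slope * score + intercept


def evalue(score, slope, intercept, base=10.0):
    """Convert the extrapolated log value back to an E-value.

    Assumption: `base` must match the logarithm used when fitting the line.
    """
    return base ** log_evalue(score, slope, intercept)


def significance_threshold(slope, intercept):
    """Hyperscore at which the fitted line crosses log(# results) = 0."""
    return -intercept / slope


def is_significant(score, slope, intercept):
    return score > significance_threshold(slope, intercept)


if __name__ == "__main__":
    slope, intercept = -0.1, 5.0  # illustrative values only
    print(significance_threshold(slope, intercept))
    print(log_evalue(83, slope, intercept), evalue(83, slope, intercept))
```

Because the line has a negative slope, every hyperscore above the threshold extrapolates to fewer than one expected random match.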
Why is probability-based scoring important?
• Human (even expert) judgment is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results
• Arbitrary scoring schemes are susceptible to false positives
Can we calculate a probability that a match is correct?
• Yes, if it is a test sample and you know what the answer should be
  – Matches to the expected protein sequences are defined to be correct
  – Matches to other sequences are defined to be wrong
• If the sample is an unknown, then you have to define "correct" very carefully:
  – The best match in the database?
  – The best match out of all possible peptides?
  – The peptide sequence that is uniquely and completely defined by the MS data?
Probability model: Andromeda (MaxQuant) – Theoretical model
Binomial, Hypergeometric, Poisson, or EVD?
$P = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}$
where
• N is the # of possible fragment ion matches (peplen * 2)
• n is the # of observed fragment ion matches
• k is the # of matches (the summation index)
• p (probability of a match) = peak depth / numbins
• peak depth = # of peaks per 100 Da window (max 10)
• numbins = 100 / (2 * frag_ion_tol)
Score = $-10 \cdot \log_{10} P$
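The binomial form of this score can be computed directly from the definitions on the slide. The following is a sketch under those definitions only, with hypothetical function names; it is not Andromeda's code, and it leaves the choice of peak depth (at most 10 per 100 Da window) to the caller.

```python
from math import comb, log10


def match_probability(peak_depth, frag_ion_tol):
    """p = peak depth / numbins, with numbins = 100 / (2 * frag_ion_tol)."""
    numbins = 100 / (2 * frag_ion_tol)
    return peak_depth / numbins


def binomial_score(n_observed, n_possible, p):
    """Score = -10 * log10(P), where P is the chance of n_observed or more
    random matches out of n_possible (e.g. peptide length * 2)."""
    tail = sum(
        comb(n_possible, k) * p ** k * (1 - p) ** (n_possible - k)
        for k in range(n_observed, n_possible + 1)
    )
    return -10 * log10(tail)


if __name__ == "__main__":
    p = match_probability(peak_depth=10, frag_ion_tol=0.5)
    # A 10-residue peptide: 20 possible fragment ion matches, 12 observed.
    print(p, binomial_score(12, 20, p))
```

With a fragment ion tolerance of 0.5 Da and 10 peaks per 100 Da window, p is 0.1, so each additional observed match raises the score substantially.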
Limitations of E-values or P-values
1) P-values or E-values are not well suited for the analysis of large-scale datasets: they do not allow estimation of global error rates (FDR) as a function of filtering threshold (need for multiple testing correction)
2) They do not directly incorporate additional useful information (e.g., # of missed cleavages, mass accuracy, retention time etc.)
What makes a good decoy?
1) For all target "real" peptides generate decoys "on-the-fly" – easy, built-in.
2) Reverse internal peptide residues – if palindrome then randomise. Example: LGEDTLISYR -> LYSILTDEGR
3) I/L residues taken into account.
4) PTMs are kept constant but shifted internally within peptide.
5) Similar implementation in Crux (Univ. of Washington – Noble, MacCoss).
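The on-the-fly decoy rule (reverse the internal residues, randomise if the result is a palindrome) might be sketched as follows. The function name is hypothetical, PTM handling is omitted, and treating I and L as equivalent when checking whether the decoy differs from the target is an assumed reading of point 3.

```python
import random


def _il_equivalent(a: str, b: str) -> bool:
    # Assumption: I and L are treated as the same residue (same mass).
    return a.replace("I", "L") == b.replace("I", "L")


def make_decoy(peptide: str, rng=None, max_tries: int = 10) -> str:
    """Keep the terminal residues and reverse the internal ones.

    If the reversed peptide is indistinguishable from the target
    (palindromic internal sequence), shuffle the internal residues instead.
    """
    if len(peptide) < 3:
        return peptide
    first, internal, last = peptide[0], peptide[1:-1], peptide[-1]
    decoy = first + internal[::-1] + last
    if not _il_equivalent(decoy, peptide):
        return decoy

    rng = rng or random.Random(0)  # fixed seed for reproducible decoys
    residues = list(internal)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = first + "".join(residues) + last
        if not _il_equivalent(candidate, peptide):
            return candidate
    return decoy  # no distinct arrangement found (e.g. "KAAAR")


if __name__ == "__main__":
    print(make_decoy("LGEDTLISYR"))
    print(make_decoy("KAGEGAR"))
```

Keeping the terminal residues in place preserves the tryptic C-terminus and the precursor mass, so the decoy competes with the target on equal terms.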
Decoy strategies?
1) Separate or concatenated target-decoy sequence database?
Decoy strategies: Imperfection of scoring functions
SEQUEST:
• High scores for short (random) peptides
• High scores for larger search space
Decoy strategies: Imperfection of scoring functions
A) Covariance dependency – some spectra score well regardless
B) Reduce co-varying features by using post-processing tools (e.g. PeptideProphet, Percolator, q-ranker etc.) which combine multiple different features.
Summary: Statistical approaches
Single spectrum analyses (individual probabilities)
• Expectation values
• Similar to sequence similarity searching (BLAST)
Global analysis (individual and global error rates)
• Target-decoy strategy for global FDR
• Distribution modeling (e.g. PeptideProphet, Percolator) for local and global FDR estimation.
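As an illustration of the target-decoy strategy for global FDR, the sketch below estimates the FDR at a score threshold as the ratio of decoy to target matches passing that threshold. This ratio is one commonly used estimator for a concatenated target-decoy search and is an assumption here, as the slides do not give a formula. The function name and input layout are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate the global FDR among target matches scoring >= threshold.

    psms: list of (score, is_decoy) tuples, one per peptide-spectrum match.
    Assumption: FDR is approximated by (# decoys) / (# targets).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


if __name__ == "__main__":
    psms = [(90, False), (80, False), (70, True), (60, False), (50, True), (40, False)]
    for threshold in (75, 60, 40):
        print(threshold, estimate_fdr(psms, threshold))
```

Scanning the threshold from high to low shows the trade-off that single-spectrum E-values cannot express: accepting more matches raises the estimated proportion of false ones.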