Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

  1. Scoring functions and their use in the identification of peptides in mass spectrometry. Eugene A Kapp, Bioinformatics, Walter & Eliza Hall Institute of Medical Research. Bioinformatics Summer Course, Dec 2012.
  2. Shotgun proteomics: peptide identification methods
  3. BLAST / FASTA: • Sequence against sequence • Can be used to find weak / distant similarity • Can make gapped alignments. MS/MS-based ID: • Mass & intensity values against sequence • Looking for identity or near identity • Generally, short peptides
  4. Peptide Information Content. [Figure: a protonated peptide, H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH carrying z protons, fragments at a backbone amide bond by b ion formation and/or y ion formation (shown: b3 and y1); in each case the neutral partner is pumped away by the vacuum system.] Proton mobility: for peptides with non-mobile protons, fragmentation tends to proceed via charge-remote mechanisms, and MS/MS spectra will be dominated by a few ions, typically from cleavage at the C-term side of D and E and the N-term side of P. Mobile: zpre > #Arg + #Lys + #His. Partially mobile: zpre < #Arg + #Lys + #His and > #Arg. Non-mobile: zpre < #Arg.
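The three mobility classes can be written as a small classifier. This is a minimal sketch, not code from the talk: the function name `proton_mobility` is hypothetical, and because the slide's strict inequalities leave the equality cases undefined, the sketch assumes that equality falls into the lower-mobility class.

```python
def proton_mobility(peptide: str, charge: int) -> str:
    """Classify a precursor as 'mobile', 'partially mobile' or 'non-mobile'.

    peptide: one-letter amino acid sequence; charge: precursor charge (zpre).
    """
    n_arg = peptide.count("R")
    n_basic = n_arg + peptide.count("K") + peptide.count("H")
    if charge > n_basic:
        # More protons than basic residues: at least one proton is mobile.
        return "mobile"
    if charge > n_arg:
        # Assumption: charge == #Arg + #Lys + #His counts as partially mobile.
        return "partially mobile"
    # Assumption: charge == #Arg counts as non-mobile.
    return "non-mobile"


if __name__ == "__main__":
    # Charge state 2+ is assumed here for illustration.
    print(proton_mobility("VFIMDNCEELIPEYLNFIR", 2))
    print(proton_mobility("RVFIMDNCEELIPEYLNFIR", 2))
```

Under these assumptions the one-arginine peptide at 2+ is classed as mobile and the two-arginine peptide at 2+ as non-mobile.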
  5. Spectral information content: "Mobile" proton. [Figure: MS/MS spectrum of VFIMDNCEELIPEYLNFIR carrying the modifications labelled "ox" and "Pe"; relative abundance vs m/z (400 to 2000); labelled ions y4, y5, y6, y8, y9, b10, b11.] Annotations: • nP cleavage • metox loss • cP cleavage
  6. Spectral information content: "Non-mobile" proton. [Figure: MS/MS spectrum of RVFIMDNCEELIPEYLNFIR carrying the modifications labelled "ox" and "Pe"; relative abundance vs m/z (400 to 2000); labelled peaks -CH3SOH, -Pe, -(CH3SOH + Pe), MDNCE, b6, y6, y8, y11, y14.] Annotations: • metox loss • Pe loss • cD cleavage • cE cleavage • nP cleavage
  7. Aims of MS/MS Scoring Functions (1): Mobile Proton 2+ (1268 unique peptides). [Figure: histograms of number of spectra vs NXCorr for "Randomised" and "Correct sequence" matches; d = 0.55.]
  8. Aims of MS/MS Scoring Functions (1): Partially Mobile Proton 2+ (2223 unique peptides). [Figure: histograms of number of spectra vs NXCorr for "Randomised" and "Correct sequence" matches; d = 0.50.]
  9. Aims of MS/MS Scoring Functions (1): Non-Mobile Proton 2+ (264 unique peptides). [Figure: histograms of number of spectra vs NXCorr for "Randomised" and "Correct sequence" matches; d = 0.32.]
  10. Aims of MS/MS Scoring Functions (2): Tryptic search. [Figure: ROC curves, sensitivity vs 1-specificity.] AUC by scoring function: Mascot Ion score 0.98; PeptideProphet 0.96; Sonar 0.94; Tandem 0.93; Spectrummill (tag) 0.91; Sequest XCorr 0.91; Spectrummill 0.86. Source: Kapp et al., Proteomics 2005.
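An AUC such as those quoted on this slide can be read as the probability that a randomly chosen correct match scores higher than a randomly chosen incorrect one. The sketch below computes it that way from two lists of scores; the function name `roc_auc` and the toy scores are illustrative assumptions, not data from the cited study.

```python
def roc_auc(correct_scores, incorrect_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (correct, incorrect) pairs in which the correct
    match has the higher score; ties count as half a win.
    """
    wins = 0.0
    for c in correct_scores:
        for w in incorrect_scores:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    print(roc_auc([3, 4, 5], [1, 2, 3]))
```

The pairwise form is quadratic in the number of matches, which is fine for a small illustration; a rank-based computation would be the usual choice for large result sets.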
  11. Types of MS/MS Scoring Functions. • Non-statistical: correlation, dot-product -> raw score (SEQUEST/Comet) • Blend: raw score -> E-value (X!Tandem) • Statistical: peptide or fragment ion frequency statistics (Mascot/Andromeda) OR Bayesian model (xxx). SpectrumMill, GutenTag, MyriMatch, Digger, ProteinPilot, Sorcerer, pFind, Peaks, ProteinLynxGS, MSGF, Inspect, OMSSA etc.
  12. X!Tandem MS/MS Scoring. X!Tandem's preliminary score is a dot product of the acquired and model spectra. Because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions: $\text{y/b Score} = \sum_{i=0}^{n} I_i \cdot P_i$, where $I_i$ are the spectrum intensities and $P_i$ records whether the peak is predicted (1 or 0). [Figure: acquired vs model spectrum, similar peaks (y/b ions). Image courtesy of Proteome Software.]
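The preliminary score can be sketched in a few lines. This is an illustration of the formula on the slide rather than X!Tandem's actual code: the function name `preliminary_score`, the peak representation as `(m/z, intensity)` pairs and the matching tolerance `tol` are all assumptions.

```python
def preliminary_score(observed, predicted_mz, tol=0.5):
    """Sum of I_i * P_i over the observed peaks.

    observed: list of (mz, intensity) pairs from the acquired spectrum.
    predicted_mz: m/z values of the predicted b and y ions.
    tol: assumed fragment matching tolerance in Da.
    """
    score = 0.0
    for mz, intensity in observed:
        # P_i is 1 if the peak lies within tol of a predicted ion, else 0.
        p = 1 if any(abs(mz - t) <= tol for t in predicted_mz) else 0
        score += intensity * p
    return score


if __name__ == "__main__":
    peaks = [(100.0, 10.0), (200.3, 5.0), (300.0, 2.0)]
    print(preliminary_score(peaks, [100.2, 200.0, 400.0]))
```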
  13. X!Tandem Hyperscore. X!Tandem modifies the preliminary score by multiplying by N factorial for the number of b and y ions assigned: $\text{HyperScore} = \left(\sum_{i=0}^{n} I_i \cdot P_i\right) \cdot N_b! \cdot N_y!$. The use of factorials is based on the hypergeometric distribution. [Figure: acquired vs model spectrum, similar peaks (y/b ions). Image courtesy of Proteome Software.]
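A sketch of the hyperscore as written on this slide follows. The function names, the `(m/z, intensity)` peak representation and the tolerance `tol` are assumptions made for illustration; this is not the X!Tandem source.

```python
from math import factorial


def _is_matched(mz, targets, tol):
    """True if mz lies within tol of any value in targets."""
    return any(abs(mz - t) <= tol for t in targets)


def hyperscore(observed, b_mz, y_mz, tol=0.5):
    """Dot product of matched intensities times N_b! * N_y!.

    observed: list of (mz, intensity) pairs.
    b_mz, y_mz: predicted m/z values of the b and y ions.
    """
    predicted = list(b_mz) + list(y_mz)
    # Preliminary score: intensities of peaks matching any predicted ion.
    dot = sum(inten for mz, inten in observed if _is_matched(mz, predicted, tol))
    observed_mz = [mz for mz, _ in observed]
    # N_b and N_y: how many predicted b / y ions were found in the spectrum.
    n_b = sum(1 for t in b_mz if _is_matched(t, observed_mz, tol))
    n_y = sum(1 for t in y_mz if _is_matched(t, observed_mz, tol))
    return dot * factorial(n_b) * factorial(n_y)


if __name__ == "__main__":
    peaks = [(100.0, 10.0), (200.0, 5.0), (300.0, 2.0), (400.0, 1.0)]
    print(hyperscore(peaks, b_mz=[100.0, 200.0], y_mz=[300.0, 500.0]))
```

The factorial terms grow quickly, so a peptide that explains many b and y ions is rewarded far more than one that matches a single intense peak.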
  14. Histogram of hyperscores. Next, X!Tandem makes a histogram of all the hyperscores for all the peptides in the database that might match this spectrum. For example, in this figure, 52 peptides were found with a hyperscore of 19, and one peptide with a hyperscore of 83. X!Tandem assumes that the peptide with the highest hyperscore is correct, and all others are incorrect. [Figure: histogram of # results vs hyperscore (0 to 100); the low-scoring bulk is labelled "incorrect IDs". Image courtesy of Proteome Software.]
  15. Log histogram. If the data on the right side of the histogram (colored in the upper figure) are taken and log-transformed, they fall on a straight line. A straight line is the expected result from a statistical argument that assumes the incorrect results are random. Note: this histogram is calculated independently for each spectrum. [Figure: upper panel, # results vs hyperscore; lower panel, log(# results) vs hyperscore (20 to 50). Image courtesy of Proteome Software.]
  16. Significant scores. X!Tandem has already assumed that the top hyperscore is the only possible correct match. This match is significant if it is greater than the point at which the straight line through the log data intersects the log(# results) = 0 line. Any hyperscores greater than this are unlikely to have arisen by chance. [Figure: upper panel, # results vs hyperscore; lower panel, log(# results) vs hyperscore with the "significant" region marked. Image courtesy of Proteome Software.]
  17. X!Tandem E-value. The E-value expresses just how unlikely a greater hyperscore is. X!Tandem calculates the E-value by extrapolating the red line of the log histogram. For the example shown, the number of results expected by chance at a hyperscore of 83 is read off where the red line crosses 83. The log of this value is -8.2, which gives the E-value shown: E-value = e-8.2. [Figure: upper panel, # results vs hyperscore; lower panel, log(# results) vs hyperscore (0 to 100) with the extrapolated red line.]
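The histogram, log-linear fit and extrapolation described on the last four slides can be sketched as follows. This follows the slide description rather than the X!Tandem source; the function names, the use of base-10 logarithms, the choice of the histogram mode as the start of the "right side", and the integer-binned toy scores are all assumptions.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_evalue(hyperscores):
    """Return (log10 E-value of the top score, significance threshold).

    The top score is assumed to be the only possibly correct match; every
    other score is treated as incorrect and used to fit the null line.
    Needs at least two distinct scores at or above the histogram mode.
    """
    ranked = sorted(hyperscores)
    top, rest = ranked[-1], ranked[:-1]
    hist = Counter(rest)
    mode = max(hist, key=hist.get)
    # Right side of the histogram: from the mode upwards.
    xs = sorted(s for s in hist if s >= mode)
    ys = [math.log10(hist[s]) for s in xs]
    slope, intercept = fit_line(xs, ys)
    log10_e = intercept + slope * top      # extrapolate the line to the top score
    threshold = -intercept / slope         # where the line crosses log(# results) = 0
    return log10_e, threshold


if __name__ == "__main__":
    # Synthetic scores: counts fall tenfold per unit of hyperscore.
    scores = [20] * 1000 + [21] * 100 + [22] * 10 + [23] + [30]
    print(extrapolate_evalue(scores))
```

In this synthetic example the fitted line crosses zero at a hyperscore of 23, so a top score of 30 lies well inside the significant region.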
  18. Why is probability-based scoring important? • Human (even expert) judgment is subjective and can be unreliable
  19. Why is probability-based scoring important? • Human (even expert) judgment is subjective and can be unreliable • Standard, statistical tests of significance can be applied to the results • Arbitrary scoring schemes are susceptible to false positives.
  20. Can we calculate a probability that a match is correct? • Yes, if it is a test sample and you know what the answer should be: matches to the expected protein sequences are defined to be correct; matches to other sequences are defined to be wrong. • If the sample is an unknown, then you have to define "correct" very carefully: The best match in the database? The best match out of all possible peptides? The peptide sequence that is uniquely and completely defined by the MS data?
  21. Probability model: Andromeda (MaxQuant), theoretical model. Binomial, Hypergeometric, Poisson, or EVD? $P = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}$, where N is the # of possible fragment ion matches (peplen * 2), n is the # of observed fragment ion matches, k is the # of matches (the summation index), and p (probability of a match) = peak depth / numbins, where peak depth = # of peaks per 100 Da window (max 10) and numbins = 100 / (2 * frag_ion_tol). Score = -10 * log10(P).
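The binomial model on this slide can be evaluated directly. The sketch below follows the definitions given on the slide and should not be taken as the MaxQuant implementation; the function names and the example values (4 peaks per 100 Da, 0.5 Da tolerance) are assumptions.

```python
from math import comb, log10


def match_probability(peaks_per_100da, frag_ion_tol):
    """p = peak depth / numbins, with peak depth capped at 10."""
    numbins = 100 / (2 * frag_ion_tol)
    return min(peaks_per_100da, 10) / numbins


def binomial_score(peptide_length, n_matched, p):
    """Score = -10 * log10(P), P = chance of n_matched or more random matches.

    N, the number of possible fragment ion matches, is taken as
    2 * peptide_length as on the slide.
    """
    big_n = 2 * peptide_length
    tail = sum(
        comb(big_n, k) * p**k * (1 - p) ** (big_n - k)
        for k in range(n_matched, big_n + 1)
    )
    return -10 * log10(tail)


if __name__ == "__main__":
    p = match_probability(peaks_per_100da=4, frag_ion_tol=0.5)
    print(p, binomial_score(peptide_length=10, n_matched=8, p=p))
```

A wider fragment tolerance or a deeper peak selection raises p, so the same number of matched ions earns a lower score.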
  22. Probability model: Digger, empirical NULL model.
      For ONE candidate (decoy) peptide, fragment ion matches by ion series (columns) and level (rows):

      Level   a       b       b-98    b-18    b-17    b++     b++-98  b++-18  b++-17  y       y-98    y-18    y-17    y++     y++-98  y++-18  y++-17
      L1      0       2       0       1       0       0       0       0       0       3       0       1       0       0       0       0       0
      L2      0       3       0       1       0       0       0       0       0       3       0       1       0       0       0       0       0
      L3      1       3       0       2       0       0       0       0       0       3       0       1       0       0       0       0       0
      L4      1       3       0       2       1       0       0       0       0       4       0       1       1       1       0       0       0
      L5      1       3       0       2       1       0       0       0       0       4       0       2       1       1       0       0       1
      L6      1       3       0       2       1       0       0       0       0       5       0       2       1       1       0       0       1
      L7      1       3       0       2       1       0       0       0       0       5       0       2       1       1       0       1       1
      L8      1       3       0       2       1       0       0       0       0       5       0       2       2       1       0       1       2
      L9      1       3       0       2       2       0       0       0       0       5       0       2       2       1       0       1       2
      L10     1       3       0       2       2       0       0       0       0       5       0       2       2       1       0       1       3

      For ALL candidate (decoy) peptides: for each cell, calculate slope & intercept based on all decoy peptides, then extrapolate for more matches. [Figure: Score (-10*LgP) vs # of fragment ion matches (0 to 9), with the fitted line extrapolated.]
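One possible reading of the per-cell fit is sketched below: for a single cell (one ion series at one level), the match counts of all decoy peptides give an empirical tail probability, which is converted to -10*log10(P), fitted with a straight line and extrapolated to match counts that no decoy reached. This is an interpretation of the slide, not Digger's code; the function names and the toy decoy counts are hypothetical.

```python
from math import log10


def fit_null_line(decoy_counts):
    """Fit score = intercept + slope * matches for one cell.

    decoy_counts: number of fragment ion matches in this cell for each decoy.
    For every match count m up to the decoy maximum, the empirical tail
    P(matches >= m) is converted to -10*log10(P) and a least-squares line
    is fitted through the points. Requires a decoy maximum of at least 1.
    """
    n = len(decoy_counts)
    xs, ys = [], []
    for m in range(max(decoy_counts) + 1):
        tail = sum(1 for c in decoy_counts if c >= m) / n
        xs.append(m)
        ys.append(-10 * log10(tail))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolated_score(n_matches, slope, intercept):
    """Score for a candidate, including match counts beyond the decoy range."""
    return intercept + slope * n_matches


if __name__ == "__main__":
    # Toy decoy counts, invented for illustration.
    decoys = [0] * 900 + [1] * 90 + [2] * 10
    slope, intercept = fit_null_line(decoys)
    print(extrapolated_score(5, slope, intercept))
```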
  23. Limitations of E-values or P-values. 1) P-values or E-values are not well suited for the analysis of large-scale datasets: they do not allow estimation of global error rates (FDR) as a function of filtering threshold (need for multiple testing correction). 2) They do not directly incorporate additional useful information (e.g., # of missed cleavages, mass accuracy, retention time etc.).
  24. What makes a good decoy? 1) For all target "real" peptides generate decoys "on-the-fly": easy, built-in. 2) Reverse internal peptide residues; if palindrome then randomise. Example: LGEDTLISYR -> LYSILTDEGR. 3) I/L residues taken into account. 4) PTMs are kept constant but shifted internally within peptide. 5) Similar implementation in Crux (Univ. of Washington: Noble, MacCoss).
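Points 2 and 3 can be sketched as a small decoy generator: the terminal residues stay in place, the internal residues are reversed, and a peptide whose interior reads the same both ways (with I and L treated as equivalent) is shuffled instead. The function name `make_decoy` is hypothetical, and PTM handling (point 4) is deliberately left out of this sketch.

```python
import random


def make_decoy(peptide, rng=None):
    """Reverse the internal residues, keeping the first and last in place.

    If the interior is a palindrome once I is treated as L, the interior is
    shuffled until it differs. Peptides whose interior cannot be changed
    (too short, or a single repeated residue) are returned unchanged.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    if len(peptide) < 3:
        return peptide
    first, inner, last = peptide[0], peptide[1:-1], peptide[-1]

    def norm(seq):
        # I and L are treated as indistinguishable.
        return seq.replace("I", "L")

    reversed_inner = inner[::-1]
    if norm(reversed_inner) != norm(inner):
        return first + reversed_inner + last
    if len(set(norm(inner))) < 2:
        return peptide  # no rearrangement can produce a different interior
    residues = list(inner)
    while True:
        rng.shuffle(residues)
        candidate = "".join(residues)
        if norm(candidate) != norm(inner):
            return first + candidate + last


if __name__ == "__main__":
    print(make_decoy("LGEDTLISYR"))
```

Keeping the termini fixed preserves the tryptic C-terminal residue and the precursor mass, so a decoy competes with its target on equal terms.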
  25. What makes a good decoy?
  26. Decoy strategies? 1) Separate or concatenated target-decoy sequence database?
  27. Decoy strategies: imperfection of scoring functions. [Figure panels, SEQUEST: high scores for short (random) peptides; high scores for larger search space.]
  28. Decoy strategies: imperfection of scoring functions. A) Covariance dependency: some spectra score well regardless. B) Reduce co-varying features by using post-processing tools (e.g. PeptideProphet, Percolator, q-ranker etc.) which combine multiple different features.
  29. Summary: Statistical approaches. Single spectrum analyses (individual probabilities): • Expectation values • Similar to sequence similarity searching (BLAST). Global analysis (individual and global error rates): • Target-decoy strategy for global FDR • Distribution modeling (e.g. PeptideProphet, Percolator) for local and global FDR estimation.
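The target-decoy estimate of the global FDR can be sketched as the ratio of decoy to target matches above a score threshold. This assumes a concatenated search with equally sized target and decoy databases; the function name `fdr_at_threshold` and the toy scores are illustrative, not from the talk.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the global FDR among target matches scoring >= threshold.

    psms: list of (score, is_decoy) pairs, one per spectrum match.
    Each decoy hit above the threshold is taken to stand for one
    false target hit, so FDR is estimated as decoys / targets.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Toy matches, invented for illustration.
    psms = [(50, False), (45, False), (40, True), (35, False), (30, False), (20, True)]
    for cutoff in (45, 30):
        print(cutoff, fdr_at_threshold(psms, cutoff))
```

Sweeping the threshold gives the FDR as a function of the filtering cutoff, which is what a single-spectrum E-value or P-value does not provide.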
  30. Summary: Challenges. • Multiplexed spectra • Middle-down proteomics • X-linked spectra • PTMs: phosphorylation, multiple modifications and sites • PTM cross-talk elucidation
