BITS - Search engines for mass spec data

This is the third presentation of the BITS training on 'Mass spec data processing'.

It reviews methods for matching mass spectrometry data to protein sequences and surveys useful tools.

Thanks to the Compomics Lab of the VIB for contribution.

Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript

  • http://www.bits.vib.be/training
  • Search engines. Lennart Martens (lennart.martens@ugent.be), Computational Omics and Systems Biology Group, Department of Medical Protein Research, VIB, and Department of Biochemistry, Ghent University, Ghent, Belgium; (lennart.martens@ebi.ac.uk) Proteomics Services Group, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom, www.ebi.ac.uk. BITS MS Data Processing – Search Engines, UGent, Gent, Belgium – 19 September 2011
  • THREE TYPICAL PRE-PROCESSING STEPS
  • Noise thresholding. Figure panels: global thresholding and local thresholding, each with the precursor peak marked.
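The difference between the two thresholding modes can be illustrated with a minimal sketch. The 5% fraction, the 100 m/z window and the toy spectrum are assumptions chosen for illustration; they do not come from the slides.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

def global_threshold(peaks: List[Peak], fraction: float = 0.05) -> List[Peak]:
    """Keep peaks whose intensity is at least `fraction` of the base peak."""
    if not peaks:
        return []
    cutoff = fraction * max(inten for _, inten in peaks)
    return [(mz, inten) for mz, inten in peaks if inten >= cutoff]

def local_threshold(peaks: List[Peak], window: float = 100.0,
                    fraction: float = 0.05) -> List[Peak]:
    """Keep peaks at least `fraction` of the strongest peak within +/- window/2."""
    kept = []
    for mz, inten in peaks:
        local_max = max(i for m, i in peaks if abs(m - mz) <= window / 2)
        if inten >= fraction * local_max:
            kept.append((mz, inten))
    return kept

# Toy spectrum: a dominant peak (e.g. a residual precursor) at m/z 1200.
spectrum = [(200.0, 20.0), (210.0, 200.0), (800.0, 12.0),
            (805.0, 0.3), (1200.0, 5000.0)]
global_kept = global_threshold(spectrum)
local_kept = local_threshold(spectrum)
```

With a single dominant peak, the global cut-off discards most fragment peaks, whereas the local cut-off judges each peak only against its neighbourhood.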
  • Charge deconvolution (peptides). From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
  • Charge deconvolution (proteins). From: Gill et al., EMBO Journal, 2000
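The arithmetic behind charge deconvolution can be sketched as follows, assuming positive-mode ions that carry one proton per charge. The function names and the example mass of 16950 Da are illustrative assumptions.

```python
PROTON = 1.007276  # mass of a proton in Da

def mz_of(mass: float, charge: int) -> float:
    """m/z of a molecule of neutral `mass` carrying `charge` protons."""
    return mass / charge + PROTON

def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass recovered from an observed m/z and a known charge."""
    return charge * (mz - PROTON)

def charge_from_adjacent(mz_low: float, mz_high: float) -> int:
    """Charge z of the peak at mz_high, assuming the neighbouring peak at
    mz_low belongs to the same molecule with charge z + 1."""
    return round((mz_low - PROTON) / (mz_high - mz_low))

# Two neighbouring peaks of a hypothetical 16950 Da protein envelope.
mz10, mz11 = mz_of(16950.0, 10), mz_of(16950.0, 11)
z = charge_from_adjacent(mz11, mz10)
mass = neutral_mass(mz10, z)
```

For peptides the charge is usually read from the isotope spacing instead (about 1/z between isotope peaks); `neutral_mass` applies unchanged once the charge is known.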
  • Centroiding (peak picking). Figure labels: monoisotopic mass, average mass.
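A minimal sketch of centroiding, assuming a profile peak given as (m/z, intensity) points and using an intensity-weighted mean. The same weighted mean applied to a centroided isotope envelope gives the average mass, while the lowest-m/z isotope peak gives the monoisotopic mass. All numbers are made up for illustration.

```python
from typing import List, Tuple

def centroid(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (intensity-weighted mean m/z, summed intensity) of a peak."""
    total = sum(inten for _, inten in points)
    mean_mz = sum(mz * inten for mz, inten in points) / total
    return mean_mz, total

# A symmetric profile peak collapses to a single centroided peak.
profile = [(500.0, 1.0), (500.1, 4.0), (500.2, 1.0)]
peak_mz, peak_intensity = centroid(profile)

# A singly charged isotope envelope (already centroided, hypothetical values).
envelope = [(1000.0, 100.0), (1001.0, 60.0), (1002.0, 20.0)]
monoisotopic = min(mz for mz, _ in envelope)
average, _ = centroid(envelope)
```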
  • Combined results: a total ion current chromatogram, corrected by typical pre-processing steps. From: Last et al., Nature Rev. Mol. Cell Bio., 2007
  • Data size reduction. Figure: bar chart of file size (MB) per data type (RAW, RAW GZIPped, peak lists, peak lists GZIPped) for the Q-TOF II and Esquire HCT instruments. Bar labels as extracted: 51.4, 24.5, 25.8, 23.7, 0.7, 0.2, 0.3, 0.1. See: Martens et al., Proteomics, 2005
  • MS/MS IDENTIFICATION: PEPTIDE FRAGMENTATION FINGERPRINTING
  • Peptide sequences and MS/MS spectra. Figure: MS/MS spectrum (intensity versus m/z) of the peptide LENNART, with peaks annotated by its prefix fragments (L, LE, LEN, LENN, LENNA, LENNAR, LENNART) and suffix fragments (T, RT, ART, NART, NNART, ENNART, LENNART).
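The fragment ladder for LENNART can be computed with a short sketch, assuming singly charged b ions (prefixes) and y ions (suffixes) and standard monoisotopic residue masses. Only the residues needed for this peptide are listed.

```python
# Monoisotopic residue masses (Da) for the residues in LENNART.
MONO = {"L": 113.08406, "E": 129.04259, "N": 114.04293,
        "A": 71.03711, "R": 156.10111, "T": 101.04768}
WATER = 18.010565
PROTON = 1.007276

def b_ions(peptide: str):
    """Singly charged prefix (b) ions: residue sum plus one proton."""
    return [(peptide[:k], sum(MONO[a] for a in peptide[:k]) + PROTON)
            for k in range(1, len(peptide))]

def y_ions(peptide: str):
    """Singly charged suffix (y) ions: residue sum plus water plus one proton."""
    return [(peptide[k:], sum(MONO[a] for a in peptide[k:]) + WATER + PROTON)
            for k in range(1, len(peptide))]

b = b_ions("LENNART")  # L, LE, LEN, LENN, LENNA, LENNAR
y = y_ions("LENNART")  # ENNART, NNART, NART, ART, RT, T
```

Each complementary pair (for example b1 = L and y6 = ENNART) sums to the neutral peptide mass plus two protons, which is a handy consistency check when annotating a spectrum.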
  • Peptide fragment fingerprinting (PFF). Workflow: a protein sequence database is digested in silico into peptide sequences (YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …); in silico MS/MS turns these into theoretical MS/MS spectra (intensity versus m/z); in silico matching against the experimental MS/MS spectrum yields peptide scores: 1) YSFVATAER 34, 2) YSFVSAIR 12, 3) FFLIGGGGK 12.
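The PFF workflow can be sketched end to end under simplifying assumptions: trypsin cleaves after K or R unless followed by P, theoretical spectra contain only singly charged b and y ions, and the score is a plain shared-peak count (real engines use more elaborate scores). The protein string and all function names are hypothetical; splitting on an empty regex match requires Python 3.7 or later.

```python
import re

MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER, PROTON = 18.010565, 1.007276

def tryptic_digest(protein: str, min_len: int = 6):
    """Cleave after K or R, but not before P; drop very short peptides."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]

def fragment_mz(peptide: str):
    """Theoretical singly charged b- and y-ion m/z values."""
    masses = [MONO[a] for a in peptide]
    b = [sum(masses[:k]) + PROTON for k in range(1, len(peptide))]
    y = [sum(masses[k:]) + WATER + PROTON for k in range(1, len(peptide))]
    return b + y

def shared_peaks(theoretical, observed, tol: float = 0.02) -> int:
    """Number of theoretical peaks with an observed peak within `tol`."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)

def rank(candidates, observed):
    """Score every candidate peptide and sort best-first."""
    scored = [(p, shared_peaks(fragment_mz(p), observed)) for p in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

protein = "MKYSFVATAERHETSINGKSEFASTPINK"  # hypothetical database entry
peptides = tryptic_digest(protein)
observed = fragment_mz("YSFVATAER")       # stand-in for an experimental spectrum
ranking = rank(peptides, observed)
```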
  • Three types of PFF identification. Spectral comparison: a database sequence is converted to a theoretical spectrum, which is compared with the experimental spectrum. Sequential comparison: the experimental spectrum is converted to a de novo sequence, which is compared with the database sequence. Threading comparison: the database sequence is threaded onto the experimental spectrum. From: Eidhammer, Flikka, Martens, Mikalsen, Wiley 2007
  • The most popular algorithms
      • MASCOT (Matrix Science) http://www.matrixscience.com
      • SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
      • X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
      • OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
  • Overall concept of scores and cut-offs. Figure labels: incorrect identifications, correct identifications, threshold score, false negatives, false positives. Adapted from: www.proteomesoftware.com, Wiki pages
  • Playing with probabilistic cut-off scores. Figure: identifications (0% to 100%) and false positives (0% to 6%) at increasingly higher stringency: p=0.05, p=0.01, p=0.005, p=0.0005.
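The trade-off behind a score threshold can be made concrete with a small sketch. It assumes the true status of every identification is known, which in practice it is not, and the score lists are invented for illustration.

```python
def outcome_at(threshold, correct_scores, incorrect_scores):
    """Count accepted and rejected identifications at a score threshold."""
    true_pos = sum(s >= threshold for s in correct_scores)
    false_pos = sum(s >= threshold for s in incorrect_scores)
    return {
        "true_positives": true_pos,
        "false_negatives": len(correct_scores) - true_pos,
        "false_positives": false_pos,
    }

# Overlapping, made-up score distributions.
correct = [55, 48, 40, 33, 27]
incorrect = [30, 22, 18, 12, 9, 5]

lenient = outcome_at(25, correct, incorrect)
strict = outcome_at(35, correct, incorrect)
```

Raising the threshold from 25 to 35 removes the false positive but turns two correct identifications into false negatives, which is the behaviour the stringency plot shows.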
  • SEQUEST
      • Very well established search engine
      • Can be used for MS/MS (PFF) identifications
      • Based on a cross-correlation score (includes peak height)
      • Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
      • Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top two ranks (deltaCn, ∆Cn)
      • Thresholding is up to the user, and is commonly done per charge state
      • Many extensions exist to perform a more automatic validation of results
    $$R_\tau = \sum_{i=1}^{n} x_i \cdot y_{i+\tau}$$
    $$\mathrm{XCorr} = R_0 - \frac{1}{151} \sum_{\tau=-75}^{+75} R_\tau$$
    $$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$
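The cross-correlation score can be sketched directly from its definition: the correlation at zero shift minus the mean correlation over shifts from -75 to +75 bins. The sketch assumes both spectra are already binned into equal-length intensity vectors and treats out-of-range bins as zero; SEQUEST's own preprocessing and normalisation are not reproduced here.

```python
def correlation(x, y, tau):
    """R_tau: dot product of x with y shifted by tau bins (zero outside range)."""
    n = len(x)
    return sum(x[i] * y[i + tau] for i in range(n) if 0 <= i + tau < n)

def xcorr(x, y, max_shift=75):
    """XCorr = R_0 minus the mean of R_tau over tau in [-max_shift, +max_shift]."""
    shifts = range(-max_shift, max_shift + 1)
    background = sum(correlation(x, y, t) for t in shifts) / len(shifts)
    return correlation(x, y, 0) - background

def delta_cn(xcorr_first, xcorr_second):
    """Relative XCorr difference between the top two ranked candidates."""
    return (xcorr_first - xcorr_second) / xcorr_first

# Toy binned spectrum with two peaks, compared against itself.
binned = [0.0] * 200
binned[50] = 1.0
binned[120] = 1.0
score = xcorr(binned, binned)
```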
  • SEQUEST: some additional pictures. From: MacCoss et al., Anal. Chem. 2002; Peng et al., J. Prot. Res. 2002
  • Mascot
      • Very well established search engine, Perkins, Electrophoresis 1999
      • Can do MS (PMF) and MS/MS (PFF) identifications
      • Based on the MOWSE score
      • Unpublished core algorithm (trade secret)
      • Predicts an a priori threshold score that identifications need to pass
      • From version 2.2, Mascot allows integrated decoy searches
      • Provides rank, score, threshold and expectation value per identification
      • Customizable confidence level for the threshold score
  • Mascot: some additional pictures. Figure 1: average identity threshold versus log10(number of AA), with linear fit y = 8.3761x - 34.089, R² = 0.9985. Figure 2: identifications (0% to 100%) and false positives (0% to 6%) at p=0.05, p=0.01, p=0.005, p=0.0005.
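The linear fit reported in the first figure can be evaluated with a short sketch. It only restates the empirical regression shown on the slide (x is the log10 of the number of amino acids in the database); it is not Mascot's internal threshold calculation, which is unpublished, and the function name is hypothetical.

```python
import math

def average_identity_threshold(n_amino_acids: float) -> float:
    """Evaluate the slide's fit y = 8.3761 * log10(number of AA) - 34.089."""
    return 8.3761 * math.log10(n_amino_acids) - 34.089

small_db = average_identity_threshold(1e7)
large_db = average_identity_threshold(1e8)
```

Under this fit, a tenfold larger database raises the average identity threshold by about 8.4 score units.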
  • X!Tandem
      • A successful open source search engine, Craig and Beavis, RCMS 2003
      • Can be used for MS/MS (PFF) identifications
      • Based on a hyperscore ($P_i$ is either 0 or 1):
        $$\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$$
      • Relies on a hypergeometric distribution (hence hyperscore)
      • Published core algorithm, and is freely available
      • Provides hyperscore and expectancy score (the discriminating one)
      • X!Tandem is fast and can handle modifications in an iterative fashion
      • Has rapidly gained popularity as (auxiliary) search engine
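The hyperscore formula translates almost literally into code. In this sketch, `intensities` holds the peak intensities I_i, `matched` holds the 0/1 indicators P_i, and `n_b` and `n_y` are the numbers of matched b and y ions; the example values are invented.

```python
from math import factorial

def hyperscore(intensities, matched, n_b: int, n_y: int) -> float:
    """HyperScore = (sum of I_i * P_i) * N_b! * N_y!"""
    dot = sum(i * p for i, p in zip(intensities, matched))
    return dot * factorial(n_b) * factorial(n_y)

# Four peaks, three of them matched: two b ions and one y ion.
score = hyperscore([10, 5, 8, 2], [1, 0, 1, 1], n_b=2, n_y=1)
```

The factorial terms reward candidates that match many ions of each series far more strongly than the summed intensity alone would.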
  • X!Tandem: some additional pictures. Figures: number of results versus hyperscore; log(# results) versus hyperscore; linear extrapolation of log(# results) versus hyperscore with the significance threshold marked (example: E-value = e^-8.2). Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
  • A note on how the scores differ. SEQUEST: accuracy score XCorr, relative score DeltaCn. X!Tandem: accuracy score HyperScore, relative score E-Value. Adapted from: Brian Searle, ProteomeSoftware
  • OMSSA
      • A successful open source search engine, Geer, JPR 2004
      • Can be used for MS/MS (PFF) identifications
      • Relies on a Poisson distribution
      • Published core algorithm, and is freely available
      • Provides an expectancy score, similar to the BLAST E-value
      • OMSSA was recently upgraded to take peak intensity into account
      • Got really good marks in a recently published comparative study
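How a Poisson model leads to an expectation value can be sketched as follows. The sketch assumes the mean number of random matches `mu` is already known and multiplies the tail probability by the number of candidate peptides; how OMSSA derives `mu` from the mass tolerance and the spectrum is not reproduced here, and the function names are hypothetical.

```python
import math

def poisson_tail(k: int, mu: float) -> float:
    """P(X >= k) for X ~ Poisson(mu): chance of k or more random matches."""
    below = sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k))
    return 1.0 - below

def e_value(k: int, mu: float, n_candidates: int) -> float:
    """Expected number of candidates reaching k matches by chance alone."""
    return n_candidates * poisson_tail(k, mu)

# With 2 random matches expected on average, 10 matches is far less likely
# to arise by chance than 5 matches.
weak = e_value(5, 2.0, 1000)
strong = e_value(10, 2.0, 1000)
```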
  • OMSSA: some additional pictures. Figure 1: yeast lysate spectrum, m/z matches of fragment peak matches versus all NCBI nr sequence library; Poisson distribution fitted. Figure 2: validation of the Poisson distribution model: mean number of modelled and measured matching peaks (against the NCBI nr database) for two mass tolerances. Adapted from: Geer et al., J. Prot. Res., 2004
  • COMPARATIVE STUDIES
  • Kapp et al., Proteomics, 2005
  • Balgley et al., Mol. Cell. Proteomics, 2007. 1.6x more?!
  • Combining the output of search algorithms. Figure: four-way Venn diagram of identifications from Mascot, SEQUEST, ProteinSolver and Phenyx. Labels as extracted: 3229, 3792, 212, 486 (+4,2%) (+9,6%), 3203, 179, 168, 40, 3186, 329, 380 (+6,5%), 501, 348 (+7,5%), 1776, 139, 96, 195, 77, 146. Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
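The regions of such a Venn diagram are plain set operations. The sketch below uses three hypothetical engines and made-up peptide identifiers, not the data from the figure.

```python
def combine(results):
    """Return (union, consensus, unique-per-engine) for engine -> set of IDs."""
    union = set().union(*results.values())
    consensus = set.intersection(*results.values())
    unique = {}
    for name, ids in results.items():
        others = set().union(*(o for n, o in results.items() if n != name))
        unique[name] = ids - others
    return union, consensus, unique

results = {
    "engine_a": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
    "engine_b": {"PEPTIDEB", "PEPTIDEC", "PEPTIDED"},
    "engine_c": {"PEPTIDEC", "PEPTIDEE"},
}
union, consensus, unique = combine(results)
```

The union shows what is gained by running several engines, the intersection shows the identifications all engines agree on, and the per-engine unique sets correspond to the outer regions of the diagram.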
  • SEQUENTIAL COMPARISON ALGORITHMS
  • Sequence tags. The concept of sequence tags was introduced by Mann and Wilm (Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399). Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
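A sequence tag combines a short stretch of sequence with the masses flanking it. The sketch below assumes a simplified tag of the form (prefix residue mass, tag sequence, suffix residue mass) and checks database peptides against it; the function name, tolerance and example peptides are illustrative assumptions, and real tools work from fragment m/z values rather than bare residue sums.

```python
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}

def matches_tag(peptide, prefix_mass, tag, suffix_mass, tol=0.5):
    """True if `tag` occurs in `peptide` with matching flanking residue masses."""
    start = peptide.find(tag)
    while start != -1:
        before = sum(MONO[a] for a in peptide[:start])
        after = sum(MONO[a] for a in peptide[start + len(tag):])
        if abs(before - prefix_mass) <= tol and abs(after - suffix_mass) <= tol:
            return True
        start = peptide.find(tag, start + 1)
    return False

# Tag "VAT" read from a spectrum, flanked by about 397.16 Da and 356.18 Da.
database = ["YSFVATAER", "YSFVSAIR", "GGVATAER"]
hits = [p for p in database if matches_tag(p, 397.16, "VAT", 356.18)]
```

Relaxing the check on one flanking mass is what lets tag-based searches recover peptides carrying an unexpected modification on that side.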
  • GutenTag, DirecTag, TagRecon
      • Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
      • Recent implementations of the sequence tag approach
      • Refine hits by peak mapping in a second stage to resolve ambiguities
      • Rely on an empirical fragmentation model
      • Published core algorithms, DirecTag and TagRecon freely available
      • Most useful to retrieve unexpected peptides (modifications, variations)
      • Entire workflows exist (e.g., combination with IDPicker)
  • GutenTag: some additional pictures. From: Tabb et al., Anal. Chem., 2003
  • De novo compared to sequence tags. Example of a manual de novo of an MS/MS spectrum. No more database necessary to extract a sequence! Algorithms and references:
      • Lutefisk: Dancik 1999, Taylor 2000
      • Sherenga: Fernandez-de-Cossio 2000
      • PEAKS: Ma 2003, Zhang 2004
      • PepNovo: Frank 2005, Grossmann 2005
      • …
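Manual de novo sequencing amounts to reading residues from the mass differences between consecutive peaks of one ion series. The sketch below assumes a clean, complete ladder of singly charged ions and a fixed tolerance; real algorithms must cope with missing peaks, noise and mixed ion series.

```python
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
PROTON = 1.007276

def read_ladder(peaks, tol=0.02):
    """Translate consecutive peak differences into residues ('?' if none fit)."""
    residues = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        hits = [aa for aa, mass in MONO.items() if abs(mass - delta) <= tol]
        residues.append("/".join(hits) if hits else "?")
    return residues

# b-ion ladder of LENNART (b1 to b6), built here for demonstration.
ladder, running = [], PROTON
for aa in "LENNAR":
    running += MONO[aa]
    ladder.append(running)

sequence = read_ladder(ladder)
```

Leucine and isoleucine have identical masses, so a 113.084 Da step is reported as the ambiguous "L/I".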
  • Thank you! Questions?