BITS - Overview of sequence databases for mass spectrometry data analysis
 


This is the fourth presentation of the BITS training on 'Mass spec data processing'.

It reviews sequence databases and their flaws in light of mass spectrometry data analysis.

Thanks to the Compomics Lab of the VIB for their contribution.


Usage Rights: CC Attribution-ShareAlike License


    Presentation Transcript

    • http://www.bits.vib.be/training
    • Sequence databases. Lennart Martens (lennart.martens@ugent.be), Computational Omics and Systems Biology Group, Department of Medical Protein Research, VIB; Department of Biochemistry, Ghent University, Ghent, Belgium. BITS MS Data Processing – Sequence Databases, UGent, Gent, Belgium – 16 December 2011
    • PEPTIDES AND REDUNDANCY IN SEQUENCE DATABASES
    • Peptide-level sequence redundancy
      Non-redundant protein DB:
        >Protein 1
        LENNARTMARTENS
        >Protein 2
        LENNARTMARTENT
      The same proteins as peptides:
        >Protein 1 (1-6) LENNAR
        >Protein 1 (7-10) TMAR
        >Protein 1 (11-14) TENS
        >Protein 2 (1-6) LENNAR
        >Protein 2 (7-10) TMAR
        >Protein 2 (11-14) TENT
      A non-redundant protein DB is not the same as (≠) a non-redundant peptide DB.
      Database content: all peptide sequences in the database
      Database information: number of unique peptide sequences
      Database information ratio: database information / database content
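The information ratio defined on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `digest_after_arg` and `information_ratio` are hypothetical, cleavage is assumed to occur after every R (as in the slide's toy example), and no proline rule, missed cleavages or mass limits are applied.

```python
def digest_after_arg(protein):
    """Cut a protein sequence C-terminally of every R (toy rule, assumed)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue == "R":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal peptide that does not end in R.
        peptides.append(protein[start:])
    return peptides


def information_ratio(proteins):
    """Return (information, content, ratio) for a list of protein sequences."""
    content = [pep for prot in proteins for pep in digest_after_arg(prot)]
    information = set(content)  # unique peptide sequences only
    return len(information), len(content), len(information) / len(content)


information, content, ratio = information_ratio(
    ["LENNARTMARTENS", "LENNARTMARTENT"]
)
print(information, content, round(ratio, 2))
```

For the two proteins on the slide, six peptides are produced and four of them are unique, so the ratio is 4/6.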
    • Information ratios for common databases
      [Bar chart: database content and database information (number of peptide sequences, left axis, 0 to 12,000,000) and content information ratio (right axis, 0% to 100%) for UniProtKB/SwissProt human, UniProtKB/TrEMBL human, Ensembl human, IPI human and NCBI nr human.]
      Digestion settings: tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.
      Ratio labels on the chart: 93%, 45%, 41%, 42%, 23%.
      Bar labels, paired by ratio (content, information): 1,584,806 and 1,466,927 (93%); 3,491,778 and 1,559,685 (45%); 3,186,806 and 1,309,625 (41%); 4,472,356 and 1,877,500 (42%); 10,307,319 and 2,394,844 (23%).
    • ENRICHING SEQUENCE DATABASES
    • The influence of the sequence database
      [Diagram: a protein (N to C) undergoes in vivo processing. Enzymatic digest and subsequent NH2-terminal peptide isolation then yield a peptide that is not in the sequence database, so the search against the database gives an ID miss.]
    • An example: mitochondrial isovaleryl-CoA dehydrogenase
      MATATRLLGWRVASWRLRPPLAGFVS
      QRAHSLLPVDDAINGLSEEQRQLRE…
      …LDGIQCFGGNGYINDFPMGRFLRDA
      KLYEIGAGTSEVRRLVIGRAFNADFH
      N-terminal transit peptide: residues 1-29
      Isovaleryl-CoA dehydrogenase: residues 30-423
      Position labels on the slide: 30, 47 and 423
    • Extending the information content
      AHSLLPVDDAINGLSEEQR is expanded into its N-terminally truncated forms:
        AHSLLPVDDAINGLSEEQR
        HSLLPVDDAINGLSEEQR
        SLLPVDDAINGLSEEQR
        LLPVDDAINGLSEEQR
        LPVDDAINGLSEEQR
        PVDDAINGLSEEQR
        VDDAINGLSEEQR
        ……
      Search database: ID miss. Revised search database: ID.
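The truncation ladder on this slide (N-terminal ragging) can be sketched as follows. The function name `n_terminal_ragging` and the `min_length` cut-off are assumptions for illustration; the slide does not state where the ladder stops.

```python
def n_terminal_ragging(peptide, min_length=6):
    """Return the peptide plus all N-terminally truncated forms.

    min_length is an assumed lower bound on the peptide length kept.
    """
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


ragged = n_terminal_ragging("AHSLLPVDDAINGLSEEQR")
for sequence in ragged[:7]:
    print(sequence)
```

Adding these sequences to the search database means a peptide whose N-terminus was created in vivo can still be matched, at the cost of a larger database.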
    • Another example: in vivo protein cleavage
      [Diagram: a protein (NH2 to COOH) with several R residues and one D residue. Caspase cleavage of this protein (for 50%) leaves both the intact protein and its cleavage fragments. After NH2-terminal peptide isolation, one of the isolated NH2-terminal peptides is NOT IN DB!]
    • Solving the issue: bifunctional enzymes
      [Diagram: a peptide (NH2 to COOH, with an R) of which one terminus is the result of the in vivo protease and the other terminus is the result of in vitro trypsin.]
      Creation of a bifunctional enzyme will generate the correct peptides!
      Arg-C definition:
        Title:Arg-C
        Cleavage:R
        Restrict:P
        Cterm
      Arg-C (N-term), Cathepsin (C-term) definition:
        Title:dual ArgC_Cathep
        Cleavage:DX R
        Restrict:P
        Cterm Cterm
    • DBTOOLKIT AND DATABASE ON DEMAND
    • Working with databases: DBToolkit
      http://genesis.UGent.be/dbtoolkit
      See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585
    • Summary of DBToolkit functionalities
      a) Enzymatic digestion using regular or ‘dual’ enzymes → proteins to peptides
      b) N-terminal or C-terminal ragging → enhancing the information content of the database
      c) Non-lossy redundancy clearing → raising database information ratio
      d) Create shuffled and reversed databases → false-positives testing
      e) Extract sequence-based subsets → a priori prediction of potential success rate
      f) Map peptides back to proteins (maximal annotation approach) → find all matching proteins, and select primaries
      etc …
    • Database on Demand – DBToolkit online
      http://www.ebi.ac.uk/pride/dod
      See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424
    • WHY DOES PROCESSING MATTER?
    • Serum degradation over time
      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
    • Plasma degradation over time
      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
    • TIME-LABILITY OF SEQUENCE DATABASES
    • Example 1: HUPO PPP actualisation
      Bringing the PPP from IPI 2.21 to IPI 3.13
        1555 Total
        1048 Unchanged: 67%
        507 Changed: 33%, of which:
          338 Propagated: 22% of total, 67% of ‘Changed’
          169 Defunct: 11% of total, 33% of ‘Changed’, of which:
            95 Defunct (RFSQ_XP): 6% of total, 56% of ‘Defunct’
            72 Defunct (Ensembl): 5% of total, 43% of ‘Defunct’
            2 UniProt: 0% of total, 1% of ‘Defunct’ (both exist; 1 taxonomy now: RAT, 1 immunoglobulin)
        1048 + 338 = 1386 recoverable (89.1%)
      See: Martens and Mueller et al., Proteomics 2006, 6(18): 5059-75
    • Example 2: human blood platelets
      Bringing the Platelets from IPI 2.31 to IPI 3.13
        673 Total
        578 Unchanged: 86%
        95 Changed: 14%, of which:
          78 Propagated: 12% of total, 82% of ‘Changed’
          17 Defunct: 3% of total, 18% of ‘Changed’, of which:
            5 Defunct (RFSQ_XP): 1% of total, 29% of ‘Defunct’
            12 Defunct (Ensembl): 2% of total, 71% of ‘Defunct’
        578 + 78 = 656 recoverable (97%)
      See: Martens and Mueller et al., Proteomics 2006, 6(18): 5059-75
    • Proteins sometimes age badly
      Adapted from: http://www.ebi.ac.uk/ipi
    • THE PICR MAPPING SERVICE
    • Identifiers through (name)space and time
      http://www.ebi.ac.uk/tools/picr
      - Limit search by taxonomy (pessimistic)
      - Submit accessions OR sequences (FASTA), with a 500 entry interactive limit (no batch limit)
      - Choose to return all mappings or only active ones
      - Select output format
      - Select one or many databases to map to in one request
      - Run search
      See: Côté et al., BMC Bioinformatics 2007, 8: 401
    • Mapping results
    • ESTIMATING FALSE DISCOVERY RATES: THE DECOY DATABASE APPROACH
    • Decoy databases, the latest fashion
      Three main types of decoy DBs are used:
      - Reversed databases (easy): LENNARTMARTENS → SNETRAMTRANNEL
      - Shuffled databases (slightly more difficult): LENNARTMARTENS → NMERLANATERTTN (for instance)
      - Randomized databases (as difficult as you want it to be): LENNARTMARTENS → GFVLAEPHSEAITK (for instance)
      The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.
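The three decoy types listed on this slide can be sketched in Python as below. This is a minimal illustration, not how DBToolkit or any search engine does it: the function names are hypothetical, the randomized decoy assumes a uniform draw from the 20 standard amino acids, and a seeded `random.Random` is used only to make runs repeatable.

```python
import random

# The 20 standard amino acids (assumed alphabet for randomized decoys).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def reversed_decoy(sequence):
    """Reverse the sequence; deterministic and trivially reproducible."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Permute the residues; the amino acid composition is preserved."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Draw a new sequence of the same length, uniformly per position."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(2011)  # arbitrary seed, for repeatability only
target = "LENNARTMARTENS"
print(reversed_decoy(target))
print(shuffled_decoy(target, rng))
print(randomized_decoy(target, rng))
```

Reversal and shuffling keep length and composition identical to the target sequence, whereas a randomized decoy only keeps the length unless a composition model is added.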
    • Estimating the FDR (i)
      $$\mathrm{FDR} = \frac{2 \times \text{nbr\_decoy\_hits}}{\text{nbr\_forward\_hits} + \text{nbr\_decoy\_hits}}$$
      FDR is the False Discovery Rate. It is a metric that gives you an indication of how many (percent) of your identifications are potentially incorrect.
      Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.
      From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214
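As a worked example of estimator (i), the sketch below plugs counts into the formula. The function name `fdr_doubled` and the example counts are made up for illustration; only the formula itself comes from the slide.

```python
def fdr_doubled(nbr_forward_hits, nbr_decoy_hits):
    """Estimator (i): 2 * decoy hits / (forward hits + decoy hits)."""
    total_hits = nbr_forward_hits + nbr_decoy_hits
    if total_hits == 0:
        return 0.0  # no identifications at all, so nothing can be false
    return 2 * nbr_decoy_hits / total_hits


# Hypothetical search result: 990 forward hits and 10 decoy hits.
print(fdr_doubled(990, 10))
```

With 990 forward and 10 decoy hits the estimate is 20 / 1000, that is 2% of all reported identifications.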
    • Estimating the FDR (ii)
      $$\mathrm{FDR} = \frac{\text{nbr\_decoy\_hits}}{\text{nbr\_forward\_hits}}$$
      This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, but can be extended to also take into account the (suspected) false positives in the forward set.
      See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
      See: Käll et al., JPR 2008, 7(1): 29-34
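Estimator (ii) is typically applied at a score cut-off. The sketch below shows one way this could look, under assumptions that are not in the slides: hits are given as hypothetical `(score, is_decoy)` pairs, a higher score is better, and all scores are distinct. The function names are illustrative only.

```python
def fdr_simple(nbr_forward_hits, nbr_decoy_hits):
    """Estimator (ii): decoy hits / forward hits."""
    if nbr_forward_hits == 0:
        return 0.0 if nbr_decoy_hits == 0 else float("inf")
    return nbr_decoy_hits / nbr_forward_hits


def score_threshold_for_fdr(hits, max_fdr):
    """Return the lowest score cut-off whose accepted set stays within max_fdr.

    hits: iterable of (score, is_decoy) pairs with distinct scores (assumed).
    Returns None when no cut-off satisfies the requested FDR.
    """
    nbr_forward_hits = nbr_decoy_hits = 0
    threshold = None
    # Walk from the best to the worst score, accepting one more hit per step.
    for score, is_decoy in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if is_decoy:
            nbr_decoy_hits += 1
        else:
            nbr_forward_hits += 1
        if nbr_forward_hits > 0 and fdr_simple(nbr_forward_hits, nbr_decoy_hits) <= max_fdr:
            threshold = score
    return threshold


# Hypothetical scored hits: one decoy among five forward identifications.
hits = [(90, False), (80, False), (70, False), (60, True), (50, False), (40, False)]
print(score_threshold_for_fdr(hits, max_fdr=0.25))
```

Because the estimate can dip back under the limit further down the ranked list, the loop keeps scanning instead of stopping at the first violation.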
    • Thank you! Questions?