http://www.bits.vib.be/training
sequence databases



                                                       lennart martens
                                                 lennart.martens@ugent.be


                                 Computational Omics and Systems Biology Group
                                     Department of Medical Protein Research, VIB
                                     Department of Biochemistry, Ghent University
Lennart Martens             BITS MS Data Processing – Sequence Databases
                                                        Ghent, Belgium
lennart.martens@ugent.be      UGent, Gent, Belgium – 16 December 2011
PEPTIDES AND REDUNDANCY
                     IN SEQUENCE DATABASES




Peptide-level sequence redundancy

     A non-redundant protein DB ≠ a non-redundant peptide DB

     Non-redundant protein DB:

       >Protein 1
       LENNARTMARTENS
       >Protein 2
       LENNARTMARTENT

     Peptide DB after digestion:

       >Protein 1 (1-6)      LENNAR
       >Protein 1 (7-10)     TMAR
       >Protein 1 (11-14)    TENS
       >Protein 2 (1-6)      LENNAR    = Protein 1 (1-6)
       >Protein 2 (7-10)     TMAR      = Protein 1 (7-10)
       >Protein 2 (11-14)    TENT

     Database content:             all peptide sequences in the database
     Database information:         number of unique peptide sequences
     Database information ratio:   database information / database content
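
The ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, not DBToolkit code: the helper names `tryptic_digest` and `information_ratio` are made up here, and the digest assumes simple trypsin rules (cleave after K or R unless followed by P) with no missed cleavages and no mass limits.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except before P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else None
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def information_ratio(proteins):
    """Return (content, information, ratio) for a list of protein sequences."""
    peptides = [pep for seq in proteins for pep in tryptic_digest(seq)]
    content = len(peptides)           # all peptide sequences
    information = len(set(peptides))  # unique peptide sequences
    return content, information, information / content


proteins = ["LENNARTMARTENS", "LENNARTMARTENT"]
content, information, ratio = information_ratio(proteins)
print(content, information, round(ratio, 2))  # 6 4 0.67
```

For the two example proteins, six peptides carry only four unique sequences, so a third of the peptide database is redundant.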
Information ratios for common databases
 Tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.

 Database                         Content   Information   Ratio
 UniProtKB/SwissProt human      1,584,806     1,466,927     93%
 UniProtKB/TrEMBL human         3,186,806     1,309,625     41%
 Ensembl human                  3,491,778     1,559,685     45%
 IPI human                      4,472,356     1,877,500     42%
 NCBI nr human                 10,307,319     2,394,844     23%



ENRICHING SEQUENCE DATABASES




The influence of the sequence database


     Protein (N … C)
        → In vivo processing
     Processed protein (N … C)
        → Enzymatic digest and subsequent NH2-terminal peptide isolation
     NH2-terminal peptide of the processed protein
        → Search base: ID miss

                                         Not in the sequence database!




An example

                  Mitochondrial Isovaleryl-CoA Dehydrogenase

             MATATRLLGWRVASWRLRPPLAGFVS
                            N-terminal transit peptide (1-29)
                     30                                                   47
             QRAHSLLPVDDAINGLSEEQRQLRE…
                                Isovaleryl-CoA dehydrogenase (30 – 423)

             …LDGIQCFGGNGYINDFPMGRFLRDA
                                                                               423
             KLYEIGAGTSEVRRLVIGRAFNADFH


Extending the information content

   Original database peptide:             N-terminally ragged peptides:

   AHSLLPVDDAINGLSEEQR                    AHSLLPVDDAINGLSEEQR
                                           HSLLPVDDAINGLSEEQR
                                            SLLPVDDAINGLSEEQR
                                             LLPVDDAINGLSEEQR
                                              LPVDDAINGLSEEQR
                                               PVDDAINGLSEEQR
                                                VDDAINGLSEEQR
                                                           ……

   Search base: ID miss                   Revised search base: ID
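
N-terminal ragging as shown above can be sketched as follows. This is only an illustration of the idea: the function name `n_terminal_ragging` and the `min_length` cut-off are assumptions made for this example, not DBToolkit settings.

```python
def n_terminal_ragging(peptide, min_length=6):
    """Return the peptide and all N-terminally truncated forms.

    min_length is a hypothetical cut-off that stops the ladder before
    the peptides become too short to be useful.
    """
    return [peptide[i:] for i in range(len(peptide) - min_length + 1)]


ragged = n_terminal_ragging("AHSLLPVDDAINGLSEEQR")
for sequence in ragged[:3]:
    print(sequence)
# AHSLLPVDDAINGLSEEQR
# HSLLPVDDAINGLSEEQR
# SLLPVDDAINGLSEEQR
```

The second entry, HSLLPVDDAINGLSEEQR, starts at residue 30 and is therefore the N-terminal peptide of the mature protein once the transit peptide (1-29) has been removed.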




Another example: in vivo protein cleavage

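     Protein (NH2 … COOH) with several Arg (R) sites and one Asp (D) site
        → Caspase cleavage of this protein (for 50%)
     Intact protein (NH2 … COOH) plus two cleavage fragments (each NH2 … COOH)
        → NH2-terminal peptide isolation
     Two NH2-terminal peptides, each ending in R:
        - the original protein N-terminus
        - the new N-terminus created by the caspase cleavage:  NOT IN DB!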


Solving the issue: bifunctional enzymes

     Isolated peptide:   NH2 … R COOH
                         NH2 end: result of in vivo protease
                         COOH end (R): result of in vitro trypsin

      Creation of a bifunctional enzyme will generate the correct peptides!

     Arg-C definition:              Arg-C (N-term), Cathepsin (C-term) definition:

       Title:Arg-C                    Title:dual ArgC_Cathep
       Cleavage:R                     Cleavage:DX R
       Restrict:P                     Restrict:P
       Cterm                          Cterm
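
The idea of a bifunctional enzyme can be sketched in Python as below. This is one possible reading of the concept, not DBToolkit's implementation, and it does not parse the `Cleavage:DX R` definition syntax shown on the slide. The function name, its parameters, and the test sequence are all made up for illustration: the peptide N-terminus is assumed to come from a protease cleaving after D, and the C-terminus from a protease cleaving after R unless P follows.

```python
def dual_enzyme_peptides(sequence, nterm_sites="D", cterm_sites="R", restrict="P"):
    """Hypothetical helper: peptides that start right after an nterm_sites
    residue and end at the first following cterm_sites residue that is not
    followed by a restrict residue."""
    peptides = []
    for i in range(len(sequence) - 1):
        if sequence[i] not in nterm_sites:
            continue
        start = i + 1
        end = len(sequence)  # fall back to the protein C-terminus
        for j in range(start, len(sequence)):
            followed_by_restrict = (
                j + 1 < len(sequence) and sequence[j + 1] in restrict
            )
            if sequence[j] in cterm_sites and not followed_by_restrict:
                end = j + 1
                break
        peptides.append(sequence[start:end])
    return peptides


# Made-up sequence, for illustration only.
print(dual_enzyme_peptides("AAARGGDLLRPEERTT"))  # ['LLRPEER']
```

A plain Arg-C style digest of the same made-up sequence would give AAAR, GGDLLRPEER and TT, so the peptide LLRPEER only appears when both cleavage rules are combined.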

DBTOOLKIT AND
                        DATABASE ON DEMAND




Working with databases: DBToolkit
                            http://genesis.UGent.be/dbtoolkit




                            See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585

Summary of DBToolkit functionalities

         a) Enzymatic digestion using regular or ‘dual’ enzymes
                 proteins to peptides
         b) N-terminal or C-terminal ragging
                  enhancing the information content of the database
         c) Non-lossy redundancy clearing
                 raising database information ratio
         d) Create shuffled and reversed databases
                  false-positives testing
         e) Extract sequence-based subsets
                  a priori prediction of potential success rate
         f) Map peptides back to proteins (maximal annotation approach)
                  find all matching proteins, and select primaries
            etc …
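
As an illustration of item c), here is a minimal sketch of what 'non-lossy' redundancy clearing could look like: identical sequences are collapsed into one entry, but all of their headers are kept. This is an assumed reading of the term, not DBToolkit's actual code; the function name and the ` | ` header separator are choices made for this example.

```python
def clear_redundancy(entries):
    """Collapse identical sequences while keeping every header.

    entries: iterable of (header, sequence) tuples.
    Returns a list of (merged_header, sequence) tuples in first-seen order.
    """
    headers_by_sequence = {}
    for header, sequence in entries:
        headers_by_sequence.setdefault(sequence, []).append(header)
    return [(" | ".join(headers), sequence)
            for sequence, headers in headers_by_sequence.items()]


entries = [
    ("Protein 1 (1-6)", "LENNAR"),
    ("Protein 1 (7-10)", "TMAR"),
    ("Protein 1 (11-14)", "TENS"),
    ("Protein 2 (1-6)", "LENNAR"),
    ("Protein 2 (7-10)", "TMAR"),
    ("Protein 2 (11-14)", "TENT"),
]
for header, sequence in clear_redundancy(entries):
    print(f">{header}\n{sequence}")
```

After clearing, every remaining sequence is unique, so the information ratio of the resulting database is 100%, while the merged headers still record which proteins each peptide came from.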

Database on Demand – DBToolkit online
                            http://www.ebi.ac.uk/pride/dod




                            See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424

WHY DOES PROCESSING MATTER?




Serum degradation over time




                      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

Plasma degradation over time




                      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781

TIME-LABILITY OF
                            SEQUENCE DATABASES




Example 1: HUPO PPP actualisation
              Bringing the PPP from IPI 2.21 to IPI 3.13

              1555   Total
              1048   Unchanged                   67%
               507   Changed                     33%
                  Of which:
                     338   Propagated            22%     67% (of ‘Changed’)
                     169   Defunct               11%     33% (of ‘Changed’)
                        Of which:
                            95   Defunct (RFSQ_XP)    6%     56% (of ‘Defunct’)
                            72   Defunct (Ensembl)    5%     43% (of ‘Defunct’)
                             2   UniProt              0%      1% (of ‘Defunct’)
                                 (both exist: 1 taxonomy now RAT, 1 immunoglobulin)

                  1048 + 338 = 1386 recoverable (89.1%)

                       See: Martens and Mueller et al., Proteomics 2006, 6(18): 5059-75

Example 2: human blood platelets
              Bringing the Platelets from IPI 2.31 to IPI 3.13

               673   Total
               578   Unchanged                   86%
                95   Changed                     14%
                  Of which:
                      78   Propagated            12%     82% (of ‘Changed’)
                      17   Defunct                3%     18% (of ‘Changed’)
                        Of which:
                             5   Defunct (RFSQ_XP)    1%     29% (of ‘Defunct’)
                            12   Defunct (Ensembl)    2%     71% (of ‘Defunct’)

                   578 + 78 = 656 recoverable (97%)

                       See: Martens and Mueller et al., Proteomics 2006, 6(18):5059-75
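
The 'recoverable' figures in both examples are the unchanged entries plus the propagated entries, as a fraction of the total. A small sketch of that arithmetic, using the counts from the two slides (the function name `recoverable` is made up for this example):

```python
def recoverable(unchanged, propagated, total):
    """Return (count, percentage) of identifiers that survive the IPI update."""
    count = unchanged + propagated
    return count, 100.0 * count / total


ppp_count, ppp_percent = recoverable(unchanged=1048, propagated=338, total=1555)
plt_count, plt_percent = recoverable(unchanged=578, propagated=78, total=673)

print(f"HUPO PPP:  {ppp_count} recoverable ({ppp_percent:.1f}%)")
print(f"Platelets: {plt_count} recoverable ({plt_percent:.1f}%)")
```

The remaining entries (169 for the PPP set, 17 for the platelet set) are the defunct identifiers that cannot be carried over automatically.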

Proteins sometimes age badly




                               Adapted from: http://www.ebi.ac.uk/ipi

THE PICR MAPPING SERVICE




Identifiers through (name)space and time
                      http://www.ebi.ac.uk/tools/picr

           - Submit accessions OR sequences (FASTA) with 500 entry interactive
             limit (no batch limit)
           - Limit search by taxonomy (pessimistic)
           - Choose to return all mappings or only active ones
           - Select one or many databases to map to in one request
           - Select output format
           - Run search

                            See: Côté et al., BMC Bioinformatics 2007, 8: 401

Mapping results




ESTIMATING FALSE DISCOVERY RATES
           THE DECOY DATABASE APPROACH




Decoy databases, the latest fashion
 Three main types of decoy DBs are used:

            - Reversed databases (easy)
                       LENNARTMARTENS → SNETRAMTRANNEL

            - Shuffled databases (slightly more difficult)
                       LENNARTMARTENS → NMERLANATERTSN                      (for instance)

            - Randomized databases (as difficult as you want it to be)
                       LENNARTMARTENS → GFVLAEPHSEAITK                      (for instance)



 The concept is that each peptide identified from the decoy database is an incorrect
 identification. By counting the number of decoy hits, we can estimate the number of
 false positives in the original database.
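
The three decoy types can be sketched in Python as follows. This is a minimal illustration, not the method of any particular tool: the function names are made up, the randomized variant simply draws uniformly from the 20 standard amino acids (real implementations often follow the amino acid frequencies of the forward database), and the `seed` parameter is only there to make the output repeatable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def reversed_decoy(sequence):
    """Reverse the sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=None):
    """Shuffle the residues; the amino acid composition is preserved."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, seed=None):
    """Draw a random sequence of the same length (uniform residue choice)."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


forward = "LENNARTMARTENS"
print(reversed_decoy(forward))  # SNETRAMTRANNEL
print(shuffled_decoy(forward, seed=1))
print(randomized_decoy(forward, seed=1))
```

Reversal and shuffling keep the length and composition of each protein, whereas the randomized variant keeps only the length.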

Estimating the FDR (i)



 \[ \mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}} \]


 FDR is the False Discovery Rate: a metric that indicates what percentage of your
 identifications are potentially incorrect. Note that we multiply the number of decoy
 hits by 2, because we should count not only the actual decoy hits, but also the
 ‘hidden’ false positives that are present in the forward identifications. The
 assumption here is that we expect one forward false positive hit per decoy false
 positive hit, hence the doubling term.

                               From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214
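
A minimal sketch of this formula in Python; the function name and the hit counts in the example are made up for illustration:

```python
def fdr_doubled(nbr_forward_hits, nbr_decoy_hits):
    """FDR = 2 * decoy hits / (forward hits + decoy hits)."""
    total_hits = nbr_forward_hits + nbr_decoy_hits
    if total_hits == 0:
        raise ValueError("cannot estimate an FDR without any hits")
    return 2 * nbr_decoy_hits / total_hits


# Made-up counts: 990 forward hits and 10 decoy hits.
print(f"{fdr_doubled(990, 10):.1%}")  # 2.0%
```

With 10 decoy hits, another 10 false positives are assumed to hide among the 990 forward hits, giving 20 suspect identifications out of 1000.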


Estimating the FDR (ii)



 \[ \mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}} \]


 This metric was proposed by Storey and Tibshirani for genomics data, and further
 investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!)
 estimate of the FDR, and can be extended to also take into account the (suspected)
 false positives in the forward set.



                                 See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
                                         See: Käll et al., JPR 2008, 7(1): 29-34
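
A minimal sketch of this simpler ratio in Python; the function name and the hit counts are made up for illustration, and the extension mentioned above is not modelled:

```python
def fdr_ratio(nbr_forward_hits, nbr_decoy_hits):
    """FDR = decoy hits / forward hits."""
    if nbr_forward_hits == 0:
        raise ValueError("cannot estimate an FDR without forward hits")
    return nbr_decoy_hits / nbr_forward_hits


# Made-up counts: 1000 forward hits and 10 decoy hits.
print(f"{fdr_ratio(1000, 10):.1%}")  # 1.0%
```

Here the decoy hits serve only as an estimate of the number of false positives among the forward hits, and are not themselves counted as identifications.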


Thank you!
                  Questions?
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

BITS - Overview of sequence databases for mass spectrometry data analysis

  • 2. Sequence databases. Lennart Martens (lennart.martens@ugent.be), Computational Omics and Systems Biology Group, Department of Medical Protein Research, VIB; Department of Biochemistry, Ghent University, Ghent, Belgium. BITS MS Data Processing – Sequence Databases, UGent, Gent, Belgium – 16 December 2011
  • 3. PEPTIDES AND REDUNDANCY IN SEQUENCE DATABASES
  • 4. Peptide-level sequence redundancy
      Non-redundant protein DB:
        >Protein 1
        LENNARTMARTENS
        >Protein 2
        LENNARTMARTENT
      Derived peptides:
        >Protein 1 (1-6)    LENNAR
        >Protein 1 (7-10)   TMAR
        >Protein 1 (11-14)  TENS
        >Protein 2 (1-6)    LENNAR  (= Protein 1 (1-6))
        >Protein 2 (7-10)   TMAR    (= Protein 1 (7-10))
        >Protein 2 (11-14)  TENT
      Non-redundant protein DB ≠ non-redundant peptide DB
      Database content: all peptide sequences in the database
      Database information: number of unique peptide sequences
      Database information ratio: database information / database content
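The three definitions above can be sketched in a few lines of Python. This is only an illustration: the `digest` helper uses a simplified cleavage rule (cut after every K or R, no missed cleavages, no mass limits), which is an assumption and not DBToolkit's implementation, and the function names are hypothetical.

```python
def digest(sequence):
    """Split a protein after every K or R (simplified rule, an assumption)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def information_ratio(proteins):
    """Return (content, information, ratio) for a list of protein sequences."""
    peptides = [pep for seq in proteins for pep in digest(seq)]
    content = len(peptides)           # all peptide sequences
    information = len(set(peptides))  # unique peptide sequences
    return content, information, information / content


content, information, ratio = information_ratio(
    ["LENNARTMARTENS", "LENNARTMARTENT"]
)
print(content, information, round(ratio, 2))
```

For the two example proteins the protein database is non-redundant, yet only four of the six peptides are unique.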
  • 5. Information ratios for common databases
      [Bar chart: database content and database information (peptide counts, left axis 0 to 12,000,000) with the information ratio (right axis 0% to 100%) for UniProtKB/SwissProt human, UniProtKB/TrEMBL human, Ensembl human, IPI human and NCBI nr human.]
      Settings: tryptic cleavage, 1 allowed missed cleavage, mass limits from 600 to 4000 Da.
      Ratios shown: 93%, 45%, 41%, 42%, 23%.
      Peptide counts shown: 10,307,319; 4,472,356; 3,491,778; 3,186,806; 2,394,844; 1,877,500; 1,584,806; 1,559,685; 1,466,927; 1,309,625.
  • 6. ENRICHING SEQUENCE DATABASES
  • 7. The influence of the sequence database
      [Diagram: a protein (N to C) undergoes in vivo processing, followed by enzymatic digest and subsequent NH2-terminal peptide isolation. The isolated peptide is not in the sequence database, so the database search ends in a miss instead of an ID.]
  • 8. An example: mitochondrial isovaleryl-CoA dehydrogenase
      MATATRLLGWRVASWRLRPPLAGFVS
      QRAHSLLPVDDAINGLSEEQRQLRE…
      …LDGIQCFGGNGYINDFPMGRFLRDA
      KLYEIGAGTSEVRRLVIGRAFNADFH
      N-terminal transit peptide (1-29); isovaleryl-CoA dehydrogenase (30-423). Positions 30, 47 and 423 are marked on the sequence.
  • 9. Extending the information content
      AHSLLPVDDAINGLSEEQR
        AHSLLPVDDAINGLSEEQR
        HSLLPVDDAINGLSEEQR
        SLLPVDDAINGLSEEQR
        LLPVDDAINGLSEEQR
        LPVDDAINGLSEEQR
        PVDDAINGLSEEQR
        VDDAINGLSEEQR
        ……
      [Diagram: the original search against the database ends in a miss; the revised search against the extended database yields an ID.]
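The list of progressively shorter peptides above can be generated with a short helper. This is a minimal sketch of N-terminal ragging, not DBToolkit's code; the function name and the `min_length` cut-off are hypothetical.

```python
def n_terminal_ragging(peptide, min_length=6):
    """Return all N-terminally truncated forms of a peptide.

    Residues are removed one at a time from the N-terminus until the
    remaining peptide would be shorter than min_length (an assumed cut-off).
    """
    return [peptide[i:] for i in range(1, len(peptide) - min_length + 1)]


ragged = n_terminal_ragging("AHSLLPVDDAINGLSEEQR")
for truncated in ragged[:3]:
    print(truncated)
```

Adding these truncated forms to the database lets a search engine match a peptide whose N-terminus was created by in vivo processing rather than by the digestion enzyme.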
  • 10. Another example: in vivo protein cleavage
      [Diagram: a protein (NH2 to COOH) with several R residues and one D residue. Caspase cleavage of this protein (for 50%) leaves both the intact protein and two fragments. After NH2-terminal peptide isolation, the NH2-terminal peptide of the new fragment is NOT IN DB!]
  • 11. Solving the issue: bifunctional enzymes
      [Diagram: a peptide (NH2 to COOH, ending in R) whose NH2-terminus is the result of the in vivo protease and whose COOH-terminus is the result of in vitro trypsin.]
      Creation of a bifunctional enzyme will generate the correct peptides!
      Arg-C definition:
        Title:Arg-C
        Cleavage:R
        Restrict:P
        Cterm
      Arg-C (N-term), Cathepsin (C-term) definition:
        Title:dual ArgC_Cathep
        Cleavage:DX R
        Restrict:P
        Cterm
  • 12. DBTOOLKIT AND DATABASE ON DEMAND
  • 13. Working with databases: DBToolkit
      http://genesis.UGent.be/dbtoolkit
      See: Martens et al., Bioinformatics 2005, 21(17): 3584-3585
  • 14. Summary of DBToolkit functionalities
      a) Enzymatic digestion using regular or ‘dual’ enzymes → proteins to peptides
      b) N-terminal or C-terminal ragging → enhancing the information content of the database
      c) Non-lossy redundancy clearing → raising database information ratio
      d) Create shuffled and reversed databases → false-positives testing
      e) Extract sequence-based subsets → a priori prediction of potential success rate
      f) Map peptides back to proteins (maximal annotation approach) → find all matching proteins, and select primaries
      etc …
  • 15. Database on Demand – DBToolkit online
      http://www.ebi.ac.uk/pride/dod
      See: Reisinger et al., Proteomics 2009, 9(18): 4421-4424
  • 16. WHY DOES PROCESSING MATTER?
  • 17. Serum degradation over time
      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
  • 18. Plasma degradation over time
      From: Yi et al., Journal of Proteome Research 2007, 6(5): 1768-1781
  • 19. TIME-LABILITY OF SEQUENCE DATABASES
  • 20. Example 1: HUPO PPP actualisation
      Bringing the PPP from IPI 2.21 to IPI 3.13
      Entries                   Count   % of total   % of parent
      Total                      1555
      Unchanged                  1048      67%
      Changed                     507      33%
        Propagated                338      22%       67% (of ‘Changed’)
        Defunct                   169      11%       33% (of ‘Changed’)
          Defunct (RFSQ_XP)        95       6%       56% (of ‘Defunct’)
          Defunct (Ensembl)        72       5%       43% (of ‘Defunct’)
          UniProt                   2       0%        1% (of ‘Defunct’)
      Annotation on the slide: both exist, 1 taxonomy now: RAT, 1 immunoglobulin
      1048 + 338 = 1386 recoverable (89.1%)
      See: Martens and Mueller et al., Proteomics 2006, 6(18): 5059-75
  • 21. Example 2: human blood platelets
      Bringing the Platelets from IPI 2.31 to IPI 3.13
      Entries                   Count   % of total   % of parent
      Total                       673
      Unchanged                   578      86%
      Changed                      95      14%
        Propagated                 78      12%       82% (of ‘Changed’)
        Defunct                    17       3%       18% (of ‘Changed’)
          Defunct (RFSQ_XP)         5       1%       29% (of ‘Defunct’)
          Defunct (Ensembl)        12       2%       71% (of ‘Defunct’)
      578 + 78 = 656 recoverable (97%)
      See: Martens and Mueller et al., Proteomics 2006, 6(18): 5059-75
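In both examples the recoverable entries are the unchanged plus the propagated ones, expressed as a fraction of the total. A minimal sketch that recomputes this from the counts on the two slides (the function name is illustrative only):

```python
def recoverable(total, unchanged, propagated):
    """Return (count, fraction) of entries that survive the database update."""
    count = unchanged + propagated
    return count, count / total


# Counts taken from the HUPO PPP and platelet slides.
ppp = recoverable(total=1555, unchanged=1048, propagated=338)
platelets = recoverable(total=673, unchanged=578, propagated=78)

print(f"HUPO PPP:  {ppp[0]} recoverable ({ppp[1]:.1%})")
print(f"Platelets: {platelets[0]} recoverable ({platelets[1]:.1%})")
```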
  • 22. Proteins sometimes age badly
      Adapted from: http://www.ebi.ac.uk/ipi
  • 23. THE PICR MAPPING SERVICE
  • 24. Identifiers through (name)space and time
      http://www.ebi.ac.uk/tools/picr
      - Limit search by taxonomy (pessimistic)
      - Submit accessions OR sequences (FASTA), with a 500 entry interactive limit (no batch limit)
      - Choose to return all mappings or only active ones
      - Select output format
      - Select one or many databases to map to in one request
      - Run search
      See: Côté et al., BMC Bioinformatics 2007, 8: 401
  • 25. Mapping results
  • 26. ESTIMATING FALSE DISCOVERY RATES: THE DECOY DATABASE APPROACH
  • 27. Decoy databases, the latest fashion
      Three main types of decoy DBs are used:
      - Reversed databases (easy): LENNARTMARTENS → SNETRAMTRANNEL
      - Shuffled databases (slightly more difficult): LENNARTMARTENS → NMERLANATERTSN (for instance)
      - Randomized databases (as difficult as you want it to be): LENNARTMARTENS → GFVLAEPHSEAITK (for instance)
      The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.
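The three decoy types can be sketched as follows. This is an illustration under simple assumptions, not how DBToolkit or any particular search engine builds decoys: the function names are hypothetical, and the randomized variant draws residues uniformly from the 20 standard amino acids, which is only one of many possible choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def reversed_decoy(sequence):
    """Reverse the sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Permute the residues; amino acid composition is preserved."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence, rng):
    """Draw a new sequence of equal length (uniform sampling is an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


rng = random.Random(2011)  # fixed seed so the decoys are reproducible
target = "LENNARTMARTENS"
print(reversed_decoy(target))
print(shuffled_decoy(target, rng))
print(randomized_decoy(target, rng))
```

Reversing and shuffling keep the amino acid composition of each entry, whereas a randomized database only keeps what the sampling model is built to keep.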
  • 28. Estimating the FDR (i)
      $$\mathrm{FDR} = \frac{2 \times \text{nbr\_decoy\_hits}}{\text{nbr\_forward\_hits} + \text{nbr\_decoy\_hits}}$$
      FDR is the False Discovery Rate: a metric that indicates what percentage of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should count not only the actual decoy hits, but also the ‘hidden’ false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.
      From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214
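As a worked example of this formula, the sketch below computes the FDR from hit counts. The function name and the counts (950 forward hits, 50 decoy hits) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fdr_doubled(nbr_forward_hits, nbr_decoy_hits):
    """FDR = 2 * decoy hits / (forward hits + decoy hits)."""
    total_hits = nbr_forward_hits + nbr_decoy_hits
    if total_hits == 0:
        raise ValueError("no identifications to estimate an FDR from")
    return 2 * nbr_decoy_hits / total_hits


# Hypothetical counts: 950 forward hits and 50 decoy hits.
fdr = fdr_doubled(nbr_forward_hits=950, nbr_decoy_hits=50)
print(f"FDR = {fdr:.1%}")
```

With these counts, the 50 decoy hits are doubled to 100 expected false positives among 1000 hits in total.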
  • 29. Estimating the FDR (ii)
      $$\mathrm{FDR} = \frac{\text{nbr\_decoy\_hits}}{\text{nbr\_forward\_hits}}$$
      This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, and can be extended to also take into account the (suspected) false positives in the forward set.
      See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
      See: Käll et al., JPR 2008, 7(1): 29-34
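A minimal sketch of this simpler estimate follows. The function name and the counts (950 forward hits, 50 decoy hits) are hypothetical, and the extension that accounts for suspected false positives in the forward set is not shown.

```python
def fdr_simple(nbr_forward_hits, nbr_decoy_hits):
    """FDR = decoy hits / forward hits."""
    if nbr_forward_hits == 0:
        raise ValueError("no forward hits to estimate an FDR from")
    return nbr_decoy_hits / nbr_forward_hits


# Hypothetical counts: 950 forward hits and 50 decoy hits.
fdr = fdr_simple(nbr_forward_hits=950, nbr_decoy_hits=50)
print(f"FDR = {fdr:.1%}")
```

Here the decoy hits serve only as an estimate of the number of false positives among the forward hits, and are not themselves counted as identifications.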
  • 30. Thank you! Questions?