Proteomics repositories
Dr. Juan Antonio Vizcaíno
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2017
Hinxton, 20 July 2017
Overview
• Why share MS proteomics data?
• Types of information stored in MS proteomics repositories
• Main existing repositories and their main characteristics
  • No data reprocessing
  • Data reprocessing
• Other resources
Corresponding public repositories
• Genomics: DNA sequence databases (GenBank, EMBL, DDBJ)
• Transcriptomics: ArrayExpress (EBI), GEO (NCBI)
• Proteomics: MS proteomics resources (ProteomeXchange)
• Metabolomics: MetaboLights (MetabolomeXchange)
Data sharing in Proteomics
• Proteomics data can be very complex and its interpretation is
often troublesome and/or controversial.
• In other ‘omics’ fields, a data sharing ‘culture’ is well established.
Generally, it is considered to be good scientific practice.
• In proteomics, the ‘culture’ is definitely evolving in that direction.
A big shift has taken place in the last few years.
• Scientific journals and funding agencies are two of the main
drivers.
Reproducible Science
http://www.nature.com/nature/focus/reproducibility/
What is a proteomics publication in 2017?
• Proteomics studies generate potentially large amounts of
data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any
results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
Journal Submission Recommendations
• Journal guidelines recommend and/or mandate
submission to proteomics repositories:
  • Journals from the Nature group
  • Molecular and Cellular Proteomics
  • PLOS journals
• Funding agencies are enforcing public deposition of data
to maximize the value of the funds provided.
HPP guidelines version 2.1
Main types of information stored
• 1) Original experimental data recorded by the mass
spectrometer (primary data): raw data and peak lists.
• 2) Identification results inferred from the original primary
data
• 3) Quantification information
• 4) Experimental and technical metadata
• 5) Any other type of information (e.g. scripts)
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
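As a rough illustration of how these formats map onto the files in a submission, the sketch below classifies a file name by its extension. The extension conventions (`.mzML`, `.mzid`, `.mzq`, `.mzTab`, `.TraML`) and the function name `classify_psi_file` are assumptions made for this example; individual repositories and tools may accept other names, and a real check would validate file contents rather than the extension.

```python
from pathlib import Path

# Assumed extension conventions for the PSI formats listed on the slide.
# Matching is case-insensitive; a trailing ".gz" is ignored.
PSI_FORMATS = {
    ".mzml": ("mzML", "MS data"),
    ".mzid": ("mzIdentML", "Identification"),
    ".mzq": ("mzQuantML", "Quantitation"),
    ".mztab": ("mzTab", "Final results"),
    ".traml": ("TraML", "SRM"),
}


def classify_psi_file(filename):
    """Return (format name, content type) for a file name.

    Returns None when the extension is not one of the assumed
    PSI extensions above.
    """
    name = filename.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # strip the compression suffix
    return PSI_FORMATS.get(Path(name).suffix)


if __name__ == "__main__":
    for example in ["sample01.mzML", "results.mzid.gz", "notes.txt"]:
        print(example, "->", classify_psi_file(example))
```

A lookup table keeps the mapping easy to extend if a repository accepts additional file types.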
Proteomics repositories
• Many different workflows need to be supported. They provide
complementary ‘views’.
• No data reprocessing. Data are stored as ‘published’ or
originally analysed:
  • PRIDE Archive (focused on MS/MS data, but all data types are supported)
  • MassIVE (focused on MS/MS data)
  • jPOST (focused on MS/MS data)
  • PASSEL (SRM data only)
• Data reprocessing (MS/MS data):
  • PeptideAtlas and GPMDB
  • proteomicsDB and HPM
Resources that don’t reprocess data
1) Resources that try to represent the authors’ own analysis
view of the data.
• Various workflows are allowed and they can provide
complementary results.
• Data are not ‘updated’ over time. However, meta-analysis
on top of the stored results is possible.
• False discovery rates (FDRs) accumulate when datasets are combined.
• Main representatives: PRIDE Archive and MassIVE
(MS/MS data) and PeptideAtlas/PASSEL (SRM data).
• Data standards are essential.
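The point about FDR accumulation can be made concrete with a small calculation. The sketch below is a simplified model, not a method used by any of the repositories: it assumes that false identifications are random and never coincide between datasets (a worst case), while true identifications overlap, so the number of distinct true identifications in the union is supplied by the caller. The function name `combined_fdr` and all numbers are hypothetical.

```python
def combined_fdr(ids_per_dataset, fdr, unique_true_ids):
    """Estimate the FDR of the union of several filtered datasets.

    ids_per_dataset: number of identifications reported by each dataset
    fdr:             FDR threshold applied to every dataset individually
    unique_true_ids: distinct true identifications in the union (assumed known)

    Assumption: false identifications never coincide across datasets,
    so their expected counts simply add up.
    """
    expected_false = sum(n * fdr for n in ids_per_dataset)
    return expected_false / (expected_false + unique_true_ids)


if __name__ == "__main__":
    # Ten hypothetical datasets, each with 1000 identifications at 1% FDR,
    # all observing the same 990 true identifications.
    single = combined_fdr([1000], 0.01, 990)
    merged = combined_fdr([1000] * 10, 0.01, 990)
    print(f"single dataset FDR: {single:.3f}")
    print(f"merged datasets FDR: {merged:.3f}")
```

Under these assumptions the expected number of false identifications grows with every dataset added, while the set of true identifications saturates, so the merged FDR ends up well above the per-dataset threshold.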
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored.
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Data repository for MS proteomics data
• Tools available for users to analyse their own data
• Joined ProteomeXchange in June 2014.
• Starting to reanalyse datasets: MassIVE KB resource.
MassIVE: Do it yourself
1. MSGF+ - Database search engine
2. MSPLIT – Spectral Library Search Engine
3. ENOSI – ProteoGenomic Search Engine
4. MODa - Multi-blind modification database search engine
5. Spectral Networks – spectral alignment-based analysis and propagation of identifications
6. Multi-pass - MSPLIT, MSGFDB, MODa cascade search workflow
7. MSGFDB - Database search engine
8. MSPLIT-DIA – Spectral Library Search for SWATH
9. Upload your own! (mzIdentML, mzTab, TSV)
http://massive.ucsd.edu
http://proteomics.ucsd.edu
jPOST Repository site (www.jpost.org)
• Joined ProteomeXchange in July 2016
PASSEL: repository for SRM data
http://www.peptideatlas.org/passel/
• Suitable for SRM assays
• Uses the PSI standard TraML plus the output of the most popular vendor pipelines
• Started in 2012
• Part of the ProteomeXchange consortium
Farrah et al., Proteomics, 2012
Reprocessing repositories
• These resources collect MS raw data and reprocess them using
a single analysis pipeline and an up-to-date protein
sequence database.
• Advantage: they provide a ‘standardized’ and updated view
of the experimental data available.
• Disadvantage: only one common analysis method is used and there can be
information loss.
• The result differs from the authors’ view of the data.
• Main resources: GPMDB and PeptideAtlas (ISB, Seattle).
• Novel resources: proteomicsDB and the Human Proteome Map.
PeptideAtlas
http://www.peptideatlas.org
- Developed at the Institute for Systems Biology (ISB, Seattle, USA)
- Peptide identifications from MS/MS approaches
- Data are reprocessed using the popular Trans-Proteomic Pipeline (TPP)
- Uses PeptideProphet to derive a probability of correct identification for all contained peptides
Deutsch et al., Proteomics, 2005
Desiere et al., NAR, 2006
Deutsch et al., EMBO Rep, 2008
PeptideAtlas (2)
• All peptide IDs are mapped to Ensembl, using ProteinProphet to handle protein inference
• Provides proteotypic peptide predictions
• Limited metadata available
• Part of the HPP project
Deutsch et al., Proteomics, 2005
Desiere et al., NAR, 2006
Deutsch et al., EMBO Rep, 2008
PeptideAtlas builds
Builds are updated on a regular basis (usually once a year).
Examples of builds:
- Human (HPP context)
- Human plasma
- Human urine
- Human phospho-proteome
- Drosophila
- Mouse
- Mouse plasma
- Cow
- Yeast
- Pig
…
GPMDB (Global Proteome Machine DB)
http://gpmdb.thegpm.org/
• Originally developed by R. Beavis & R. Craig
• End point of the GPM proteomics pipeline, to aid in the process of validating peptide MS/MS spectra and protein coverage patterns.
• Platform for “Proteomics data analysis, reuse and validation for biomedical research”.
Craig et al., J Proteome Res, 2004
GPMDB (2)
• Data are reprocessed using the popular X!Tandem search engine or the X!Hunter spectral library search algorithm
• Also provides proteotypic peptides
GPMDB (3)
• Nice visualization features
• Provides very limited annotation with GO and BTO
• Some support for targeted approaches is available
• Part of the HPP consortium
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014
(Nature cover, 29 May 2014)
• Two independent groups claimed to have produced the
first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further
validation, but they generated a lot of discussion and put
proteomics in the spotlight.
• Two proteomics resources have been developed:
proteomicsDB and the Human Proteome Map (HPM).
ProteomicsDB
https://www.proteomicsdb.org/
• Data analysis using Mascot and MaxQuant
• The way the protein FDR is calculated is controversial
• Quantification information using label-free techniques
(batch effects)
• New datasets are added on a regular basis
proteomicsDB (2)
Human Proteome Map (HPM)
• Developed by the Pandey group.
• Data reanalysis using Mascot.
• Protein FDR is not mentioned at all in the
corresponding Nature paper.
• Static resource: it will not be updated
any longer.
http://www.humanproteomemap.org/
Chorus
https://chorusproject.org/pages/index.html
• Developed by M. MacCoss’ group in
Seattle (UW).
• Built on top of Amazon Cloud
technologies
• Provides data analysis capabilities for
the users
• Free for public datasets.
• The objective is to connect the data to
analysis tools in a cloud environment
MaxQB
CPTAC data portal
Conclusions
• Sharing MS proteomics data is important.
• The main existing proteomics repositories are
complementary in focus and functionality.
• Main characteristics of:
  • PeptideAtlas and GPMDB (they reprocess data)
  • PASSEL, MassIVE, jPOST and PRIDE Archive
  (they do not reprocess data; it is not their main mission)
• New resources: proteomicsDB, HPM.
Recommended reading
• Perez-Riverol et al., Proteomics, 2015. PMID: 25158685