1. Proteomics repositories
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2015
Hinxton, 10 December 2015
Overview
• Why share MS proteomics data?
• Types of information stored in MS proteomics repositories
• Main existing repositories and their main characteristics
- No data reprocessing
- Data reprocessing
• Recently developed resources
• Other resources
3. From the genome to the proteome
Genomics
Transcriptomics
Proteomics
4. Corresponding public repositories
• Genomics: DNA sequence databases (GenBank, EMBL, DDBJ)
• Transcriptomics: ArrayExpress (EBI), GEO (NCBI)
• Proteomics: MS proteomics resources (ProteomeXchange)
5. Data sharing in Proteomics
• Proteomics data can be very complex, and their interpretation is often troublesome and/or controversial.
• In other ‘omics’ fields, a data sharing ‘culture’ is well established. Generally, it is considered good scientific practice.
• In proteomics, the ‘culture’ is definitely evolving in that direction. A big shift has taken place in the last few years.
• Scientific journals and funding agencies are two of the main drivers.
6. Data sharing. Why?
• 1) Data producers are not always the best data analysts
Sharing data gives analysts access to real data, which in turn allows better analysis tools to be developed.
• 2) Meta-analysis of data can recycle previous findings for new tasks
Putting findings in the context of other findings increases their scope.
• 3) Sharing data allows independent review of the findings
Since actual replication of an experiment is often impossible, a re-analysis or spot checks on the obtained data become vitally important.
• 4) Direct benefit for the field
Development of fragmentation models, spectral libraries, SRM assays, ...
7. What is a proteomics publication in 2015?
• Proteomics studies generate potentially large amounts of data and results.
• Ideally, a proteomics publication needs to:
- Summarize the results of the study
- Provide supporting information for the reliability of any results reported
• Information in a publication:
- Manuscript
- Supplementary material
- Associated data submitted to a public repository
9. Main types of information stored
• 1) Original experimental data recorded by the mass spectrometer (primary data): raw data and peak lists
• 2) Identification results inferred from the original primary data
• 3) Quantification information
• 4) Experimental and technical metadata
• 5) Any other type of information (e.g. scripts)
10. Data types / PSI standard formats
• mzTab: Final results
• TraML: SRM
• mzQuantML: Quantitation
• mzIdentML: Identification
• mzML: MS data
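As a rough illustration of how the PSI formats above divide up the data types, the following Python sketch guesses the format of a file from its extension. The extension table and the function name `guess_psi_format` are assumptions made for this example; they are not part of any repository's API, and real submissions may use other naming conventions.

```python
from pathlib import Path

# Assumed mapping from lower-cased file extensions to the PSI format and
# the data type it covers. The extensions are conventions chosen for this
# sketch, not a rule enforced by any repository.
PSI_FORMATS = {
    ".mzml": ("mzML", "MS data"),
    ".mzid": ("mzIdentML", "Identification"),
    ".mzq": ("mzQuantML", "Quantitation"),
    ".traml": ("TraML", "SRM"),
    ".mztab": ("mzTab", "Final results"),
}


def guess_psi_format(filename):
    """Return (format, data type) for a file name, or None if unrecognised."""
    suffix = Path(filename).suffix.lower()
    return PSI_FORMATS.get(suffix)
```

For instance, `guess_psi_format("sample.mzML")` yields the mzML entry, while a file with an unknown extension yields `None`, so a caller can decide how to treat vendor raw files or peak lists separately.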
12. Proteomics repositories
• Main public MS-based proteomics repositories:
- PROteomics IDEntifications database (PRIDE Archive, EBI)
- Global Proteome Machine (GPMDB)
- PeptideAtlas (ISB, Seattle)
• Many others, more specialized: Human Proteinpedia, Genome Annotation Proteomics Pipeline (GAPP), …
• Recently developed ones: ProteomicsDB, CHORUS, MassIVE, iProx.
• Very diverse: different aims, functionalities, … but also complementary.
• Main focus is MS/MS data.
13. Proteomics repositories (2)
• Many different workflows need to be supported. They provide
complementary ‘views’.
• No data reprocessing. Data is stored as ‘published’ or
originally analysed:
• PRIDE Archive (MS/MS data)
• MassIVE (MS/MS data)
• PASSEL (SRM data)
• Data reprocessing (MS/MS data):
• PeptideAtlas
• GPMDB
15. Resources that don’t reprocess data
• These resources try to represent the authors’ own analysis view of the data.
• Various workflows are allowed and they can provide complementary results.
• Data are not ‘updated’ over time. However, meta-analysis on top is possible.
• False discovery rates (FDRs) accumulate when datasets are combined.
• Main representatives: PRIDE Archive and MassIVE (MS/MS data) and PeptideAtlas/PASSEL (SRM data).
• Data standards are essential.
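The accumulation of FDRs when datasets are combined can be illustrated with simple arithmetic. The sketch below assumes a deliberately pessimistic model that is not taken from the slides: every dataset reports the same set of true peptides, while the false identifications in each dataset are all different. Under that assumption, pooling datasets adds false hits but no new true ones. The function name and the example numbers are hypothetical.

```python
def combined_fdr(n_datasets, ids_per_dataset, fdr_per_dataset):
    """Estimate the FDR over unique peptides after pooling datasets.

    Assumed worst-case model: true identifications are identical in every
    dataset, false identifications never overlap between datasets.
    """
    false_per_dataset = ids_per_dataset * fdr_per_dataset
    true_shared = ids_per_dataset - false_per_dataset
    false_total = n_datasets * false_per_dataset
    return false_total / (true_shared + false_total)


# Ten datasets of 1000 peptides, each filtered at 1% FDR on its own.
print(round(combined_fdr(10, 1000, 0.01), 3))  # roughly 0.092
```

In this toy model a single dataset keeps its 1% FDR, but ten pooled datasets end up near 9%, which is why combined views usually need the error rate to be re-estimated rather than inherited from the individual submissions.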
16. PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry-based proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak
lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
17. MassIVE (UCSD)
• Mass spectrometry Interactive Virtual Environment
• Project led by Nuno Bandeira (Center for Computational Mass
Spectrometry, UCSD)
• Dataset storage and data submission
• MassIVE 1.0 – Tranche-like functionality
• Imported all data from Tranche
• Under development (the developers want to explore interaction among users). Not published yet.
http://proteomics.ucsd.edu/ProteoSAFe/datasets.jsp
18. MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Data repository for MS proteomics data
• Tools available for users to analyse their own data
• Joined ProteomeXchange in June 2014.
19. PASSEL: repository for SRM data
• Suitable for SRM assays
• Uses the PSI standard TraML plus the output of the most popular vendor pipelines
• Started in 2012
• Part of the PX consortium
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012
22. Reprocessing repositories
• These resources collect MS raw data and reprocess them using a single analysis pipeline and an up-to-date protein sequence database.
• Advantage: They provide a ‘standardized’ and updated view on the experimental data available.
• Disadvantage: Only one common analysis method is used and there can be information loss.
• Different from the authors’ view on the data.
• Main resources: GPMDB and PeptideAtlas (ISB, Seattle).
23. PeptideAtlas
http://www.peptideatlas.org
- Developed at the Institute for Systems Biology (ISB, Seattle, USA)
- Peptide identifications from MS/MS approaches
- Data are reprocessed using the popular Trans-Proteomic Pipeline (TPP)
- Uses PeptideProphet to derive a probability of correct identification for every peptide it contains
24. PeptideAtlas
• All peptide IDs are mapped to Ensembl using ProteinProphet (to handle protein inference)
• Provides proteotypic peptide predictions
• Limited metadata available
• Part of the Human Proteome Project (HPP)
Deutsch et al., Proteomics, 2005
Desiere et al., NAR, 2006
Deutsch et al., EMBO Rep, 2008
25. PeptideAtlas builds
Builds are updated on a regular basis (usually once a year)
Examples of builds:
- Human (HPP context)
- Human plasma
- Human urine
- Drosophila
- Mouse
- Mouse plasma
- Cow
- Yeast
…
26. GPMDB (Global Proteome Machine DB)
• Originally developed by R. Beavis & R. Craig
• End point of the GPM proteomics pipeline, to aid in the process of validating peptide MS/MS spectra and protein coverage patterns.
http://gpmdb.thegpm.org/
Craig et al., J Proteome Res, 2004
27. GPMDB
• Data are reprocessed using the popular X!Tandem search engine or the X!Hunter spectral library search algorithm
• Also provides proteotypic peptides
28. GPMDB
• Nice visualization features
• Provides very limited annotation with GO, BTO
• Some support for targeted approaches is available
• Part of the HPP consortium
29. The Human Proteome Project (HPP)
http://thehpp.org/
30. Main MS proteomics resources
GPMDB              | PeptideAtlas       | PRIDE Archive
Reprocesses data   | Reprocesses data   | No reprocessing
Editorial control  | Editorial control  | No editorial control
Limited annotation | Limited annotation | Detailed annotation
33. Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014
• Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further validation… but they generated a lot of discussion and put proteomics in the spotlight.
• Two proteomics resources have been developed: ProteomicsDB and the Human Proteome Map (HPM).
Nature cover, 29 May 2014
34. ProteomicsDB
https://www.proteomicsdb.org/
• Data analysis using Mascot and MaxQuant
• The way the protein FDR is calculated is controversial
• Quantification information using label-free techniques
• New datasets are added on a regular basis
36. Human Proteome Map (HPM)
• Developed by the Pandey group.
• Data reanalysis using Mascot.
• Protein FDR is not mentioned at all in the corresponding Nature paper.
http://www.humanproteomemap.org/
37. Chorus
https://chorusproject.org/pages/index.html
• Developed by M. MacCoss’ group
• Built on top of Amazon cloud technologies
• Provides data analysis capabilities for the users
• Free for public datasets. A fee needs to be paid for storing private information.
39. Other repositories
MaxQB
Human Proteinpedia
40. COPaKB
Cardiac Organellar Protein Atlas Knowledgebase
International collaboration (EMBL-EBI involved)
Windows Client and iPad App
Submit data for analysis in dta and mzML formats
Data submitted to a ProLuCID pipeline
No MS data download
42. Pep2pro (Arabidopsis)
http://fgcz-pep2pro.uzh.ch/
Centered on Arabidopsis data
Spectra can be downloaded one spectrum at a time
Quantitative information
Linked to gelmap.de (2DE)
43. Example of a repository of supporting data annotation: Steve Gygi’s lab
Supporting data from publications
Spectra annotation results and PTM evaluation data
Quantitative data
No data downloads
https://gygi.med.harvard.edu/phosphomouse
45. Why are repositories not more popular?
1. Don’t want to share data
• Researchers don’t like to be shown that they did not analyze the
data as well as they could have.
• Their FDR may be higher than they reported/think.
• Researchers are worried that they missed something in the data
that they could discover if they go back to it at a later date
• Don’t want other authors to get a publication from their data.
• However, this philosophy is changing gradually…
Slide from R. Chalkley
46. Why are repositories not more popular? (2)
2. Submission burden
• Getting data into correct format may require some work
• Author is not necessarily computer-savvy
• Having to also supply metadata is seen as a burden, if the
information is already present in an associated manuscript
• Associated raw data may be many GB in size; file transfer to
repository could take a while
Authors are impatient: want to spend time doing science, not
administration!
Slide from R. Chalkley
47. Conclusions
• Importance of sharing MS proteomics data
• The main existing proteomics repositories are complementary in focus and functionality
• Main characteristics of:
- PeptideAtlas and GPMDB (reprocess data)
- PASSEL, MassIVE and PRIDE Archive (at present they do not reprocess data)
• New resources: ProteomicsDB, HPM, Chorus
48. Recommended reading
• Vizcaíno et al., J. Proteomics, 2010. PMID: 20615486
• Perez-Riverol et al., Proteomics, 2015. PMID: 25158685