PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
1. PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
Dr. Yasset Perez-Riverol
Twitter: @ypriverol
GitHub: ypriverol
Bioinformatician - PRIDE Group
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Proteomics Services, EMBL-EBI
• UniProt: Protein Sequences
• IntAct: Interactions
• PRIDE: MS/MS Data
• Reactome: Pathways
• BioModels
Yasset Perez-Riverol
yperez@ebi.ac.uk
BioHackathon 2014
Miyagi, Japan (Nov 9-14, 2014)
3. Overview
• The ProteomeXchange (PX) consortium
• PRIDE and ProteomeXchange
• PRIDE Components.
• Current and future developments.
4. ProteomeXchange Consortium
• Goal: Development of a framework to allow
standard data submission and dissemination
pipelines between the main existing proteomics
repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make data available and
reusable.
http://www.proteomexchange.org
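The common identifier space can be illustrated with a small check. This is a minimal sketch, assuming PXD identifiers follow the pattern of the examples in this deck (the prefix PXD followed by six digits, e.g. PXD000561). The function name is hypothetical and is not part of any ProteomeXchange tool.

```python
import re

# Assumption: "PXD" followed by exactly six digits, as in PXD000561.
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def is_pxd_identifier(text: str) -> bool:
    """Return True if text looks like a ProteomeXchange dataset identifier."""
    return bool(PXD_PATTERN.match(text.strip()))


if __name__ == "__main__":
    for candidate in ["PXD000561", "PXD000865", "PX000561", "PXD00056"]:
        print(candidate, is_pxd_identifier(candidate))
```

A check like this only validates the shape of an identifier; it does not confirm that the dataset exists.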
5. ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Submitted content: results, raw data*, metadata / manuscript. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data). Central portal: ProteomeCentral. Other nodes: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs. Legend: researcher's results, reprocessed results, raw data*, metadata.]
Vizcaíno et al., Nat Biotechnol, 2014
6. MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Joined ProteomeXchange in June 2014
7. PASSEL: repository for SRM data
http://www.peptideatlas.org/passel/
• Suitable for SRM assays
• Part of the PeptideAtlas set of resources.
Farrah et al., Proteomics, 2012
8. PRIDE: PRoteomics IDEntifications database
http://www.ebi.ac.uk/pride/archive/
Vizcaíno et al., Nucleic Acids Research, 2014
9. PX Submission workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to
PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by
PRIDE, search engine output files will be stored and
provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter based on Ontologies and
Controlled Vocabularies.
4. Other files (optional):
a. QUANT: Quantification-related results
b. PEAK: Peak list files
c. OTHER: Any other file type
e. FASTA
Ternent et al., Proteomics, 2014
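The distinction between complete and partial submissions can be sketched in a few lines. This is an illustration only, assuming each result file carries a format label; the class, the category names, and the function are hypothetical and do not reproduce the PX submission tool's own logic.

```python
from dataclasses import dataclass

# Formats that make a submission "complete" according to the slide above.
STANDARD_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}


@dataclass(frozen=True)
class SubmittedFile:
    name: str
    category: str      # hypothetical labels: "RAW", "RESULT", "PEAK", "QUANT", "FASTA", "OTHER"
    file_format: str   # e.g. "mzIdentML", "PRIDE XML", or a search engine's native format


def submission_type(files: list[SubmittedFile]) -> str:
    """Return "complete" if every result file uses a supported standard, else "partial"."""
    results = [f for f in files if f.category == "RESULT"]
    if not results:
        raise ValueError("a submission needs at least one result file")
    if all(f.file_format in STANDARD_RESULT_FORMATS for f in results):
        return "complete"
    return "partial"


if __name__ == "__main__":
    files = [
        SubmittedFile("run01.raw", "RAW", "vendor binary"),
        SubmittedFile("run01.mzid", "RESULT", "mzIdentML"),
    ]
    print(submission_type(files))
```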
10. Complete submissions using mzIdentML
An increasing number of tools support export to mzIdentML 1.1:
[Diagram: search engines export mzIdentML; search engine results + MS files]
- Mascot
- MS-GF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idconvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab-internal pipelines, …
- Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML
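Because the referenced spectral files must accompany an mzIdentML file, it can help to list them before submitting. The sketch below assumes the mzIdentML 1.1 layout in which each SpectraData element carries a location attribute; the sample document is a trimmed, non-schema-valid fragment made up for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def referenced_spectra_files(mzid_text: str) -> list[str]:
    """Return the location attribute of every SpectraData element."""
    root = ET.fromstring(mzid_text)
    locations = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the check works regardless of namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "SpectraData" and "location" in element.attrib:
            locations.append(element.attrib["location"])
    return locations


# Made-up fragment for illustration; a real file contains many more elements.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mzML"/>
      <SpectraData id="SD_2" location="file:///data/run02.mgf"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""

if __name__ == "__main__":
    for location in referenced_spectra_files(SAMPLE):
        print(location)
```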
11. mzTab
• Metadata (key-value pairs): Basic information about experiment and sample
• Protein (table-based): Basic information about protein identifications
• Peptide (table-based): Information about quantified peptides
• PSM (table-based): Information about identified spectra
• Small Molecule (table-based): Basic information about identified small molecules
http://mztab.googlecode.com
J. Griss et al., MCP, 2014
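The section layout of mzTab lends itself to a very small reader. This is a minimal sketch, assuming tab-separated lines whose first field is a section prefix (MTD for metadata; PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and rows of each table). The sample is heavily simplified and omits columns a valid mzTab file requires; the function name is hypothetical.

```python
# Header-line prefix -> row-line prefix for the table-based sections.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text: str):
    """Split mzTab text into a metadata dict and one list of row dicts per table."""
    metadata = {}
    headers = {}
    tables = {row_prefix: [] for row_prefix in HEADER_TO_ROW.values()}
    for line in text.splitlines():
        fields = line.split("\t")
        prefix, values = fields[0], fields[1:]
        if prefix == "MTD" and len(values) >= 2:
            metadata[values[0]] = values[1]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = values
        elif prefix in tables and prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], values)))
        # Anything else (comments, blank lines) is ignored in this sketch.
    return metadata, tables


# Simplified, made-up example; accession and column set are placeholders.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["MTD", "mzTab-version", "1.0.0"],
    ["MTD", "description", "Example file"],
    ["PRH", "accession", "description"],
    ["PRT", "EXAMPLE_1", "Example protein"],
    ["PSH", "sequence", "PSM_ID", "accession"],
    ["PSM", "PEPTIDEK", "1", "EXAMPLE_1"],
])

if __name__ == "__main__":
    metadata, tables = parse_mztab(SAMPLE)
    print(metadata)
    print(tables["PRT"])
    print(tables["PSM"])
```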
12. PRIDE Components: Submission Process
• PRIDE Converter
• PRIDE Inspector
• PX Submission Tool
13. PRIDE Components: PX submission tool
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (all files are transferred using Aspera or FTP).
• Command-line alternative: some scripting is needed.
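For scripted submissions, the file mappings are the easiest part to get wrong. The sketch below is a hypothetical pre-upload check, not the PX submission tool's own validation: it assumes the mappings are held as a dictionary from each result file to the raw or peak list files it refers to.

```python
def check_file_mappings(mappings: dict[str, list[str]], uploaded_files: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the mappings look consistent."""
    problems = []
    for result_file, sources in sorted(mappings.items()):
        if result_file not in uploaded_files:
            problems.append(f"{result_file}: result file is not in the upload list")
        if not sources:
            problems.append(f"{result_file}: no raw or peak list file mapped")
        for source in sources:
            if source not in uploaded_files:
                problems.append(f"{result_file}: mapped file {source} is not in the upload list")
    return problems


if __name__ == "__main__":
    mappings = {"run01.mzid": ["run01.raw"], "run02.mzid": []}
    uploaded = {"run01.mzid", "run01.raw", "run02.mzid"}
    for problem in check_file_mappings(mappings, uploaded):
        print(problem)
```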
http://www.proteomexchange.org/submission
14. PRIDE Inspector 2.0
• Available for complete submissions
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Quantitation (work in progress)
https://github.com/PRIDE-Toolsuite/
15. PRIDE Components: Pipelines and Visualization
Submission validation pipeline
• QC of submitted files.
• Metadata check.
Submission pipeline
• Add project to database (file locations, general statistics, metadata)
Publication pipeline
• Conversion of files to mzTab
• Conversion of spectra peaks to MGF
• Indexing of the information in a Solr server
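The peak-to-MGF step can be pictured with a small writer. This is a sketch of the general MGF text layout (BEGIN IONS, TITLE, PEPMASS, CHARGE, peak lines, END IONS), not the code used in the PRIDE publication pipeline; the function name and the number formatting are assumptions.

```python
def to_mgf(title: str, precursor_mz: float, charge: int, peaks: list[tuple[float, float]]) -> str:
    """Format one spectrum as an MGF record; peaks are (m/z, intensity) pairs."""
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={precursor_mz:.4f}"]
    if charge:
        # MGF writes the sign after the number, e.g. "2+".
        sign = "+" if charge > 0 else "-"
        lines.append(f"CHARGE={abs(charge)}{sign}")
    for mz, intensity in sorted(peaks):
        lines.append(f"{mz:.4f} {intensity:.1f}")
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_mgf("scan=1", 445.12, 2, [(200.1, 30.0), (100.05, 10.5)]), end="")
```

Sorting the peaks by m/z is a convenience in this sketch; the unknown-charge case (charge 0) simply omits the CHARGE line.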
16. PRIDE Components: Services & Web components
17. ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
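A dataset page can be addressed directly from its identifier. The sketch below only builds the link and makes no network request; it assumes the GetDataset page selects a dataset through an ID query parameter, which should be checked against the live service. The function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(pxd_id: str) -> str:
    """Build a ProteomeCentral link for one dataset identifier."""
    # Assumption: the dataset is selected with an "ID" query parameter.
    return f"{BASE_URL}?{urlencode({'ID': pxd_id})}"


if __name__ == "__main__":
    print(dataset_url("PXD000561"))
```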
18. ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Top species, each studied in at least 10 datasets:
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~5 TB
PXD000065: ~1.4 TB
Datasets/year:
2012: 102
2013: 527
2014: 700
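The figures on this slide are internally consistent, which a short check makes explicit. The numbers below are copied from the slide; the variable and function names are only illustrative.

```python
TOTAL_DATASETS = 1329

BY_TYPE = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}
BY_YEAR = {2012: 102, 2013: 527, 2014: 700}
PUBLIC_DATASETS = 691


def public_percentage(public: int, total: int) -> int:
    """Share of public datasets, rounded to a whole percent."""
    return round(100 * public / total)


if __name__ == "__main__":
    print(sum(BY_TYPE.values()) == TOTAL_DATASETS)   # type counts add up to the total
    print(sum(BY_YEAR.values()) == TOTAL_DATASETS)   # yearly counts add up to the total
    print(public_percentage(PUBLIC_DATASETS, TOTAL_DATASETS))
```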
19. Journals and Data Deposition
[Figure: number of submissions per journal (axes: Journal, Number of Submissions)]
21. Future developments
• Make the data reusable.
• Integration of different protein expression resources:
  • PRIDE
  • PeptideAtlas
  • ProteomicsDB
  • Human Proteome Map
PXD Identifier | Hits | Dataset title
PXD000561 | 153512 | A draft map of the human proteome
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome
22. Web Services: PROXI
[Diagram: PROXI clients; repositories & databases (each labeled PROXI); registry; data]
Perez-Riverol Y, Proteomics, 2014
23. Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• It now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open-source tools are available to facilitate the process.
• File transfer speed should not be a problem (Aspera support).
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
24. Acknowledgements
PRIDE Team
Juan A. Vizcaíno (Group Leader)
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Yasset Perez-Riverol
Noemi del Toro
Henning Hermjakob
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
MassIVE
Nuno Bandeira
And many other PX partners and
stakeholders