The document discusses PRIDE and ProteomeXchange, resources for sharing public proteomics datasets. It describes how PRIDE stores mass spectrometry-based proteomics data and supports data sharing in the field. It also outlines the ProteomeXchange consortium which aims to standardize data submission and dissemination between proteomics repositories, and how data can be submitted to PRIDE using tools that support standard file formats.
1. PRIDE and ProteomeXchange: Share and explore public proteomics datasets like never before
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
Midwinter Proteomics Bioinformatics Seminar
Semmering, 15 January 2015
Overview
• PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
3.
Data sharing in Proteomics
• Public availability of data in proteomics enables:
• Reinterpretation (e.g. data reprocessing with different aims).
• Improved analysis software.
• Changes in protein sequence databases (e.g. proteogenomics studies).
• Consideration of new post-translational modifications.
• Validation of the reported experimental results.
• Specific use cases for proteomics: spectral libraries, fragmentation models, SRM transitions, …
4.
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
5.
PRIDE Archive
• New PRIDE database archival system in place since January 2014. Three iterations released so far; still a work in progress.
• Very flexible; its development has happened in parallel with:
• Implementation of ProteomeXchange.
• New community PSI data standards: mzIdentML, mzQuantML and mzTab.
6.
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).
• Tranche and Peptidome were initially included but have since been discontinued.
• Common identifier space (PXD identifiers).
• Two supported data workflows: MS/MS and SRM.
• Main objective: make life easier for researchers.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
7.
Current PSI Standard File Formats for MS
• mzTab (Griss et al., MCP, 2014): final results
• TraML (Deutsch et al., MCP, 2012): SRM
• mzQuantML (Walter et al., MCP, 2013): quantitation
• mzIdentML (Jones et al., MCP, 2012): identification
• mzML (Martens et al., MCP, 2011): MS data
9.
mzTab format: tab-delimited format (identification and quantification results)
http://code.google.com/p/mztab/
J. Griss et al., MCP, 2014
Q.W. Xu et al., Proteomics, 2014
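Because mzTab is plain tab-delimited text, a few lines of Python are enough to split a file into its sections. The sketch below is illustrative only: it assumes the three-letter line prefixes of the mzTab specification (MTD for metadata, PRH/PRT for the protein header and rows, and so on), the function name read_mztab is hypothetical, and the sample content is made up rather than taken from a real submission.

```python
from collections import defaultdict

# Hypothetical, heavily shortened mzTab content (not from a real dataset).
SAMPLE_MZTAB = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\tToy example\n"
    "COM\tA comment line\n"
    "PRH\taccession\tdescription\n"
    "PRT\tP12345\tExample protein A\n"
    "PRT\tQ67890\tExample protein B\n"
)

# Header prefix -> prefix of the data rows that the header describes.
HEADER_TO_ROWS = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Return (metadata dict, {row prefix: list of row dicts})."""
    metadata = {}
    columns = {}
    tables = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        prefix, values = fields[0], fields[1:]
        if not line.strip() or prefix == "COM":
            continue  # skip blank lines and comments
        if prefix == "MTD":
            metadata[values[0]] = values[1]
        elif prefix in HEADER_TO_ROWS:
            columns[HEADER_TO_ROWS[prefix]] = values
        elif prefix in columns:
            tables[prefix].append(dict(zip(columns[prefix], values)))
    return metadata, dict(tables)


metadata, tables = read_mztab(SAMPLE_MZTAB)
print(metadata["mzTab-version"])
print([row["accession"] for row in tables["PRT"]])
```

A real mzTab file carries many more columns and optional sections; the point here is only that the line prefix is enough to route each row.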
10.
Ways to access data in PRIDE Archive
• PRIDE web interface
• File repository
• REST web service
• PRIDE Inspector tool
12.
PRIDE Archive web interface (2)
• Next: visualization of spectra (in a couple of weeks)
13.
Programmatic access: PRIDE REST web service
http://www.ebi.ac.uk/pride/ws/archive/
• Intended to replace the most popular functionality provided by the PRIDE BioMart interface (now discontinued)
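As a rough idea of what programmatic access could look like, the sketch below builds a request URL for a single project from the base address given on the slide. The resource path project/<accession> and the JSON response are assumptions made for illustration and should be checked against the web service documentation; the function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://www.ebi.ac.uk/pride/ws/archive/"  # base address from the slide


def project_url(accession, base_url=BASE_URL):
    """Build the URL for one project.

    The 'project/<accession>' path is an assumption, not taken from the slides.
    """
    accession = accession.strip().upper()
    if not (accession.startswith("PXD") and accession[3:].isdigit()):
        raise ValueError(f"not a PXD identifier: {accession!r}")
    return f"{base_url.rstrip('/')}/project/{accession}"


def fetch_project(accession, timeout=30):
    """Download and decode a project description (needs network access).

    Assumes the service answers with JSON.
    """
    request = urllib.request.Request(
        project_url(accession), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(project_url("PXD000764"))
```

Keeping URL construction separate from the network call makes the identifier check easy to test offline.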
14.
Overview
• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
• A sneak peek at data reuse
15.
ProteomeXchange data workflow: PRIDE
[Figure: ProteomeXchange data workflow diagram. Labels: Metadata / Manuscript, Raw Data*, Results; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; Journals, UniProt/neXtProt, PeptideAtlas, GPMDB, Other DBs; Researcher’s results, Reprocessed results, Raw data*, Metadata]
16.
Manuscript published detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset: PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
17.
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
[Figure labels: Published, Raw files, Other files]
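The distinction between complete and partial submissions can be sketched as a small classification rule. This is a simplified illustration under stated assumptions, not the logic PRIDE applies: the labels QUANT, PEAK, GEL, OTHER, FASTA and SP_LIBRARY come from the slide, while RAW and SEARCH are assumed labels, and the names SubmittedFile and submission_kind are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FileType(Enum):
    RAW = "RAW"                # assumed label: mass spectrometer output files
    RESULT = "RESULT"          # result files in PRIDE XML or mzIdentML
    SEARCH = "SEARCH"          # assumed label: search engine output, original form
    QUANT = "QUANT"
    PEAK = "PEAK"
    GEL = "GEL"
    FASTA = "FASTA"
    SP_LIBRARY = "SP_LIBRARY"
    OTHER = "OTHER"


@dataclass(frozen=True)
class SubmittedFile:
    name: str
    file_type: FileType


def submission_kind(files):
    """Return 'complete' or 'partial' for a list of SubmittedFile objects."""
    types = {f.file_type for f in files}
    # Simplification: require raw data or peak lists as spectrometer output.
    if FileType.RAW not in types and FileType.PEAK not in types:
        raise ValueError("no mass spectrometer output files")
    if FileType.RESULT in types:
        return "complete"
    if FileType.SEARCH in types:
        return "partial"
    raise ValueError("no result files or search engine output files")


complete = [
    SubmittedFile("run01.raw", FileType.RAW),
    SubmittedFile("run01.mzid", FileType.RESULT),
]
partial = [
    SubmittedFile("run02.raw", FileType.RAW),
    SubmittedFile("run02_search_output.txt", FileType.SEARCH),
]
print(submission_kind(complete), submission_kind(partial))
```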
18.
Complete vs Partial submissions: experimental metadata
[Figure: Complete and Partial submissions side by side]
General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.
19.
Complete vs Partial submissions: processed results
[Figure: Complete and Partial submissions side by side]
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
20.
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files (the list can be extended):
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
[Figure labels: Published, Raw files, Other files]
21.
PRIDE Components: Submission Process
[Figure: PRIDE Converter 2, PRIDE Inspector and PX Submission Tool; result formats mzIdentML and PRIDE XML; step 1]
22.
Before: only file conversion to PRIDE XML
[Figure: original data files (search output files, spectra files) → ‘RESULT’ file generation (file conversion with PRIDE Converter) → final ‘RESULT’ file (PRIDE XML)]
23.
Now: native file export
[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) → ‘RESULT’ file generation (native file export) → final ‘RESULT’ file (mzIdentML), plus spectra files]
24.
Complete submissions
[Figure: search engine results + MS files; search engines → mzIdentML]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- PeptideShaker
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- Crux
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
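Since the spectra files referenced by an mzIdentML file must be part of the submission, it can help to list those references before uploading. The sketch below is one possible check, assuming the references sit in the location attribute of SpectraData elements as in mzIdentML 1.1; the sample document is a made-up fragment and the function names are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical, minimal mzIdentML fragment (not a complete, valid file).
SAMPLE_MZID = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" id="toy" version="1.1.0">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mgf"/>
      <SpectraData id="SD_2" location="file:///data/run02.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>
"""


def referenced_spectra_files(mzid_text):
    """Return the base names of all spectra files referenced in the document."""
    root = ET.fromstring(mzid_text)
    names = []
    for element in root.iter():
        # Compare local names so the sketch does not depend on the namespace URI.
        if element.tag.rsplit("}", 1)[-1] == "SpectraData":
            location = element.get("location", "")
            names.append(posixpath.basename(urlparse(location).path))
    return names


def missing_spectra_files(mzid_text, submitted_names):
    """Return referenced spectra files that are absent from the submission."""
    submitted = set(submitted_names)
    return [n for n in referenced_spectra_files(mzid_text) if n not in submitted]


print(referenced_spectra_files(SAMPLE_MZID))
print(missing_spectra_files(SAMPLE_MZID, ["run01.mgf"]))
```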
25.
In the near future: native file export
[Figure: tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) → ‘RESULT’ file generation (native file export) → final ‘RESULT’ file (mzTab), plus spectra files]
26.
PRIDE Components: Submission Process
[Figure: PRIDE Converter 2, PRIDE Inspector and PX Submission Tool; result formats mzIdentML and PRIDE XML; step 2]
27.
PRIDE Inspector 2
Wang et al., Nat Biotechnol, 2012
PRIDE Inspector 2 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Quantitation (work in progress)
https://github.com/PRIDE-Toolsuite/
28.
PRIDE Inspector 2
New visualisation functionality for Protein Groups
https://github.com/PRIDE-Toolsuite/
29.
PRIDE Inspector 2
Private review of files submitted to PRIDE
https://github.com/PRIDE-Toolsuite/
30.
PRIDE Components: Submission Process
[Figure: PRIDE Converter 2, PRIDE Inspector and PX Submission Tool; result formats mzIdentML and PRIDE XML; step 3]
31.
PX submission tool
http://www.proteomexchange.org/submission
[Figure labels: Published, Raw, Other files, PX submission tool]
• It selects and captures the mappings between the different types of files included in the submission.
• It transfers all the files using Aspera (default) or FTP.
• Command line alternative: some scripting is needed
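For anyone scripting a submission, the file mappings mentioned above amount to a small lookup table that is worth validating before transfer. The following is a minimal sketch under assumed conventions: files are identified by made-up numeric ids, types are plain strings such as RESULT and RAW, and the function check_mappings is hypothetical rather than part of the PX submission tool.

```python
def check_mappings(files, mappings):
    """Report result files whose mappings are missing or point nowhere.

    files:    {file_id: file_type}, e.g. {1: "RESULT", 2: "RAW"}
    mappings: {result_file_id: [ids of the files it relates to]}
    """
    problems = []
    for file_id, file_type in files.items():
        if file_type != "RESULT":
            continue  # only result files need mappings in this sketch
        targets = mappings.get(file_id, [])
        if not targets:
            problems.append(f"result file {file_id} has no mapped files")
        for target in targets:
            if target not in files:
                problems.append(
                    f"result file {file_id} maps to unknown file {target}"
                )
    return problems


# Hypothetical submission: two result files, two raw files, one broken mapping.
files = {1: "RESULT", 2: "RAW", 3: "RESULT", 4: "RAW"}
mappings = {1: [2], 3: [5]}
for problem in check_mappings(files, mappings):
    print(problem)
```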
33.
Fast file transfer with Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
- Also now available for downloading files
File transfer speed should not be a problem!
34.
Partial submissions can be used to store other data workflows
• Everything can be stored, not only MS/MS data (~90% of datasets): a very flexible mechanism that can capture all types of datasets.
• PRIDE does not store SRM data (it goes to PASSEL).
• Top-down proteomics datasets (10 datasets).
• Mass spectrometry imaging datasets (1 dataset).
• Data-independent acquisition techniques: e.g. SWATH-MS (9 datasets), HDMSE (1 dataset).
35.
[Figure: panels C and D, image from the original publication [13] next to the image reconstructed from ProteomeXchange data]
From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)
Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE (EBI, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- ‘Open data’: free to reuse
- Anybody can do this!
Römpp et al., 2014, Anal Bioanal Chem, in press
36.
Overview
• Introduction to PRIDE Archive (in the context of ProteomeXchange and the PSI standards)
• How to submit data to PRIDE: PRIDE tools
• ProteomeCentral, submission and access stats
37.
ProteomeXchange data workflow: ProteomeCentral
[Figure: ProteomeXchange data workflow diagram, as on slide 15]
38.
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
41.
Origin:
322 USA
197 Germany
148 United Kingdom
91 Netherlands
85 France
81 China
80 Switzerland
61 Canada
48 Belgium
47 Spain
45 Denmark
42 Australia
40 Japan
37 Sweden
28 Austria
22 India
21 Norway
21 Taiwan
20 Ireland
20 Finland
17 Italy
14 Brazil
13 Republic of Korea
13 Russia
10 Israel
9 Singapore …
ProteomeXchange: 1620 datasets up until 8th January 2015
Type:
526 PRIDE complete (32.5%)
982 PRIDE partial (60.6%)
63 PeptideAtlas/PASSEL complete
24 MassIVE
25 reprocessed
Publicly Accessible:
814 datasets, 50% of all
90% PRIDE
8% PASSEL
2% MassIVE
Data volume:
Total: ~71 TB
Number of all files: ~160,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Top Species studied by at least 10 datasets:
712 Homo sapiens
193 Mus musculus
65 Saccharomyces cerevisiae
61 Arabidopsis thaliana
35 Rattus norvegicus
34 Escherichia coli
17 Bos taurus
17 Glycine max
17 Mycobacterium tuberculosis
16 Drosophila melanogaster
14 Oryza sativa
~ 310 species in total
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 28
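The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers quoted above; the variable names are arbitrary. Both the per-type counts and the per-year counts add up to the stated total of 1620 datasets, and the quoted percentages follow from the counts.

```python
TOTAL = 1620  # datasets up until 8th January 2015

by_type = {
    "PRIDE complete": 526,
    "PRIDE partial": 982,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 24,
    "reprocessed": 25,
}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 28}
publicly_accessible = 814


def percent(part, whole):
    """Share of 'part' in 'whole', in percent, rounded to one decimal."""
    return round(100 * part / whole, 1)


shares = {name: percent(count, TOTAL) for name, count in by_type.items()}

print(sum(by_type.values()), sum(by_year.values()))
print(shares["PRIDE complete"], shares["PRIDE partial"])
print(percent(publicly_accessible, TOTAL))
```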
43.
Access statistics: PRIDE File repository
2014: The rise of proteomics data re-use
45.
PeptideShaker facilitates reuse of PRIDE data
SearchGUI (http://searchgui.googlecode.com): Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.
PeptideShaker (http://peptide-shaker.googlecode.com): Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015;33(1):22-4.
46.
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
47.
Conclusions
• Data submission and data reuse in the field are rising.
• PRIDE and ProteomeXchange enable this for you.
• Data standards are key for us.
• Quantification data depends on mzTab support.
48.
Acknowledgements: The PRIDE Team and collaborators
Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Florian Reisinger
Jose A. Dianes
Johannes Griss
Steven Lewis
Yasset Perez-Riverol
Henning Hermjakob
All ProteomeXchange partners, especially Eric Deutsch and Nuno Bandeira
49.
Acknowledgements: Funding
pride-ebi@ebi.ac.uk
pride-support@ebi.ac.uk
http://www.proteomexchange.org
http://code.google.com/p/pride-converter-2/
@pride_ebi