ProteomeXchange: Update for the C-HPP Consortium.
10th C-HPP Workshop: "Proteome data management and identification of missing proteins".
Bangkok, Thailand. 09/08/2014. Remote presentation.
1. ProteomeXchange: Update for the C-HPP Consortium
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
Juan A. Vizcaíno
juan@ebi.ac.uk
10th C-HPP Workshop
Bangkok, 9 August 2014
4. ProteomeXchange Consortium
• Goal: Development of a framework to allow
standard data submission and dissemination
pipelines between the main existing proteomics
repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and (very recently) MassIVE
(UCSD, San Diego).
• EU FP7 Coordination Action (CA), 01/2011 to 06/2014.
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
5. ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Labels: researcher's results, raw data* and metadata; receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data); ProteomeCentral (metadata / manuscript); journals; reprocessed results; PeptideAtlas, GPMDB, UniProt/neXtProt and other DBs.]
Vizcaíno et al., Nat Biotechnol, 2014
6. MassIVE (UCSD)
• Joined ProteomeXchange in June 2014
• Similar role to PRIDE (although not yet formalised).
http://proteomics.ucsd.edu/service/massive/
8. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
9. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• Focused on MS/MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
10. Manuscript just out detailing the process
Ternent et al., Proteomics, 2014, in press
http://www.proteomexchange.org/submission
11. Complete vs Partial submissions
• The challenge: PRIDE is an archival database that aims to provide long-term access to proteomics data from all workflows.
• The huge variety of proteomics workflows and file formats creates a data management nightmare.
• Previously, we had to decline submissions because they were in formats we could not handle.
• Solution: Complete and Partial submissions
• Complete submission
- All data in standard formats, accessible through PRIDE Inspector and web interface (not yet)
- Results searchable in DB
- Submission gets DOI
• Partial submission
- Part of data in non-standard formats
- Files are made available to download
- Only metadata searchable
- No DOI
• Metadata, raw data and results are mandatory in both cases; the results are just not parsed for partial submissions.
12. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
13. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted
to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported
by PRIDE, search engine output files will be stored and
provided in their original form.
14. Complete submissions
Search engine results + MS files
An increasing number of tools support export to mzIdentML 1.1
Search engines:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab
internal pipelines, …
- Referenced spectral files need to be submitted as well
Updated list:
http://www.psidev.info/tools-implementing-mzIdentML#.
15. Available for complete submissions
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
16. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted
to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported
by PRIDE, search engine output files will be stored and
provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter.
17. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to
PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by
PRIDE, search engine output files will be stored and
provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
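The complete/partial distinction in step 2 can be sketched as a small check on the result files. This is only an illustration: the function name `submission_type` and the extension-based test are assumptions made for the example, not how PRIDE or the PX submission tool actually decides (PRIDE XML files, for instance, cannot be told apart from other XML by extension alone).

```python
from pathlib import Path

# Assumption for this sketch: result files in PRIDE XML (.xml) or
# mzIdentML (.mzid) count as "standard"; anything else does not.
STANDARD_RESULT_SUFFIXES = {".xml", ".mzid"}


def submission_type(result_files):
    """Return 'COMPLETE' if every result file has a standard suffix, else 'PARTIAL'."""
    if not result_files:
        # Result files are mandatory for both submission types.
        raise ValueError("at least one result file is required")
    suffixes = {Path(name).suffix.lower() for name in result_files}
    return "COMPLETE" if suffixes <= STANDARD_RESULT_SUFFIXES else "PARTIAL"


# Hypothetical file names, for illustration only.
print(submission_type(["sample1.mzid", "sample2.mzid"]))
print(submission_type(["sample1.mzid", "sample2.dat"]))
```

A single result file in a non-standard format is enough to make the whole submission partial in this sketch, which mirrors the "part of data in non-standard formats" wording used for partial submissions.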
18. PX submission tool
• Capture the mappings between the different types of files.
• Make the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed
http://www.proteomexchange.org/submission
19. Fast file transfer with Aspera
Part C: Difficulties in Connections:
Q1. How to make the connections between local server and central DBs much faster and accessible (e.g. local server and ProteomeXchange)?
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!
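To get a feel for what "up to 50X faster" could mean in practice, here is a back-of-the-envelope estimate. The FTP rate of 10 MB/s is a purely hypothetical baseline, the 50X factor is the best case quoted above, and the 1.4 TB size is the approximate size quoted elsewhere in this deck for one large dataset.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes per second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


FTP_RATE_MB_S = 10.0   # hypothetical sustained FTP rate, not a measured value
ASPERA_SPEEDUP = 50    # best case ("up to 50X"); real transfers may be slower
DATASET_TB = 1.4       # approximate size of one large dataset

ftp_hours = transfer_hours(DATASET_TB, FTP_RATE_MB_S)
aspera_hours = transfer_hours(DATASET_TB, FTP_RATE_MB_S * ASPERA_SPEEDUP)
print(f"FTP: {ftp_hours:.1f} h, Aspera (best case): {aspera_hours * 60:.0f} min")
```

Under these assumptions the transfer drops from more than a day and a half to well under an hour; the actual gain depends on the network path and is not guaranteed.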
20. PX submission tool
21. PX submission tool: HPP tags
23. Partial submissions
• Everything can be stored, not only MS/MS data: very flexible mechanism to be able to capture all types of datasets
• Top down proteomics datasets
• Mass Spectrometry Imaging datasets
• Data independent acquisition techniques: e.g. SWATH-MS datasets, among other DIA approaches
24. ProteomeXchange: 1,148 datasets up until August 2014
Origin:
235 USA
142 Germany
97 United Kingdom
67 Switzerland
64 Netherlands
62 China
60 France
48 Canada
43 Spain
36 Belgium
32 Sweden
29 Australia
26 Denmark
23 Japan
18 Taiwan
17 India
16 Ireland
14 Norway
14 Italy
12 Finland
11 Republic of Korea
10 Brazil
8 Austria
7 Israel
7 Singapore …
Type:
386 PRIDE complete
687 PRIDE partial
51 PeptideAtlas/PASSEL complete
1 MassIVE
23 reprocessed
Publicly Accessible:
544 datasets, 50% of all
90% PRIDE
10% PASSEL
Top Species studied by at least 10
datasets:
510 Homo sapiens
142 Mus musculus
46 Saccharomyces cerevisiae
45 Arabidopsis thaliana
23 Rattus norvegicus
16 Escherichia coli
15 Bos taurus
15 Mycobacterium tuberculosis
13 Oryza sativa
12 Drosophila melanogaster
12 Glycine max
~ 265 species in total
Data volume:
Total: ~51 TB
Number of all files: ~130,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 519
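The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted above; the variable names are chosen for the example.

```python
# Counts copied from the slide.
TOTAL_DATASETS = 1148
by_type = {
    "PRIDE complete": 386,
    "PRIDE partial": 687,
    "PeptideAtlas/PASSEL complete": 51,
    "MassIVE": 1,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 519}
PUBLIC_DATASETS = 544

# Both breakdowns should add up to the headline total.
type_total = sum(by_type.values())
year_total = sum(by_year.values())
public_share = PUBLIC_DATASETS / TOTAL_DATASETS

print(type_total == TOTAL_DATASETS, year_total == TOTAL_DATASETS)
print(f"Publicly accessible: {public_share:.1%}")
```

Both breakdowns sum to 1,148, and the public share works out to roughly 47%, which the slide rounds to 50%.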
25. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
27. ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
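As a small illustration of looking up a single dataset, the sketch below builds a link from a PXD identifier. Two things are assumptions that do not come from this deck: that identifiers always follow the `PXD` plus six digits pattern seen in the examples quoted here (e.g. PXD000065), and that the page accepts the accession through a query parameter named `ID`.

```python
import re
from urllib.parse import urlencode

GET_DATASET_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: "PXD" followed by exactly six digits, as in PXD000065.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def dataset_url(accession):
    """Build a ProteomeCentral link for one PXD accession."""
    if not PXD_PATTERN.fullmatch(accession):
        raise ValueError(f"not a PXD identifier: {accession!r}")
    # Assumption: the query parameter is called "ID".
    return f"{GET_DATASET_URL}?{urlencode({'ID': accession})}"


print(dataset_url("PXD000065"))
```

Validating the identifier before building the URL keeps typos (a missing digit, a lowercase prefix) from silently producing a link that points nowhere.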
29. Get notified about new PX datasets
- Subscribe to the RSS feed to receive information about the new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
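For those who prefer to process the announcements programmatically, a minimal sketch of reading item titles from an RSS 2.0 document with the Python standard library follows. The sample feed is made up for the example and does not reproduce real announcements; in practice the XML would be downloaded from the feed URL above.

```python
import xml.etree.ElementTree as ET

# Made-up sample in RSS 2.0 layout; titles and links are placeholders.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example dataset announcements</title>
    <item>
      <title>Example announcement A</title>
      <link>http://example.org/a</link>
    </item>
    <item>
      <title>Example announcement B</title>
      <link>http://example.org/b</link>
    </item>
  </channel>
</rss>
""".strip()


def item_titles(rss_xml):
    """Return the title of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


for title in item_titles(SAMPLE_FEED):
    print(title)
```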
30. Reuse of datasets in PeptideAtlas can be tracked
31. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
32. HPP datasets are now tagged
The projects are now tagged and can be browsed as a group of datasets.
Tags for: HPP, C-HPP and B/D-HPP
33. HPP PX datasets
34. HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
- 5 Spain
- 4 South Korea
- 3 Brazil, China
Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange
35. For the near future in PRIDE…
• Complete data workflow for data visualization in
PRIDE Archive web
• Improvement of existing PRIDE REST API
• Incorporation of reprocessed data in PRIDE (in collaboration with Prof. L. Martens (VIB/Ghent) and Dr. A. Jones (U. Liverpool))
• Integration of the data in the EBI Molecular Atlas
36. Conclusions
• ProteomeXchange is widely used. It now has a new consortium member: MassIVE (UCSD).
• PRIDE is getting a lot of data via ProteomeXchange. Pipeline in production since summer 2012. More than 1,100 datasets have already been submitted.
• Half of them are already public.
• Different open source tools available to facilitate the process:
• File transfer speed should not be a problem (Aspera support)
37. Acknowledgements
PRIDE Team
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Yasset Perez-Riverol
Noemi del Toro
Henning Hermjakob
EU FP7 grant number 260558
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
Andrew R. Jones
Lennart Martens
Juan Pablo Albar
Martin Eisenacher
Gil Omenn
And many other PX partners and
stakeholders
38. Questions?
39. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Miscellaneous
• Your questions for the discussion
40. Questions for discussion
Part A: Proteomic Dataset with PXD:
Q1. How to make data deposited through ProteomeXchange, or published data, available to the consortium members as well as public DB managers (GPMDB, neXtProt, PeptideAtlas and ProteinAtlas)?
• There is plenty of documentation available:
• http://www.proteomexchange.org/submission
• PRIDE documentation
• Original paper (PMID: 24727771) and submission tutorial paper
(PMID: 25047258).
• Resources need to be devoted to this…
41. Questions for discussion
Part A: Proteomic Dataset with PXD:
Q2. What can we do for such inaccessible datasets in the public DB?
• Contact the author directly and convince him/her to make a
submission to ProteomeXchange.
42. Questions for discussion
Part A: Proteomic Dataset with PXD:
Q3. Do we need to ask people to place a link with PXD identifier on the
Wiki in order to see which chromosome team placed which datasets
online (for sharing)?
• NO, in the PX submission tool it is possible to specify the tags for
HPP and/or C-HPP or B/D-HPP.
• Visit ‘ProteomeCentral’ and look for them.
• Visit PRIDE and look for them there (specific tags)
44. Questions for discussion
Part B: Proteogenomic Dataset:
Q1. How to make it easy to access or deposit proteogenomic datasets, including RNAseq and other types of genetic data?
• Not easy to do at present.
• NCBI and EBI have created “BioSamples databases” to be able to
link different studies performed using the same sample.
• Proteomics data could be submitted to PRIDE/ProteomeXchange
and RNAseq data to e.g. ArrayExpress (EBI), linked by the same
sample number.
• Sample IDs coming from the BioSamples DB are to be integrated in the PX submission tool and in PRIDE (expected before the end of the year).
45. Biosamples DB
http://www.ebi.ac.uk/biosamples/
46. Fast file transfer with Aspera
Part C: Difficulties in Connections:
Q1. How to make the connections between local server and central DBs much faster and accessible (e.g. local server and ProteomeXchange)?
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!