ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets

Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
13th HUPO World Congress
Madrid, 5 October 2014
Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things

ProteomeXchange Consortium
• Goal: Development of a framework to allow
standard data submission and dissemination
pipelines between the main existing proteomics
repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE
(Cambridge, UK) and (very recently) MassIVE
(UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org

ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Researcher’s results (results, raw data*, metadata / manuscript) go to the receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data). Other labels: ProteomeCentral (metadata), journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs, reprocessed results, raw data*.]
Vizcaíno et al., Nat Biotechnol, 2014

MassIVE (UCSD)
• Joined ProteomeXchange in June 2014
• Only partial submissions. A few datasets so far.
http://proteomics.ucsd.edu/service/massive/

PASSEL: repository for SRM data
• Suitable for SRM assays
• Part of the PeptideAtlas set of resources.
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012

Manuscript just out detailing the process
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014
Example dataset:
PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data

PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or
peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to
PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by
PRIDE, search engine output files will be stored and
provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin,
workflow, instrumentation, submitter.
4. Other files (optional):
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
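As a rough illustration of how the file categories above could be assigned when preparing a submission, here is a minimal Python sketch. The category names follow the slide; the extension-to-category mapping and the function names are assumptions made for this example and are not taken from the PX submission tool.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to PX file category.
# The category names follow the slide; the extensions are assumptions.
EXTENSION_TO_CATEGORY = {
    ".raw": "RAW",
    ".mzid": "RESULT",
    ".mgf": "PEAK",
    ".fasta": "FASTA",
}


def categorize(filename):
    """Return the assumed category for one file, defaulting to OTHER."""
    return EXTENSION_TO_CATEGORY.get(Path(filename).suffix.lower(), "OTHER")


def group_by_category(filenames):
    """Group file names by category, keeping input order within each group."""
    groups = defaultdict(list)
    for name in filenames:
        groups[categorize(name)].append(name)
    return dict(groups)
```

For example, `group_by_category(["run01.RAW", "run01.mzid", "notes.txt"])` places the three files under RAW, RESULT and OTHER respectively.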

Complete vs Partial submissions: experimental metadata
[Figure panels: Complete, Partial]
General experimental metadata about the projects is similar.
However, at the assay level, the information in partial submissions is less detailed.

Complete vs Partial submissions: processed results
For complete submissions, the spectra can be linked to the processed identification
results and visualized.
[Figure panels: Complete, Partial]

Complete submissions using mzIdentML
An increasing number of tools support export to mzIdentML 1.1
[Figure labels: Search engines; mzIdentML; Search Engine Results + MS files]
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab
internal pipelines, …
- Referenced spectral files need to be submitted as well
(all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.

[Figure: ‘RESULT’ file generation. Columns: Tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others); ‘RESULT’ file generation (now: native file export to mzIdentML); Final ‘RESULT’ file (mzIdentML ‘RESULT’ + spectra files).]

PRIDE Inspector 2.0
Available for complete submissions
Wang et al., Nat Biotechnol, 2012
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector

PX submission tool: data submission
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (it transfers all the
files using Aspera or FTP).
• Command line alternative: some scripting is needed.
http://www.proteomexchange.org/submission

Uploading large datasets: Aspera
- Aspera is the default file transfer protocol to PRIDE:
- PX Submission tool
- Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!
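To put the "up to 50X" figure in perspective, here is a small back-of-the-envelope sketch for a 1.4 TB dataset (the approximate size quoted for PXD000065 in this talk). The FTP rate of 10 Mbit/s is a purely hypothetical value chosen for illustration, and the 50X factor is treated as a best case; real throughput depends on the network.

```python
def transfer_hours(size_tb, rate_mbit_s):
    """Hours needed to move size_tb terabytes at rate_mbit_s megabits/s."""
    size_bits = size_tb * 1e12 * 8
    return size_bits / (rate_mbit_s * 1e6) / 3600


ASSUMED_FTP_RATE_MBIT_S = 10  # hypothetical sustained FTP rate
SPEEDUP = 50                  # "up to 50X" upper bound from the slide

ftp_hours = transfer_hours(1.4, ASSUMED_FTP_RATE_MBIT_S)
aspera_hours = transfer_hours(1.4, ASSUMED_FTP_RATE_MBIT_S * SPEEDUP)
print(f"FTP: {ftp_hours:.0f} h, Aspera (best case): {aspera_hours:.1f} h")
```

Under these assumptions the upload drops from roughly 311 hours to roughly 6 hours.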

ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Top species (studied by at least 10 datasets):
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 700

Public data release: when does it happen?
• When the author tells us to do it (the authors can do it by
themselves)
• When we find out that a dataset has been published
• We look for PXD identifiers in PubMed abstracts.
• If your PXD identifier is not in the abstract, a paper may have
been published and the data is still private. Let us know!
• New web form on the PRIDE website to facilitate the process
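Looking for PXD identifiers in abstract text can be sketched with a regular expression. This is only an illustration, not the procedure PRIDE actually runs: the pattern assumes that an identifier is "PXD" followed by exactly six digits, as in the examples in this talk (e.g. PXD000764), and the function name is hypothetical.

```python
import re

# Assumption: identifiers look like "PXD" followed by exactly six digits,
# as in the examples on these slides (e.g. PXD000764).
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_ids(text):
    """Return the distinct PXD identifiers in text, in order of appearance."""
    seen = []
    for match in PXD_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

An abstract that never mentions its identifier yields an empty list, which is exactly the case where the dataset may stay private after publication.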

Partial submissions can be used to store other data types
• Everything can be stored, not only MS/MS data: very flexible
mechanism to be able to capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top down proteomics datasets.
• Mass Spectrometry Imaging datasets.
• Data independent acquisition techniques: e.g. SWATH-MS datasets.

Imaging MS datasets: partial submissions
[Figure panels C and D: from original publication [13]; reconstructed ProteomeXchange data]
1. Thermo RAW data / UDP
2. Mirion Software (JLU)

1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (European Bioinformatics Institute, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- ‘open data’ – free to reuse
- Anybody can do this!

ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset

Get notified about new PX datasets
- Subscribe to the RSS feed to receive information about
new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
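Once the feed has been downloaded, the announcements can be read with the Python standard library. The sketch below assumes the feed follows the usual RSS 2.0 layout (`item` elements with `title` and `link` children), as its file name suggests; the sample document is made up for the example and downloading is left out.

```python
import xml.etree.ElementTree as ET


def rss_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title.strip(), link.strip()))
    return items


# Made-up sample feed, used only to show the expected structure.
SAMPLE = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Made-up dataset announcement</title>
<link>http://example.org/dataset</link></item>
</channel></rss>"""

print(rss_items(SAMPLE))
```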
[Figure labels: ProteomeCentral, Researchers]

PX submission tool: HPP tags

HPP datasets are now tagged
The projects are now tagged and can be browsed as a group of datasets.
Tags for: HPP, C-HPP and B/D-HPP

HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
- 5 Spain
- 4 South Korea
- 3 Brazil, China
Only a small proportion of the datasets have been made
publicly available, at least through ProteomeXchange

Which are the most accessed datasets?
PXD Identifier | Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543

Which are the most accessed datasets?
[Figure: Total Numbers]

Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• It now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open source tools available to facilitate the process:
• File transfer speed should not be a problem (Aspera support)
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.

Acknowledgements
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
Andrew R. Jones
Lennart Martens
Juan Pablo Albar
Martin Eisenacher
Gil Omenn
Nuno Bandeira
And many other PX partners and
stakeholders
PRIDE Team
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Yasset Perez-Riverol
Noemi del Toro
Henning Hermjakob
EU FP7 grant number 260558

Questions?

Connecting different data types
How to connect different data types (genomics, metabolomics, etc)?
It can be used for:
- ArrayExpress/GEO identifiers
- MetaboLights identifiers
- etc, etc

Pilot project started in the context of ELIXIR
[Figure labels: PRIDE (EMBL-EBI), CSC, BILS, Site B, Site C, B2SAFE, ELIXIR, EUDAT CDI]
