SlideShare a Scribd company logo
1 of 42
PRIDE and ProteomeXchange: A golden
age for working with public proteomics
data
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored.
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
ProteomeXchange: A Global, distributed proteomics
database
PASSEL
(SRM data)
PRIDE
(MS/MS data)
MassIVE
(MS/MS data)
Raw
ID/Q
Meta
jPOST
(MS/MS data)
Mandatory raw data deposition
since July 2015
• Goal: Development of a framework to allow standard data submission and
dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
New in 2016
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Stats (1): Data submissions to PRIDE Archive continue to
increase
1,950 datasets submitted to PRIDE Archive in 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Countries with at least
100 submitted datasets :
1019 USA
734 Germany
492 United Kingdom
470 China
273 France
209 Netherlands
173 Canada
165 Switzerland
157 Australia
148 Austria
142 Denmark
137 Spain
115 Sweden
109 Japan
100 India
Stats (2): 5,198 ProteomeXchange datasets in PRIDE (by March
2017)
Type:
3835 ‘Partial’ submissions (73.8%)
1363 ‘Complete’ submissions (26.2%)
Released: 3462 datasets (66.6%)
Unpublished: 1736 datasets (33.4%)
Data volume in PRIDE:
Total: ~320 TB
Number of files: ~670,000
PXD000320-324: ~ 4 TB
PXD002319-26 ~2.4 TB
PXD001471 ~1.6 TB
Top Species represented (at least
100 datasets):
2267 Homo sapiens
765 Mus musculus
201 Saccharomyces cerevisiae
169 Arabidopsis thaliana
154 Rattus norvegicus
124Escherichia coli
~ 1000 species in total
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Inspector Toolsuite: QC before
submission
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics,
2015
Perez-Riverol et al., MCP, 2016
• PRIDE Inspector - standalone tool to enable visualisation and validation of MS
data.
• Build on top of ms-data-core-api - open source algorithms and libraries for
computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE
XML.
• Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
Summary and QC charts Peptide spectra annotation and
visualization
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Public proteomics datasets are being increasingly
reused…
Martens & Vizcaíno, Trends Bioch Sci, 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Downloads are increasing
Data download volume for PRIDE Archive in 2016: 243 TB
0
50
100
150
200
250
300
2013 2014 2015 2016
Downloads in TBs
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
What do researchers think about the main uses
of public proteomics data?
• Verify published results
• Build Spectral libraries
• Find interesting datasets for reanalysis:
• Find new splice isoforms
• Discover new PTMs
Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Data repurposing
• Individual authors can reprocess raw data with new
hypotheses in mind (not taken into account by the
original authors).
• Proteogenomics studies.
• Discovery of new PTMs.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate of ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
OmicsDI: Portal for omics datasets
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Repurposing: new PTMs found
• Some examples (using phosphoproteomics data sets):
• O-GlcNAc-6-phosphate1
• Phosphoglyceryl2
• ADP-ribosylation3
1Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9
2Moellering & Cravatt, Science (2013) 341 549-553
3Matic et al., Nat Methods (2012) 9 771-2
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Introduction to Spectrum Clustering
spectra-cluster algorithm
Unidentified
spectrum
Spectrum identified
as peptide A
Spectrum identified
as peptide B
Consensus
spectra
(= data reduction)
Input Mass
Spectra
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
The spectra-cluster toolsuite
Clustering
• Command-line tool, graphical user interface and Hadoop
implementation of the spectra-cluster algorithm.
• Stand-alone tools optimised for small datasets
Develop-
ment
• Parser APIs for Java and Python
• spectra-cluster Java API to facilitate the development
of new clustering algorithms
Analysis
• Growing collection of simple-to-use tools for detailed analysis
• spectra-cluster-py Python framework available for the
development of own scripts
https://spectra-cluster.github.io
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Cluster - Concept
NMMAACDPR
NMMAACDPR
PPECPDFDPPR
NMMAACDPR
NMMAACDPR NMMAACDPR
Consensus spectrum
PPECPDFDPPR
Threshold: At least 3 spectra in a
cluster and ratio >70%.
Originally submitted identified spectra
Spectrum
clustering
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Cluster: Implementation
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Cluster: Second Implementation
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
• Clustered all public spectra in
PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
• Griss et al., Nat. Methods,
2016
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
One perfect cluster in PRIDE Cluster web
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
http://www.ebi.ac.uk/pride/cluster/
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Cluster - Concept
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
2. Inferring identifications for originally unidentified spectra
Not identified
PPECPDFDPPR
PPECPDFDPPR
PPECPDFDPPR
PPECPDFDPPR
PPECPDFDPPR
Not identified
Not identified
Consensus spectrum
PPECPDFDPPR
Not identified
Originally submitted spectra
Spectrum
clustering
Identifications can be
inferred
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
2. Inferring identifications for originally unidentified spectra
29
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be
confirmed), often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
3. Consistently unidentified clusters
Not identified
Not identified
Not identified
Not identified
Consensus spectrum
Not identified
Not identified
Originally submitted spectra
Spectrum
clustering
Method to target
recurrent unidentified
spectra
??
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> Potential
Biomarkers??
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
3. Consistently unidentified clusters
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE Cluster as a Public Data Mining Resource
33
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• Spectral archives (including the Recurring Unidentified Spectra)
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Main uses of public proteomics data
• Verify published results
• Build Spectral libraries
• Find interesting datasets for reanalysis:
• Find new splice isoforms
• Discover new PTMs
Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Open analysis pipelines
• Goal: Development of open, reproducible and modular pipelines
based on Open MS for DDA (Data Dependent Acquisition)
approaches.
• Deployment in the EMBL-”Embassy Cloud”, with the goal that in the
future, they can be deployed in other cloud infrastructures, and
be reused by anyone in the community.
• Connected to PRIDE, bringing the tools closer to the data.
• We can use these pipelines to reanalyse PRIDE data.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Open analysis pipelines
• Goal: Development of open, reproducible and modular pipelines
based on Open MS for DDA (Data Dependent Acquisition)
approaches.
• Deployment in the EMBL-”Embassy Cloud”, with the goal that in the
future, they can be deployed in other cloud infrastructures, and
be reused by anyone in the community.
• Connected to PRIDE, bringing the tools closer to the data.
• We can use these pipelines to reanalyse PRIDE data.
• Recent 3-year BBSRC grant awarded to do the same for DIA
approaches (to start on December 2017).
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
• Main challenges:
• Lack of suitable metadata in existing datasets (essential for clinical
studies).
• Lack of specialised resources for clinicians.
• Challenges related to patient-identifiable data.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
PRIDE data
dissemination
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Summary
• Public proteomics datasets are on the rise! Reliable
infrastructure now exits.
• A lot of possibilities open for reuse of this data.
• New purposes: proteogenomics, new PTMs.
• It is possible to detect spectra that are consistently
unidentified across hundreds of datasets (maybe peptide
variants, or peptides with PTMs not initially considered).
• Starting to work in open and reproducible analysis pipelines.
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017
Aknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange
Juan A. Vizcaíno
juan@ebi.ac.uk
EMBL-EBI-Industry Programme Workshop
Hinxton, 15 June 2017

More Related Content

Similar to PRIDE and ProteomeXchange: A golden age for working with public proteomics data

Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRJuan Antonio Vizcaino
 
Experiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldExperiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldJuan Antonio Vizcaino
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Juan Antonio Vizcaino
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
PRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarPRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarJuan Antonio Vizcaino
 

Similar to PRIDE and ProteomeXchange: A golden age for working with public proteomics data (20)

Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIR
 
Experiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldExperiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics field
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
PRIDE and ProteomeXchange
PRIDE and ProteomeXchangePRIDE and ProteomeXchange
PRIDE and ProteomeXchange
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...
 
Human microbiome project
Human microbiome projectHuman microbiome project
Human microbiome project
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
PRIDE-ProteomeXchange
PRIDE-ProteomeXchangePRIDE-ProteomeXchange
PRIDE-ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
Pride cluster presentation
Pride cluster presentation Pride cluster presentation
Pride cluster presentation
 
PRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarPRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinar
 
Pride and ProteomeXchange
Pride and ProteomeXchangePride and ProteomeXchange
Pride and ProteomeXchange
 

More from Juan Antonio Vizcaino

Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Juan Antonio Vizcaino
 
Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formatsJuan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Juan Antonio Vizcaino
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Juan Antonio Vizcaino
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Juan Antonio Vizcaino
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataJuan Antonio Vizcaino
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Juan Antonio Vizcaino
 

More from Juan Antonio Vizcaino (18)

Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...
 
Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formats
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?
 
ProteomeXchange update 2017
ProteomeXchange update 2017ProteomeXchange update 2017
ProteomeXchange update 2017
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics data
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomics
 

Recently uploaded

Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayZachary Labe
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxVarshiniMK
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxFarihaAbdulRasheed
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Masticationvidulajaib
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 

Recently uploaded (20)

Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work Day
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptx
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Mastication
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 

PRIDE and ProteomeXchange: A golden age for working with public proteomics data

  • 1. PRIDE and ProteomeXchange: A golden age for working with public proteomics data Dr. Juan Antonio Vizcaíno EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Overview • PRIDE Archive and ProteomeXchange • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines
  • 3. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Overview • PRIDE Archive and ProteomeXchange • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines
  • 4. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches • Any type of data can be stored. PRIDE (PRoteomics IDEntifications) Archive http://www.ebi.ac.uk/pride/archive Martens et al., Proteomics, 2005 Vizcaíno et al., NAR, 2016
  • 5. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 ProteomeXchange: A Global, distributed proteomics database PASSEL (SRM data) PRIDE (MS/MS data) MassIVE (MS/MS data) Raw ID/Q Meta jPOST (MS/MS data) Mandatory raw data deposition since July 2015 • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. http://www.proteomexchange.org New in 2016 Vizcaíno et al., Nat Biotechnol, 2014 Deutsch et al., NAR, 2017
  • 6. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Stats (1): Data submissions to PRIDE Archive continue to increase 1,950 datasets submitted to PRIDE Archive in 2016
  • 7. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Countries with at least 100 submitted datasets : 1019 USA 734 Germany 492 United Kingdom 470 China 273 France 209 Netherlands 173 Canada 165 Switzerland 157 Australia 148 Austria 142 Denmark 137 Spain 115 Sweden 109 Japan 100 India Stats (2): 5,198 ProteomeXchange datasets in PRIDE (by March 2017) Type: 3835 ‘Partial’ submissions (73.8%) 1363 ‘Complete’ submissions (26.2%) Released: 3462 datasets (66.6%) Unpublished: 1736 datasets (33.4%) Data volume in PRIDE: Total: ~320 TB Number of files: ~670,000 PXD000320-324: ~ 4 TB PXD002319-26 ~2.4 TB PXD001471 ~1.6 TB Top Species represented (at least 100 datasets): 2267 Homo sapiens 765 Mus musculus 201 Saccharomyces cerevisiae 169 Arabidopsis thaliana 154 Rattus norvegicus 124Escherichia coli ~ 1000 species in total
  • 8. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Inspector Toolsuite: QC before submission Wang et al., Nat. Biotechnology, 2012 Perez-Riverol et al., Bioinformatics, 2015 Perez-Riverol et al., MCP, 2016 • PRIDE Inspector - standalone tool to enable visualisation and validation of MS data. • Build on top of ms-data-core-api - open source algorithms and libraries for computational proteomics. • Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML. • Broad functionality. https://github.com/PRIDE-Utilities/ms-data-core-api https://github.com/PRIDE-Toolsuite/pride-inspector Summary and QC charts Peptide spectra annotation and visualization
  • 9. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Overview • PRIDE Archive and ProteomeXchange • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines
  • 10. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Public proteomics datasets are being increasingly reused… Martens & Vizcaíno, Trends Bioch Sci, 2017
  • 11. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Downloads are increasing Data download volume for PRIDE Archive in 2016: 243 TB 0 50 100 150 200 250 300 2013 2014 2015 2016 Downloads in TBs
  • 12. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 What do researchers think about the main uses of public proteomics data? • Verify published results • Build Spectral libraries • Find interesting datasets for reanalysis: • Find new splice isoforms • Discover new PTMs Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.
  • 13. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 14. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 15. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Data repurposing • Individual authors can reprocess raw data with new hypotheses in mind (not taken into account by the original authors). • Proteogenomics studies. • Discovery of new PTMs.
  • 16. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Examples of repurposing datasets: proteogenomics Data in public resources can be used for genome annotation purposes
  • 17. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Public datasets from different omics: OmicsDI http://www.omicsdi.org/ • Aims to integrate of ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present). PRIDE MassIVE jPOST PASSEL GPMDB ArrayExpress Expression Atlas MetaboLights Metabolomics Workbench GNPS EGA Perez-Riverol et al., Nat Biotechnol, 2017
  • 18. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 OmicsDI: Portal for omics datasets
  • 19. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Repurposing: new PTMs found • Some examples (using phosphoproteomics data sets): • O-GlcNAc-6-phosphate1 • Phosphoglyceryl2 • ADP-ribosylation3 1Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9 2Moellering & Cravatt, Science (2013) 341 549-553 3Matic et al., Nat Methods (2012) 9 771-2
  • 20. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Overview • PRIDE Archive and ProteomeXchange • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines
  • 21. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Introduction to Spectrum Clustering spectra-cluster algorithm Unidentified spectrum Spectrum identified as peptide A Spectrum identified as peptide B Consensus spectra (= data reduction) Input Mass Spectra
  • 22. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 The spectra-cluster toolsuite Clustering • Command-line tool, graphical user interface and Hadoop implementation of the spectra-cluster algorithm. • Stand-alone tools optimised for small datasets Develop- ment • Parser APIs for Java and Python • spectra-cluster Java API to facilitate the development of new clustering algorithms Analysis • Growing collection of simple-to-use tools for detailed analysis • spectra-cluster-py Python framework available for the development of own scripts https://spectra-cluster.github.io
  • 23. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Cluster - Concept NMMAACDPR NMMAACDPR PPECPDFDPPR NMMAACDPR NMMAACDPR NMMAACDPR Consensus spectrum PPECPDFDPPR Threshold: At least 3 spectra in a cluster and ratio >70%. Originally submitted identified spectra Spectrum clustering
  • 24. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Cluster: Implementation • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • EBI farm, LSF • Griss et al., Nat. Methods, 2013
  • 25. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Cluster: Second Implementation • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • EBI farm, LSF • Griss et al., Nat. Methods, 2013 • Clustered all public spectra in PRIDE by summer 2015. • Apache Hadoop. • Starting with 256 M spectra. • 190 M unidentified spectra (they were filtered to 111 M for spectra that are likely to represent a peptide). • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on 30 node Hadoop cluster, 340 CPU cores • Griss et al., Nat. Methods, 2016
  • 26. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 One perfect cluster in PRIDE Cluster web - 880 PSMs give the same peptide ID - 4 species - 28 datasets - Same instruments http://www.ebi.ac.uk/pride/cluster/
  • 27. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Cluster - Concept
  • 28. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 2. Inferring identifications for originally unidentified spectra Not identified PPECPDFDPPR PPECPDFDPPR PPECPDFDPPR PPECPDFDPPR PPECPDFDPPR Not identified Not identified Consensus spectrum PPECPDFDPPR Not identified Originally submitted spectra Spectrum clustering Identifications can be inferred
  • 29. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 2. Inferring identifications for originally unidentified spectra 29 • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
  • 30. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 3. Consistently unidentified clusters Not identified Not identified Not identified Not identified Consensus spectrum Not identified Not identified Originally submitted spectra Spectrum clustering Method to target recurrent unidentified spectra ??
  • 31. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Consistently unidentified clusters (Recurring Unidentified Spectra) • 19 M clusters contain only unidentified spectra. • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides -> Potential Biomarkers?? • With various methods, we found likely identifications for about 20%. • Vast amount of data mining remains to be done.
  • 32. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 3. Consistently unidentified clusters
  • 33. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE Cluster as a Public Data Mining Resource 33 • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • Spectral archives (including the Recurring Unidentified Spectra) • All clustering results, as well as specific subsets of interest available. • Source code (open source) and Java API
  • 34. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Main uses of public proteomics data • Verify published results • Build Spectral libraries • Find interesting datasets for reanalysis: • Find new splice isoforms • Discover new PTMs Source: MassIVE workshop, ASMS 2017, Nuno Bandeira et al.
  • 35. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Overview • PRIDE Archive and ProteomeXchange • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines
  • 36. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Open analysis pipelines • Goal: Development of open, reproducible and modular pipelines based on Open MS for DDA (Data Dependent Acquisition) approaches. • Deployment in the EMBL-”Embassy Cloud”, with the goal that in the future, they can be deployed in other cloud infrastructures, and be reused by anyone in the community. • Connected to PRIDE, bringing the tools closer to the data. • We can use these pipelines to reanalyse PRIDE data.
  • 37. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Open analysis pipelines • Goal: Development of open, reproducible and modular pipelines based on Open MS for DDA (Data Dependent Acquisition) approaches. • Deployment in the EMBL-”Embassy Cloud”, with the goal that in the future, they can be deployed in other cloud infrastructures, and be reused by anyone in the community. • Connected to PRIDE, bringing the tools closer to the data. • We can use these pipelines to reanalyse PRIDE data. • Recent 3-year BBSRC grant awarded to do the same for DIA approaches (to start on December 2017).
  • 38. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 • Main challenges: • Lack of suitable metadata in existing datasets (essential for clinical studies). • Lack of specialised resources for clinicians. • Challenges related to patient-identifiable data.
  • 39. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 PRIDE data dissemination
  • 40. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Summary • Public proteomics datasets are on the rise! Reliable infrastructure now exits. • A lot of possibilities open for reuse of this data. • New purposes: proteogenomics, new PTMs. • It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered). • Starting to work in open and reproducible analysis pipelines.
  • 41. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017 Aknowledgements: People Attila Csordas Tobias Ternent Mathias Walzer Gerhard Mayer (de.NBI) Johannes Griss Yasset Perez-Riverol Manuel Bernal-Llinares Andrew Jarnuczak Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob Acknowledgements: The PRIDE Team All data submitters !!! @pride_ebi @proteomexchange
  • 42. Juan A. Vizcaíno juan@ebi.ac.uk EMBL-EBI-Industry Programme Workshop Hinxton, 15 June 2017