SlideShare a Scribd company logo
1 of 47
Is it feasible to identify novel biomarkers by
mining public proteomics data?
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Data resources at EMBL-EBI
Genes, genomes & variation
ArrayExpress
Expression Atlas PRIDE
InterPro Pfam UniProt
ChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene & protein expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions &
pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
Gene Ontology
Experimental Factor
Ontology
Literature & ontologies
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Corresponding public repositories
Genomics
Transcript-
omics
Proteomics
DNA sequence databases
(GenBank, EMBL, DDJB)
ArrayExpress (EBI), GEO (NCBI)
PRIDE (ProteomeXchange)
Metabolomics MetaboLights (MetabolomeXchange)
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data approach” -> PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data approach” -> PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
One slide intro to MS based proteomics
Hein et al., Handbook of Systems Biology, 2012
Proteins -> potential drug targets
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored.
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Stats (1): Data submissions to PRIDE Archive continue to
increase
1,950 datasets submitted to PRIDE Archive in 2016
… and still growing…
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Countries with at least
100 submitted datasets :
1019 USA
734 Germany
492 United Kingdom
470 China
273 France
209 Netherlands
173 Canada
165 Switzerland
157 Australia
148 Austria
142 Denmark
137 Spain
115 Sweden
109 Japan
100 India
Stats (2): 5,198 ProteomeXchange datasets in PRIDE (by March
2017)
Type:
3835 ‘Partial’ submissions (73.8%)
1363 ‘Complete’ submissions (26.2%)
Released: 3462 datasets (66.6%)
Unpublished: 1736 datasets (33.4%)
Data volume in PRIDE:
Total: ~385 TB
Number of files: ~670,000
PXD000320-324: ~ 4 TB
PXD002319-26 ~2.4 TB
PXD001471 ~1.6 TB
Top Species represented (at least
100 datasets):
2267 Homo sapiens
765 Mus musculus
201 Saccharomyces cerevisiae
169 Arabidopsis thaliana
154 Rattus norvegicus
124Escherichia coli
~ 1000 species in total
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Stats (3): Data growth in EBI resources
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Stats (4): Data growth in EBI resources
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Reuse of public proteomics data is growing very fast
Martens & Vizcaíno, Trends Bioch Sci, 2017 Vaudel et al., Proteomics, 2016
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
MS proteomics: Discovery proteomics (DDA)
in vivo in silico
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
The “dark” proteome
Sequence-based
search engines
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
The “dark” proteome
• Only ~25-30% of spectra in a typical proteomics
experiments are identified.
• What does that fraction of unidentified spectra correspond to?
• For sure, there will be artefacts (e.g. chimeric spectra).
• Undetected protein variants:
• What it is not included in the searched database cannot be
found.
• Peptide containing unexpected Post-Translational
Modifications (PTMs).
• Big potential to find biological relevant “proteoforms”.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Concept of “proteoform”
Could any of these “undetected” proteoforms be a biomarker?
Smith et al., Nat Methods, 2013
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Public data reanalysis -> Data repurposing
• Individual authors can reprocess raw data with new
hypotheses in mind (not taken into account by
the original authors).
• Proteogenomics studies.
• Discovery of new PTMs.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Across-omics -> Proteogenomics approaches
• Proteomics data is combined with genomics and/or
transcriptomics information, typically by using sequence
databases generated from DNA sequencing efforts, RNA-Seq
experiments, Ribo-Seq approaches, and long-non-coding
RNAs.
• Proteogenomics can be used to improve genome
annotation and are increasingly utilized to understand the
information flow from genotype to phenotype in complex
diseases such as cancer and to support personalized
medicine studies.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
MS proteomics: ProteoGenomics
in vivo in silico
Personal genomes
Personal
proteomes
Personalised medicine
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate of ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
OmicsDI: Portal for omics datasets
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Repurposing: new PTMs found
• Some examples (using phosphoproteomics data sets):
• O-GlcNAc-6-phosphate1
• Phosphoglyceryl2
• ADP-ribosylation3
1Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9
2Moellering & Cravatt, Science (2013) 341 549-553
3Matic et al., Nat Methods (2012) 9 771-2
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Recent examples of meta-analysis studies
Lund-Johanssen et al., Nat Methods, 2016 Drew et al., Mol Systems Biol, 2017
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data approach” -> PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Introduction to Spectrum Clustering
spectra-cluster algorithm
Unidentified
spectrum
Spectrum identified
as peptide A
Spectrum identified
as peptide B
Consensus
spectra
(= data reduction)
Input Mass
Spectra
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
The spectra-cluster toolsuite
Clustering
• Command-line tool, graphical user interface and Hadoop
implementation of the spectra-cluster algorithm.
• Stand-alone tools optimised for small datasets
Develop-
ment
• Parser APIs for Java and Python
• spectra-cluster Java API to facilitate the development
of new clustering algorithms
Analysis
• Growing collection of simple-to-use tools for detailed analysis
• spectra-cluster-py Python framework available for the
development of own scripts
https://spectra-cluster.github.io
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
PRIDE Cluster - Concept
NMMAACDPR
NMMAACDPR
PPECPDFDPPR
NMMAACDPR
NMMAACDPR NMMAACDPR
Consensus spectrum
PPECPDFDPPR
Threshold: At least 3 spectra in a
cluster and ratio >70%.
Originally submitted identified spectra
Spectrum
clustering
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
PRIDE Cluster: Second Implementation
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
• Clustered all public spectra in
PRIDE by summer 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
• Griss et al., Nat. Methods,
2016
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
One perfect cluster in PRIDE Cluster web
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
http://www.ebi.ac.uk/pride/cluster/
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
3. Consistently unidentified clusters
Not identified
Not identified
Not identified
Not identified
Consensus spectrum
Not identified
Not identified
Originally submitted spectra
Spectrum
clustering
Method to target
recurrent unidentified
spectra
??
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Consistently unidentified clusters (Recurring Unidentified Spectra)
• 19 M clusters contain only unidentified spectra.
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides -> Potential
Biomarkers?
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
3. Consistently unidentified clusters
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
PRIDE Cluster as a Public Data Mining Resource
36
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• Spectral archives (including the Recurring Unidentified Spectra)
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Applications of spectrum clustering…
37
• Applicable to small groups of “similar” datasets (e.g. in clinical
setting):
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g.
spectral searches, de novo).
• Improve protein quantification.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data approach” -> PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Open analysis pipelines
• Goal: Development of open, reproducible and modular pipelines
based on Open MS for DDA (Data Dependent Acquisition)
approaches.
• Deployment in the EMBL-”Embassy Cloud”, with the goal that in the
future, they can be deployed in other cloud infrastructures, and
be reused by anyone in the community (e.g. hospitals).
• Connected to PRIDE, bringing the tools closer to the data.
• We can use these pipelines to reanalyse PRIDE data.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Open analysis pipelines
• Goal: Development of open, reproducible and modular pipelines
based on Open MS for DDA (Data Dependent Acquisition)
approaches.
• Deployment in the EMBL-”Embassy Cloud”, with the goal that in the
future, they can be deployed in other cloud infrastructures, and
be reused by anyone in the community (e.g. hospitals).
• Connected to PRIDE, bringing the tools closer to the data.
• We can use these pipelines to reanalyse PRIDE data.
• Recent 3-year BBSRC grant awarded to do the same for DIA
approaches (to start on December 2017).
• In collaboration with Manchester University/ Stoller Center.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Overview
• Short introduction to proteomics and PRIDE
• Reuse of public proteomics data
• “Big data approach” -> PRIDE Cluster
• Open analysis pipelines
• Future perspectives
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
• Main challenges:
• Lack of suitable metadata in existing datasets (essential for clinical
studies).
• Lack of specialised resources for clinicians.
• Challenges related to patient-identifiable data.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Is proteomics data patient identifiable?
A couple of papers on
this topic in 2016
Clear guidelines and policy
still need to be developed
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Summary
• Public proteomics datasets are on the rise! Reliable
infrastructure now exists.
• A lot of possibilities open for reuse of this data.
• New purposes: proteogenomics, new PTMs.
• It is possible to mine public data using spectrum clustering
looking for new proteoforms (new potential biomarkers?)
• Starting to work in open and reproducible analysis pipelines.
• Aim: In the future they are made available to everyone in the
community.
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
Aknowledgements: People
Attila Csordas
Tobias Ternent
Mathias Walzer
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017
www.hupo2017.ie
Dublin 17-21st September
Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics in Laboratory Medicine
London, 30 June 2017

More Related Content

What's hot

dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020dkNET
 
Introduction to proteomics
Introduction to proteomicsIntroduction to proteomics
Introduction to proteomicsShryli Shreekar
 
Pathology is being disrupted by Data Integration, AI & Blockchain
Pathology is being disrupted by Data Integration, AI & BlockchainPathology is being disrupted by Data Integration, AI & Blockchain
Pathology is being disrupted by Data Integration, AI & BlockchainNatalio Krasnogor
 
Analysis with biological pathways:
Analysis with biological pathways: Analysis with biological pathways:
Analysis with biological pathways: Chris Evelo
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOAEBI
 
UniProt & Ontologies
UniProt & OntologiesUniProt & Ontologies
UniProt & OntologiesEric Jain
 
Searching for predictors of male fecundity
Searching for predictors of male fecunditySearching for predictors of male fecundity
Searching for predictors of male fecundityChirag Patel
 
Building a Model Organism Metabolome Database
Building a  Model Organism Metabolome DatabaseBuilding a  Model Organism Metabolome Database
Building a Model Organism Metabolome DatabaseChristoph Steinbeck
 
Multi-Omics Bioinformatics across Application Domains
Multi-Omics Bioinformatics across Application DomainsMulti-Omics Bioinformatics across Application Domains
Multi-Omics Bioinformatics across Application DomainsChristoph Steinbeck
 
MTH Khan CV-Detailed Oct10-2015
MTH Khan CV-Detailed Oct10-2015MTH Khan CV-Detailed Oct10-2015
MTH Khan CV-Detailed Oct10-2015Mahmud TH Khan
 

What's hot (13)

Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
 
Introduction to proteomics
Introduction to proteomicsIntroduction to proteomics
Introduction to proteomics
 
Pathology is being disrupted by Data Integration, AI & Blockchain
Pathology is being disrupted by Data Integration, AI & BlockchainPathology is being disrupted by Data Integration, AI & Blockchain
Pathology is being disrupted by Data Integration, AI & Blockchain
 
Analysis with biological pathways:
Analysis with biological pathways: Analysis with biological pathways:
Analysis with biological pathways:
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOA
 
PRIDE-ProteomeXchange
PRIDE-ProteomeXchangePRIDE-ProteomeXchange
PRIDE-ProteomeXchange
 
UniProt & Ontologies
UniProt & OntologiesUniProt & Ontologies
UniProt & Ontologies
 
Searching for predictors of male fecundity
Searching for predictors of male fecunditySearching for predictors of male fecundity
Searching for predictors of male fecundity
 
Building a Model Organism Metabolome Database
Building a  Model Organism Metabolome DatabaseBuilding a  Model Organism Metabolome Database
Building a Model Organism Metabolome Database
 
The Future of Computational Models for Predicting Human Toxicities
The Future of Computational Models for Predicting Human ToxicitiesThe Future of Computational Models for Predicting Human Toxicities
The Future of Computational Models for Predicting Human Toxicities
 
Multi-Omics Bioinformatics across Application Domains
Multi-Omics Bioinformatics across Application DomainsMulti-Omics Bioinformatics across Application Domains
Multi-Omics Bioinformatics across Application Domains
 
MTH Khan CV-Detailed Oct10-2015
MTH Khan CV-Detailed Oct10-2015MTH Khan CV-Detailed Oct10-2015
MTH Khan CV-Detailed Oct10-2015
 

Similar to Is it feasible to identify novel biomarkers by mining public proteomics data?

Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
 
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
Data submissions and archiving raw data in life sciences. A pilot with Proteo...
Data submissions and archiving raw data in life sciences. A pilot with Proteo...Data submissions and archiving raw data in life sciences. A pilot with Proteo...
Data submissions and archiving raw data in life sciences. A pilot with Proteo...Rafael C. Jimenez
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Juan Antonio Vizcaino
 

Similar to Is it feasible to identify novel biomarkers by mining public proteomics data? (20)

Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomics
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
 
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016
 
Pride cluster presentation
Pride cluster presentation Pride cluster presentation
Pride cluster presentation
 
OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
Proteomexchange
ProteomexchangeProteomexchange
Proteomexchange
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
Data submissions and archiving raw data in life sciences. A pilot with Proteo...
Data submissions and archiving raw data in life sciences. A pilot with Proteo...Data submissions and archiving raw data in life sciences. A pilot with Proteo...
Data submissions and archiving raw data in life sciences. A pilot with Proteo...
 
Bms 2010
Bms 2010Bms 2010
Bms 2010
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
 
Pride and ProteomeXchange
Pride and ProteomeXchangePride and ProteomeXchange
Pride and ProteomeXchange
 

More from Juan Antonio Vizcaino

ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateJuan Antonio Vizcaino
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataJuan Antonio Vizcaino
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataJuan Antonio Vizcaino
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRJuan Antonio Vizcaino
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)Juan Antonio Vizcaino
 

More from Juan Antonio Vizcaino (15)

PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 update
 
PRIDE and ProteomeXchange
PRIDE and ProteomeXchangePRIDE and ProteomeXchange
PRIDE and ProteomeXchange
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
ProteomeXchange update 2017
ProteomeXchange update 2017ProteomeXchange update 2017
ProteomeXchange update 2017
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics data
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIR
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)
 

Recently uploaded

A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.k64182334
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravitySubhadipsau21168
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Jshifa
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 

Recently uploaded (20)

A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified Gravity
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 

Is it feasible to identify novel biomarkers by mining public proteomics data?

  • 1. Is it feasible to identify novel biomarkers by mining public proteomics data? Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK E-mail: juan@ebi.ac.uk
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017
  • 3. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Data resources at EMBL-EBI Genes, genomes & variation ArrayExpress Expression Atlas PRIDE InterPro Pfam UniProt ChEMBL ChEBI Molecular structures Protein Data Bank in Europe Electron Microscopy Data Bank European Nucleotide Archive European Variation Archive European Genome-phenome Archive Gene & protein expression Protein sequences, families & motifs Chemical biology Reactions, interactions & pathways IntAct Reactome MetaboLights Systems BioModels Enzyme Portal BioSamples Ensembl Ensembl Genomes GWAS Catalog Metagenomics portal Europe PubMed Central Gene Ontology Experimental Factor Ontology Literature & ontologies
  • 4. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Corresponding public repositories Genomics Transcript- omics Proteomics DNA sequence databases (GenBank, EMBL, DDJB) ArrayExpress (EBI), GEO (NCBI) PRIDE (ProteomeXchange) Metabolomics MetaboLights (MetabolomeXchange)
  • 5. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data approach” -> PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 6. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data approach” -> PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 7. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 One slide intro to MS based proteomics Hein et al., Handbook of Systems Biology, 2012 Proteins -> potential drug targets
  • 8. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches • Any type of data can be stored. PRIDE (PRoteomics IDEntifications) Archive http://www.ebi.ac.uk/pride/archive Martens et al., Proteomics, 2005 Vizcaíno et al., NAR, 2016
  • 9. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Stats (1): Data submissions to PRIDE Archive continue to increase 1,950 datasets submitted to PRIDE Archive in 2016 … and still growing…
  • 10. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Countries with at least 100 submitted datasets : 1019 USA 734 Germany 492 United Kingdom 470 China 273 France 209 Netherlands 173 Canada 165 Switzerland 157 Australia 148 Austria 142 Denmark 137 Spain 115 Sweden 109 Japan 100 India Stats (2): 5,198 ProteomeXchange datasets in PRIDE (by March 2017) Type: 3835 ‘Partial’ submissions (73.8%) 1363 ‘Complete’ submissions (26.2%) Released: 3462 datasets (66.6%) Unpublished: 1736 datasets (33.4%) Data volume in PRIDE: Total: ~385 TB Number of files: ~670,000 PXD000320-324: ~ 4 TB PXD002319-26 ~2.4 TB PXD001471 ~1.6 TB Top Species represented (at least 100 datasets): 2267 Homo sapiens 765 Mus musculus 201 Saccharomyces cerevisiae 169 Arabidopsis thaliana 154 Rattus norvegicus 124Escherichia coli ~ 1000 species in total
  • 11. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Stats (3): Data growth in EBI resources
  • 12. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Stats (4): Data growth in EBI resources
  • 13. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 14. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Reuse of public proteomics data is growing very fast Martens & Vizcaíno, Trends Bioch Sci, 2017 Vaudel et al., Proteomics, 2016
  • 15. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 MS proteomics: Discovery proteomics (DDA) in vivo in silico
  • 16. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 The “dark” proteome Sequence-based search engines
  • 17. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 The “dark” proteome • Only ~25-30% of spectra in a typical proteomics experiments are identified. • What does that fraction of unidentified spectra correspond to? • For sure, there will be artefacts (e.g. chimeric spectra). • Undetected protein variants: • What it is not included in the searched database cannot be found. • Peptide containing unexpected Post-Translational Modifications (PTMs). • Big potential to find biological relevant “proteoforms”.
  • 18. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Concept of “proteoform” Could any of these “undetected” proteoforms be a biomarker? Smith et al., Nat Methods, 2013
  • 19. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Public data reanalysis -> Data repurposing • Individual authors can reprocess raw data with new hypotheses in mind (not taken into account by the original authors). • Proteogenomics studies. • Discovery of new PTMs.
  • 20. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Across-omics -> Proteogenomics approaches • Proteomics data is combined with genomics and/or transcriptomics information, typically by using sequence databases generated from DNA sequencing efforts, RNA-Seq experiments, Ribo-Seq approaches, and long-non-coding RNAs. • Proteogenomics can be used to improve genome annotation and are increasingly utilized to understand the information flow from genotype to phenotype in complex diseases such as cancer and to support personalized medicine studies.
  • 21. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 MS proteomics: ProteoGenomics in vivo in silico Personal genomes Personal proteomes Personalised medicine
  • 22. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Examples of repurposing datasets: proteogenomics Data in public resources can be used for genome annotation purposes -> Discovery of short ORFs, translated lncRNAs, etc
  • 23. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Public datasets from different omics: OmicsDI http://www.omicsdi.org/ • Aims to integrate of ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present). PRIDE MassIVE jPOST PASSEL GPMDB ArrayExpress Expression Atlas MetaboLights Metabolomics Workbench GNPS EGA Perez-Riverol et al., Nat Biotechnol, 2017
  • 24. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 OmicsDI: Portal for omics datasets
  • 25. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Repurposing: new PTMs found • Some examples (using phosphoproteomics data sets): • O-GlcNAc-6-phosphate1 • Phosphoglyceryl2 • ADP-ribosylation3 1Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9 2Moellering & Cravatt, Science (2013) 341 549-553 3Matic et al., Nat Methods (2012) 9 771-2
  • 26. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Recent examples of meta-analysis studies Lund-Johanssen et al., Nat Methods, 2016 Drew et al., Mol Systems Biol, 2017
  • 27. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data approach” -> PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 28. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Introduction to Spectrum Clustering spectra-cluster algorithm Unidentified spectrum Spectrum identified as peptide A Spectrum identified as peptide B Consensus spectra (= data reduction) Input Mass Spectra
  • 29. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 The spectra-cluster toolsuite Clustering • Command-line tool, graphical user interface and Hadoop implementation of the spectra-cluster algorithm. • Stand-alone tools optimised for small datasets Develop- ment • Parser APIs for Java and Python • spectra-cluster Java API to facilitate the development of new clustering algorithms Analysis • Growing collection of simple-to-use tools for detailed analysis • spectra-cluster-py Python framework available for the development of own scripts https://spectra-cluster.github.io
  • 30. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 PRIDE Cluster - Concept NMMAACDPR NMMAACDPR PPECPDFDPPR NMMAACDPR NMMAACDPR NMMAACDPR Consensus spectrum PPECPDFDPPR Threshold: At least 3 spectra in a cluster and ratio >70%. Originally submitted identified spectra Spectrum clustering
  • 31. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 PRIDE Cluster: Second Implementation • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • EBI farm, LSF • Griss et al., Nat. Methods, 2013 • Clustered all public spectra in PRIDE by summer 2015. • Apache Hadoop. • Starting with 256 M spectra. • 190 M unidentified spectra (they were filtered to 111 M for spectra that are likely to represent a peptide). • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on 30 node Hadoop cluster, 340 CPU cores • Griss et al., Nat. Methods, 2016
  • 32. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 One perfect cluster in PRIDE Cluster web - 880 PSMs give the same peptide ID - 4 species - 28 datasets - Same instruments http://www.ebi.ac.uk/pride/cluster/
  • 33. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 3. Consistently unidentified clusters Not identified Not identified Not identified Not identified Consensus spectrum Not identified Not identified Originally submitted spectra Spectrum clustering Method to target recurrent unidentified spectra ??
  • 34. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Consistently unidentified clusters (Recurring Unidentified Spectra) • 19 M clusters contain only unidentified spectra. • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides -> Potential Biomarkers? • With various methods, we found likely identifications for about 20%. • Vast amount of data mining remains to be done.
  • 35. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 3. Consistently unidentified clusters
  • 36. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 PRIDE Cluster as a Public Data Mining Resource 36 • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • Spectral archives (including the Recurring Unidentified Spectra) • All clustering results, as well as specific subsets of interest available. • Source code (open source) and Java API
  • 37. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Applications of spectrum clustering… 37 • Applicable to small groups of “similar” datasets (e.g. in clinical setting): • Can be used to target spectra that are “consistently” unidentified. • Unidentified spectra could represent PTMs or sequence variants. • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo). • Improve protein quantification.
  • 38. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data approach” -> PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 39. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Open analysis pipelines • Goal: Development of open, reproducible and modular pipelines based on Open MS for DDA (Data Dependent Acquisition) approaches. • Deployment in the EMBL-”Embassy Cloud”, with the goal that in the future, they can be deployed in other cloud infrastructures, and be reused by anyone in the community (e.g. hospitals). • Connected to PRIDE, bringing the tools closer to the data. • We can use these pipelines to reanalyse PRIDE data.
  • 40. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Open analysis pipelines • Goal: Development of open, reproducible and modular pipelines based on Open MS for DDA (Data Dependent Acquisition) approaches. • Deployment in the EMBL-”Embassy Cloud”, with the goal that in the future, they can be deployed in other cloud infrastructures, and be reused by anyone in the community (e.g. hospitals). • Connected to PRIDE, bringing the tools closer to the data. • We can use these pipelines to reanalyse PRIDE data. • Recent 3-year BBSRC grant awarded to do the same for DIA approaches (to start on December 2017). • In collaboration with Manchester University/ Stoller Center.
  • 41. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Overview • Short introduction to proteomics and PRIDE • Reuse of public proteomics data • “Big data approach” -> PRIDE Cluster • Open analysis pipelines • Future perspectives
  • 42. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 • Main challenges: • Lack of suitable metadata in existing datasets (essential for clinical studies). • Lack of specialised resources for clinicians. • Challenges related to patient-identifiable data.
  • 43. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Is proteomics data patient identifiable? A couple of papers on this topic in 2016 Clear guidelines and policy still need to be developed
  • 44. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Summary • Public proteomics datasets are on the rise! Reliable infrastructure now exists. • A lot of possibilities open for reuse of this data. • New purposes: proteogenomics, new PTMs. • It is possible to mine public data using spectrum clustering looking for new proteoforms (new potential biomarkers?) • Starting to work in open and reproducible analysis pipelines. • Aim: In the future they are made available to everyone in the community.
  • 45. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 Aknowledgements: People Attila Csordas Tobias Ternent Mathias Walzer Gerhard Mayer (de.NBI) Johannes Griss Yasset Perez-Riverol Manuel Bernal-Llinares Andrew Jarnuczak Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob Acknowledgements: The PRIDE Team All data submitters !!! @pride_ebi @proteomexchange
  • 46. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017 www.hupo2017.ie Dublin 17-21st September
  • 47. Juan A. Vizcaíno juan@ebi.ac.uk Bioinformatics in Laboratory Medicine London, 30 June 2017