Reusing and integrating public proteomics data to improve our knowledge of the human proteome

Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute (EMBL-EBI)
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
19th
C-HPP Symposium
Santiago, 16 June 2018
Overview
• Short introduction to PRIDE and ProteomeXchange
• Reuse of public proteomics data
• Integration of proteomics and genomics data
• Open analysis pipelines for proteomics data

PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• From July 2017, an ELIXIR core resource
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

ProteomeXchange: a global, distributed proteomics database
http://www.proteomexchange.org
• Framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Member repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), iProX (MS/MS data), PASSEL (SRM data)
• Data types in a submission: Raw, ID/Q, Meta
• Mandatory data deposition
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017

PRIDE data submissions and data growth
• May 2018 (320 datasets) was again a record month in terms of datasets submitted
• > 2,400 datasets submitted in 2017
• PRIDE contains >85% of all ProteomeXchange datasets
• Dataset PXD010000 reached on June 1st
[Charts: datasets submitted per month; datasets submitted per year]
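
Since dataset accessions such as PXD010000 are the handle for reusing a submission, a small helper for checking and linking them can be useful. The following is a minimal sketch, not an official client: it assumes accessions are "PXD" followed by six digits (as in the example above) and that a project page lives under a `/projects/` path below the PRIDE Archive URL, which is an assumption of this example.

```python
import re

# Assumed accession shape: "PXD" plus six digits, e.g. PXD010000.
PXD_PATTERN = re.compile(r"PXD\d{6}")

# Base URL taken from the slides; the "/projects/" path is an assumption.
PRIDE_ARCHIVE = "http://www.ebi.ac.uk/pride/archive"


def is_pxd_accession(text: str) -> bool:
    """Return True if text looks like a ProteomeXchange dataset accession."""
    return PXD_PATTERN.fullmatch(text) is not None


def pride_project_url(accession: str) -> str:
    """Build a (hypothetical) PRIDE Archive project URL for an accession."""
    if not is_pxd_accession(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return f"{PRIDE_ARCHIVE}/projects/{accession}"


def accession_number(accession: str) -> int:
    """Return the numeric part of an accession, e.g. 10000 for PXD010000."""
    if not is_pxd_accession(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return int(accession[3:])
```

For example, `accession_number("PXD010000")` gives the running dataset count that the milestone refers to.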

HPP statistics (HPP tags)
[Bar chart, vertical axis 0 to 180. HPP tags shown:]
• Human Proteome Project
• …ease-Driven Human Proteome Project (B/D-…
• Chromosome-centric Human Proteome Project (C-HPP)
• Cancer (B/D-HPP)
• Human Immuno-Peptidome Project (HUPO-HIPP)
• Liver (B/D-HPP)
• Protein Misfolding and Aggregation (B/D-HPP)
• Extreme Conditions (B/D-HPP)
• Human Brain Proteome Project (HUPO_HBPP) (B/D-HPP)
• Diabetes (B/D-HPP)
• Food and Nutrition (B/D-HPP)
• Glycoproteomics (B/D-HPP)
• Kidney Urine (B/D-HPP)
• Cardiovascular (B/D-HPP)
• Epigenetic Chromatin (B/D-HPP)
• EyeOME (B/D-HPP)
• China Human Proteome Project (CNHPP)

HPP statistics (Countries)
[Bar chart, vertical axis 0 to 60. Countries shown: China, USA, Spain, Canada, France, Germany, Republic of Korea, Switzerland, Australia, Italy, Japan, Netherlands, Brazil, India, Norway, South Korea, Thailand, Austria, Belgium, Denmark, Finland, Israel, Pakistan, Russia, Sweden, Taiwan]

Data re-use in proteomics keeps increasing
Data download volume for PRIDE in 2017: 295 TB
[Bar chart: downloads in TB per year, 2013 to 2017; vertical axis 0 to 350]

Data sharing in Proteomics
Vaudel et al., Proteomics, 2016

Public data re-analysis -> Data repurposing
• Individual authors (third parties) can re-analyze MS
proteomics raw data with new hypotheses in mind
(not taken into account by the original authors).
• Proteogenomics studies
• Meta-analysis studies

Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc

Examples of repurposing datasets: proteogenomics
Some studies have also been performed in model organisms (mouse, rat,
Drosophila) and in other organisms (Mycobacterium tuberculosis,
Helicobacter pylori, rice,…)

Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
…and others
Perez-Riverol et al., Nat Biotechnol, 2017

OmicsDI: Portal for omics datasets

Public data re-analysis -> Data repurposing
• Individual authors (third parties) can re-analyze MS
proteomics raw data with new hypotheses in mind
(not taken into account by the original authors).
• Proteogenomics studies
• Meta-analysis studies -> Analysing together a
large number of datasets

Reuse of public proteomics data is on the rise!
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016

My talk on Monday afternoon: “The functional landscape of
human phosphorylation”
1. Consistent re-analysis of PRIDE public datasets
2. Constructing a functional score for those phospho-sites (ML)
3. Validation of the score (in silico and in vivo)
Collaboration with Pedro Beltrao’s group
• Largest MS-based phospho-proteomics atlas to date
• Fully annotated at dataset level
• 101 cell lines/tissues (120 PXD datasets)
• 6,801 raw files (~5.2 TB)
• Running time ~2 months
• ~120k highly confident phospho-peptide identifications (<0.01 FDR, Ascore & ∆score filtered)
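
The slide reports a "<0.01 FDR" threshold but does not say how the FDR was estimated. A common approach in proteomics is target-decoy estimation, and the sketch below assumes it purely for illustration. The `(score, is_decoy)` representation, the "higher score is better" convention, and the function name are all hypothetical.

```python
from typing import List, Tuple

# One peptide-spectrum match: (score, is_decoy). Higher score = better (assumption).
Psm = Tuple[float, bool]


def filter_by_fdr(psms: List[Psm], max_fdr: float = 0.01) -> List[Psm]:
    """Return target PSMs from the longest top-ranked prefix whose
    estimated FDR (decoys / targets) does not exceed max_fdr."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0  # number of top-ranked PSMs inside the accepted prefix
    for rank, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR at this rank is the decoy-to-target ratio so far.
        if targets > 0 and decoys / targets <= max_fdr:
            accepted = rank
    # Report only target hits; decoys exist solely to estimate the error rate.
    return [p for p in ranked[:accepted] if not p[1]]
```

Site-localisation filters such as Ascore or ∆score would be applied as additional, separate criteria and are not modelled here.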

ProteoGenomics data integration in PRIDE
[Diagram, steps numbered 1 to 4: PX Submission Tool -> PRIDE -> PRIDE submission pipelines -> PRIDE web and API -> TrackHub Registry]
Automatically connecting proteomics data from original data submissions to PRIDE to genome browsers (Ensembl, UCSC browser)
Data in HUPO-PSI standard formats: mzIdentML, mzTab
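
To make the role of the standard formats more concrete, here is a minimal sketch of reading peptide-spectrum matches from mzTab, which is a tab-separated text format whose lines start with a section code (for example `MTD` for metadata, `PSH` for the PSM header, `PSM` for PSM rows). The sample content is invented and heavily simplified: real mzTab files carry many more mandatory columns, and the accessions shown are placeholders.

```python
from typing import Dict, List

# Invented, simplified mzTab fragment (real files have many more columns).
SAMPLE_MZTAB = "\n".join([
    "\t".join(["MTD", "mzTab-version", "1.0.0"]),
    "\t".join(["PSH", "sequence", "PSM_ID", "accession", "charge"]),
    "\t".join(["PSM", "ITVLEALR", "1", "PLACEHOLDER_1", "2"]),
    "\t".join(["PSM", "LAQFDYGR", "2", "PLACEHOLDER_2", "2"]),
])


def read_psm_rows(mztab_text: str) -> List[Dict[str, str]]:
    """Return one dict per PSM row, keyed by the column names in the PSH line."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in mztab_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "PSH":
            header = fields[1:]
        elif fields[0] == "PSM":
            if not header:
                raise ValueError("PSM row found before PSH header")
            rows.append(dict(zip(header, fields[1:])))
    return rows


PSM_ROWS = read_psm_rows(SAMPLE_MZTAB)
PEPTIDES = [row["sequence"] for row in PSM_ROWS]
```

For production use, a dedicated mzTab or mzIdentML library would be a safer choice than hand-written parsing.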

Proteogenomics related formats

Mapping peptides to the genome: PoGo
Schlaffner CN, Pirklbauer G, Bender A, Choudhary JS. PoGo. Cell Systems, 5(2):152-156.e4
For each .pogo file:
• PTMs are standardized to a common representation using the PRIDE-Mod library.
• Each peptide references an Assay URL in PRIDE.
• Each .pogo file is generated automatically by the PRIDE pipeline.
chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0
chr1 1454464 1454488 ITVLEALR 1000 + 1454464 1454464 128,128,128 1 24 0
chr1 1456317 1456344 LFDWANTSR 1000 + 1456317 1456317 128,128,128 1 27 0
chr1 1459184 1459211 ATLNAFLYR 1000 + 1459184 1459184 128,128,128 1 27 0
chr1 1462609 1462633 LAQFDYGR 1000 + 1462609 1462609 128,128,128 1 24 0
chr1 1485135 1485159 ITVLEALR 1000 + 1485135 1485135 128,128,128 1 24 0
chr1 1486636 1486663 LFDWANTSR 1000 + 1486636 1486636 128,128,128 1 27 0
chr1 1490572 1490596 LAQFDYGR 1000 + 1490572 1490572 128,128,128 1 24 0
chr1 1522863 1522887 ITVLEALR 1000 + 1522863 1522863 128,128,128 1 24 0
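
The lines above follow the BED layout (chromosome, start, end, name, score, strand, then display fields), with the peptide sequence in the name column. As a minimal sketch, the first six columns can be read as follows; the class and function names are illustrative and not part of PoGo or PRIDE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideFeature:
    chrom: str
    start: int    # BED start is 0-based
    end: int      # BED end is exclusive
    peptide: str  # peptide sequence stored in the BED "name" column
    score: int
    strand: str


def parse_bed_line(line: str) -> PeptideFeature:
    """Parse the first six whitespace-separated BED columns of one line."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 BED columns, got {len(fields)}")
    chrom, start, end, peptide, score, strand = fields[:6]
    return PeptideFeature(chrom, int(start), int(end), peptide, int(score), strand)


FEATURE = parse_bed_line(
    "chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0"
)
# A peptide contained in a single block spans three nucleotides per residue.
SPAN = FEATURE.end - FEATURE.start
```

For the first record the span is 30 nucleotides, which matches the 10 residues of VLIPVFALGR and the block size in column 11.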
Future challenges:
• BED information can be extended with more information about transcript reliability:
  • Peptide uniqueness
  • Reliability score
• Native bigBed should be provided to avoid customizing new pipelines, etc.
• What to do with the unmapped peptides (which form long lists)?
• Maintainability.
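
Peptide uniqueness, listed among the challenges, can be illustrated directly on the nine BED records shown on this slide: a peptide that maps to more than one genomic locus is ambiguous evidence for any single locus. The sketch below only counts loci within those nine lines, so "unique" here means unique in this excerpt, not genome-wide; the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# The nine BED records from the slide, copied verbatim.
BED_TEXT = """\
chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0
chr1 1454464 1454488 ITVLEALR 1000 + 1454464 1454464 128,128,128 1 24 0
chr1 1456317 1456344 LFDWANTSR 1000 + 1456317 1456317 128,128,128 1 27 0
chr1 1459184 1459211 ATLNAFLYR 1000 + 1459184 1459184 128,128,128 1 27 0
chr1 1462609 1462633 LAQFDYGR 1000 + 1462609 1462609 128,128,128 1 24 0
chr1 1485135 1485159 ITVLEALR 1000 + 1485135 1485135 128,128,128 1 24 0
chr1 1486636 1486663 LFDWANTSR 1000 + 1486636 1486636 128,128,128 1 27 0
chr1 1490572 1490596 LAQFDYGR 1000 + 1490572 1490572 128,128,128 1 24 0
chr1 1522863 1522887 ITVLEALR 1000 + 1522863 1522863 128,128,128 1 24 0
"""

Locus = Tuple[str, int, int, str]  # (chrom, start, end, strand)


def loci_per_peptide(bed_text: str) -> Dict[str, Set[Locus]]:
    """Collect the distinct genomic loci that each peptide maps to."""
    loci: Dict[str, Set[Locus]] = defaultdict(set)
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, start, end, peptide, _score, strand = fields[:6]
        loci[peptide].add((chrom, int(start), int(end), strand))
    return dict(loci)


LOCI = loci_per_peptide(BED_TEXT)
UNIQUE_PEPTIDES = sorted(p for p, where in LOCI.items() if len(where) == 1)
```

In this excerpt ITVLEALR maps to three loci while VLIPVFALGR and ATLNAFLYR each map to one, which is the kind of information a uniqueness column could expose.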

TrackHub creation and Publication
https://www.trackhubregistry.org/
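
A track hub is, at its core, a small set of plain-text files (`hub.txt`, `genomes.txt` and one `trackDb.txt` per assembly) that point a genome browser at remotely hosted bigBed files. The sketch below generates minimal versions of these files. It is not the PRIDE pipeline's implementation, and the hub name, labels, email address and bigBed URL used in the example are hypothetical placeholders.

```python
def hub_txt(name: str, short_label: str, long_label: str, email: str) -> str:
    """Top-level hub description; points to genomes.txt."""
    return "\n".join([
        f"hub {name}",
        f"shortLabel {short_label}",
        f"longLabel {long_label}",
        "genomesFile genomes.txt",
        f"email {email}",
    ]) + "\n"


def genomes_txt(assembly: str) -> str:
    """Lists one assembly and where its trackDb.txt lives."""
    return f"genome {assembly}\ntrackDb {assembly}/trackDb.txt\n"


def track_db_txt(track: str, big_bed_url: str, short_label: str, long_label: str) -> str:
    """Describes one bigBed track with 12 BED columns."""
    return "\n".join([
        f"track {track}",
        f"bigDataUrl {big_bed_url}",
        f"shortLabel {short_label}",
        f"longLabel {long_label}",
        "type bigBed 12",
    ]) + "\n"


# Hypothetical example values, for illustration only.
HUB = hub_txt("example_hub", "Example peptides", "Example peptide evidence hub",
              "someone@example.org")
GENOMES = genomes_txt("hg38")
TRACK_DB = track_db_txt("example_peptides", "https://example.org/peptides.bb",
                        "Peptides", "Peptides mapped to the genome")
```

Once such files are hosted at a public URL, the hub can be registered so that browsers such as Ensembl and UCSC can discover it.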

UCSC Viewer
http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr12%3A20021505-100107519&hgsid=644445905_BScrPMQymPnGtl9O7jeSv1jS5Rm0
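
The viewer link above encodes the assembly (`db`) and the displayed region (`position`) as query parameters. As a small sketch, they can be recovered with the standard library; the URL below is a shortened form of the one on the slide, keeping only those two parameters, and the function name is illustrative.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit

# Shortened version of the slide's link: only db and position are kept.
VIEW_URL = (
    "http://genome.ucsc.edu/cgi-bin/hgTracks"
    "?db=mm10&position=chr12%3A20021505-100107519"
)


def parse_ucsc_view(url: str) -> Tuple[str, str, int, int]:
    """Return (assembly, chromosome, start, end) from a hgTracks URL."""
    query = parse_qs(urlsplit(url).query)  # also decodes %3A to ":"
    assembly = query["db"][0]
    chrom, _, span = query["position"][0].partition(":")
    start, _, end = span.partition("-")
    return assembly, chrom, int(start), int(end)


ASSEMBLY, CHROM, START, END = parse_ucsc_view(VIEW_URL)
```

Here the assembly is mm10, so this particular view shows mouse data on chromosome 12.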

Visualization in IGV

Reproducible Science
http://www.nature.com/nature/focus/reproducibility/

How to make data analysis pipelines reproducible
• That means using:
  • Exactly the same software (including the same version) in the same order.
  • The same protein sequence database (including the same version).
• If we use the same files as input to the software, we will get EXACTLY the same results.
• If that’s not the case, something has gone wrong.
• Computers are much more reliable than people.
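
One way to check the conditions listed above is to record, for every run, a manifest of input checksums, software versions and the sequence database version, and then compare manifests between runs. The following is a minimal sketch of that idea, not a description of the pipelines discussed in this talk; the tool names, version strings and database label in the example are hypothetical.

```python
import hashlib
import json
from typing import Dict


def sha256_of(data: bytes) -> str:
    """Checksum used to prove that two input files are byte-identical."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(inputs: Dict[str, bytes],
                   software: Dict[str, str],
                   sequence_database: str) -> str:
    """Serialise everything that defines a run into canonical JSON."""
    record = {
        "inputs": {name: sha256_of(content) for name, content in inputs.items()},
        "software": software,              # tool name -> exact version
        "sequence_database": sequence_database,
    }
    # sort_keys makes the text independent of dictionary insertion order.
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical example: the same inputs with a different tool version.
RUN_A = build_manifest({"sample.mgf": b"spectra"}, {"search_engine": "1.2.3"}, "db_2018_05")
RUN_B = build_manifest({"sample.mgf": b"spectra"}, {"search_engine": "1.2.3"}, "db_2018_05")
RUN_C = build_manifest({"sample.mgf": b"spectra"}, {"search_engine": "1.2.4"}, "db_2018_05")
```

Identical manifests (`RUN_A == RUN_B`) are a precondition for expecting identical results, while a difference such as the one in `RUN_C` points to what changed. Recording the order of the processing steps would need an additional ordered list, which this sketch leaves out.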

Developing pipelines in the cloud -> DDA data
Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
-> Connected to public proteomics data from

Cloud based infrastructure

Open analysis pipelines: DIA and proteogenomics
• Pipelines for DIA approaches.
  • In collaboration with the Stoller Center (Manchester) (co-PIs Graham, Hubbard & Townsend)
• Pipelines for proteogenomics approaches (project just started).
  • In collaboration with J. Choudhary (Institute of Cancer Research, London)
• Additional DDA pipelines (ELIXIR Proteomics Community).

Vision: total transparency and reproducibility
[Diagram: Input data -> Data analysis (analysis pipelines) -> Results]

Summary
• Public proteomics datasets are on the rise! Reliable (widely used) infrastructure now exists: PRIDE and ProteomeXchange.
• Many possibilities are open for reusing these data.
  • New purposes: proteogenomics, novel PTMs,...
• New infrastructure to integrate proteomics and genomics data.
• Developing open and reproducible analysis pipelines.
  • Supporting reproducible science
  • Aim: to make them available to everyone in the community in the future

Acknowledgements: People
Yasset Perez-Riverol
Johannes Griss
Suresh Hewapathirana
Tobias Ternent
Jingwen Bai
Attila Csordas
Deepti Jaiswal
Andrew Jarnuczak
Mathias Walzer
Gerhard Mayer (de.NBI)
Former team members, especially
Manuel Bernal-Linares & Henning
Hermjakob
Acknowledgements
All data submitters !!!
@pride_ebi
@proteomexchange
