1. Reusing and integrating public proteomics data to improve our knowledge of the human proteome
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute (EMBL-EBI)
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
19th C-HPP Symposium
Santiago, 16 June 2018
Overview
• Short introduction to PRIDE and ProteomeXchange
• Reuse of public proteomics data
• Integration of proteomics and genomics data
• Open analysis pipelines for proteomics data
3. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• From July 2017, an ELIXIR core resource
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
4. ProteomeXchange: a global, distributed proteomics database
http://www.proteomexchange.org
• Framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Repositories shown: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), iProX (MS/MS data)
• Data types shown: Raw, ID/Q, Meta
• Mandatory data deposition
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
5. PRIDE data submissions and data growth
• May 2018 (320 datasets) was again a record month in terms of datasets submitted
• > 2,400 datasets submitted in 2017
• PRIDE contains >85% of all ProteomeXchange datasets
• Dataset PXD010000 reached on June 1st
[Charts: datasets submitted per month; datasets submitted per year]
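The dataset identifier on this slide (PXD010000) can be handled programmatically when collecting datasets for reuse. A minimal sketch, assuming accessions are "PXD" followed by six digits as in the example above; the function name parse_pxd is hypothetical and not part of any PRIDE tool:

```python
import re

# Assumption: a ProteomeXchange accession is "PXD" plus six digits,
# as in the PXD010000 example on the slide.
PXD_PATTERN = re.compile(r"^PXD(\d{6})$")


def parse_pxd(accession: str) -> int:
    """Return the numeric part of an accession, or raise ValueError."""
    cleaned = accession.strip().upper()
    match = PXD_PATTERN.match(cleaned)
    if match is None:
        raise ValueError(f"not a PXD accession: {accession!r}")
    return int(match.group(1))


if __name__ == "__main__":
    print(parse_pxd("PXD010000"))
```

Validating identifiers before any download or re-analysis step keeps malformed entries out of a dataset list early.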
6. HPP statistics (HPP tags)
[Bar chart, axis 0 to 180; bar values not recoverable. Categories: Human Proteome Project; Biology/Disease-Driven Human Proteome Project (B/D-…; Chromosome-centric Human Proteome Project (C-HPP); Cancer (B/D-HPP); Human Immuno-Peptidome Project (HUPO-HIPP); Liver (B/D-HPP); Protein Misfolding and Aggregation (B/D-HPP); Extreme Conditions (B/D-HPP); Human Brain Proteome Project (HUPO_HBPP) (B/D-HPP); Diabetes (B/D-HPP); Food and Nutrition (B/D-HPP); Glycoproteomics (B/D-HPP); Kidney Urine (B/D-HPP); Cardiovascular (B/D-HPP); Epigenetic Chromatin (B/D-HPP); Eye OME (B/D-HPP); China Human Proteome Project (CNHPP)]
7. HPP statistics (Countries)
[Bar chart, axis 0 to 60; bar values not recoverable. Countries: China, USA, Spain, Canada, France, Germany, Republic of Korea, Switzerland, Australia, Italy, Japan, Netherlands, Brazil, India, Norway, South Korea, Thailand, Austria, Belgium, Denmark, Finland, Israel, Pakistan, Russia, Sweden, Taiwan]
8. Overview
• Short introduction to PRIDE and ProteomeXchange
• Reuse of public proteomics data
• Integration of proteomics and genomics data
• Open analysis pipelines for proteomics data
9. Data re-use in proteomics keeps increasing
Data download volume for PRIDE in 2017: 295 TB
[Bar chart: downloads in TB per year, 2013 to 2017; axis 0 to 350]
11. Public data re-analysis -> Data repurposing
• Individual authors (third parties) can re-analyze MS
proteomics raw data with new hypotheses in mind
(not taken into account by the original authors).
• Proteogenomics studies
• Meta-analysis studies
13. Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes ->
Discovery of short ORFs, translated lncRNAs, etc
14. Examples of repurposing datasets: proteogenomics
Some studies have also been performed in model organisms (mouse, rat, Drosophila) and in other organisms (Mycobacterium tuberculosis, Helicobacter pylori, rice,…)
15. Public datasets from different omics: OmicsDI
http://www.omicsdi.org/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
• Resources shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA, …and others
Perez-Riverol et al., Nat Biotechnol, 2017
17. Public data re-analysis -> Data repurposing
• Individual authors (third parties) can re-analyze MS
proteomics raw data with new hypotheses in mind
(not taken into account by the original authors).
• Proteogenomics studies
• Meta-analysis studies -> Analysing together a
large number of datasets
18. Reuse of public proteomics data is on the rise!
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016
19. My talk on Monday afternoon: “The functional landscape of human phosphorylation”
Collaboration with Pedro Beltrao’s group
1. Consistent re-analysis of PRIDE public datasets
2. Constructing a functional score for those phospho-sites (ML)
3. Validation of the score (in silico and in vivo)
• Largest MS-based phospho-proteomics atlas to date
• Fully annotated at dataset level
• 101 cell lines/tissues (120 PXD datasets)
• 6,801 raw files (~5.2 TB)
• Running time ~ 2 months
• ~120k highly confident phospho-peptide identifications (<0.01 FDR, Ascore & ∆score filtered)
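The "<0.01 FDR" filter mentioned above could be sketched as follows. This assumes a target-decoy estimate (decoys divided by targets above a score cutoff); the slide does not say how the FDR was computed, and the Ascore and ∆score filters are not covered here. The function name filter_by_fdr and the (score, is_decoy) input layout are hypothetical.

```python
from typing import List, Tuple

Psm = Tuple[float, bool]  # (score, is_decoy); higher score means better match


def filter_by_fdr(psms: List[Psm], max_fdr: float = 0.01) -> List[Psm]:
    """Keep target PSMs above the loosest cutoff whose estimated FDR <= max_fdr.

    Assumption: FDR is estimated as decoys / targets among all PSMs
    scoring at or above the cutoff (a simple target-decoy estimate).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_cut = 0  # number of top-ranked PSMs to accept
    for position, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = position
    # Decoys are only used for the estimate; report targets only.
    return [p for p in ranked[:best_cut] if not p[1]]


if __name__ == "__main__":
    demo = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(filter_by_fdr(demo, max_fdr=0.01))
```

Applying one such cutoff consistently across all re-analysed datasets is what makes identifications from different submissions comparable.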
20. Overview
• Short introduction to PRIDE and ProteomeXchange
• Reuse of public proteomics data
• Integration of proteomics and genomics data
• Open analysis pipelines for proteomics data
21. ProteoGenomics data integration in PRIDE
Automatically connecting proteomics data from original data submissions to PRIDE to genome browsers (Ensembl, UCSC browser)
[Diagram with four numbered steps; components shown: PX Submission Tool, PRIDE, PRIDE submission pipelines, PRIDE web and API, TrackHub Registry]
Data in HUPO-PSI standard formats: mzIdentML, mzTab
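As an illustration of why standard formats help automated integration, here is a minimal sketch that reads the PSM section of an mzTab file. It relies only on the mzTab convention that the PSM header line starts with "PSH" and each PSM row starts with "PSM", with tab-separated fields. The function name read_psm_rows is hypothetical, and the sample text is made up for illustration (the protein accession is a placeholder, not real data).

```python
from typing import Dict, List, Optional


def read_psm_rows(mztab_text: str) -> List[Dict[str, str]]:
    """Return one dict per PSM row, keyed by the column names from the PSH line."""
    header: Optional[List[str]] = None
    rows: List[Dict[str, str]] = []
    for line in mztab_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "PSH":
            header = fields[1:]  # column names of the PSM section
        elif fields[0] == "PSM" and header is not None:
            rows.append(dict(zip(header, fields[1:])))
    return rows


# Made-up sample: peptide sequences are borrowed from the PoGo slide,
# the accession value is a placeholder.
EXAMPLE_MZTAB = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tVLIPVFALGR\t1\tEXAMPLE_PROTEIN",
    "PSM\tITVLEALR\t2\tEXAMPLE_PROTEIN",
])

if __name__ == "__main__":
    for row in read_psm_rows(EXAMPLE_MZTAB):
        print(row["sequence"], row["accession"])
```

Reading columns by name from the header line, rather than by position, keeps the sketch tolerant of optional columns.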
23. Mapping peptides to the genome: PoGo
Schlaffner CN, Pirklbauer G, Bender A, Choudhary JS. PoGo. Cell Systems, 5(2):152-156.e4.
For each .pogo file:
• PTMs are standardised to a common representation using the PRIDE-Mod library.
• Each peptide references an Assay URL in PRIDE.
• Each PoGo file is generated automatically by the PRIDE Pipeline.
chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0
chr1 1454464 1454488 ITVLEALR 1000 + 1454464 1454464 128,128,128 1 24 0
chr1 1456317 1456344 LFDWANTSR 1000 + 1456317 1456317 128,128,128 1 27 0
chr1 1459184 1459211 ATLNAFLYR 1000 + 1459184 1459184 128,128,128 1 27 0
chr1 1462609 1462633 LAQFDYGR 1000 + 1462609 1462609 128,128,128 1 24 0
chr1 1485135 1485159 ITVLEALR 1000 + 1485135 1485135 128,128,128 1 24 0
chr1 1486636 1486663 LFDWANTSR 1000 + 1486636 1486636 128,128,128 1 27 0
chr1 1490572 1490596 LAQFDYGR 1000 + 1490572 1490572 128,128,128 1 24 0
chr1 1522863 1522887 ITVLEALR 1000 + 1522863 1522863 128,128,128 1 24 0
Future challenges:
• BED information can be extended with more information about transcript reliability:
  • Peptide uniqueness
  • Reliability score
• Native bigBed should be provided to remove the customization of new pipelines, etc.
• What to do with the unmapped peptides (which are long lists).
• Maintainability.
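The peptide rows shown on this slide can be read with a few lines of code. A minimal sketch, assuming the rows follow the standard 12-column BED layout (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts); the class and function names are hypothetical, not part of PoGo or the PRIDE pipeline. The check that the mapped blocks cover three nucleotides per residue is a sanity test added here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeptideFeature:
    chrom: str
    start: int
    end: int
    peptide: str
    strand: str
    block_sizes: List[int]


def parse_bed_line(line: str) -> PeptideFeature:
    """Parse one whitespace-separated BED12 row into a PeptideFeature."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    # blockSizes is a comma-separated list; tolerate a trailing comma.
    sizes = [int(size) for size in fields[10].split(",") if size]
    return PeptideFeature(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        peptide=fields[3],
        strand=fields[5],
        block_sizes=sizes,
    )


def covers_whole_peptide(feature: PeptideFeature) -> bool:
    """True if the blocks span exactly three nucleotides per amino acid."""
    return sum(feature.block_sizes) == 3 * len(feature.peptide)


if __name__ == "__main__":
    row = "chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0"
    feature = parse_bed_line(row)
    print(feature.peptide, feature.strand, covers_whole_peptide(feature))
```

A check of this kind is one simple way to flag rows that were truncated or mis-mapped before they reach a genome browser track.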
28. Overview
• Short introduction to PRIDE and ProteomeXchange
• Reuse of public proteomics data
• Integration of proteomics and genomics data
• Open analysis pipelines for proteomics data
30. How to make data analysis pipelines reproducible
• That means using:
• Exactly the same software (including the same version) in the
same order.
• The same protein sequence database (including the same
version).
• If we use the same files as input to the software, we will get
EXACTLY the same results.
• If that’s not the case, something has gone wrong.
• Computers are much more reliable than people.
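The requirements above (same software and versions in the same order, same database version, same input files) can be captured in a small manifest whose fingerprint is compared between runs. A minimal sketch under stated assumptions: the manifest layout, the function names and the tool names in the usage example are all hypothetical, not an existing PRIDE or EMBL-EBI mechanism.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def sha256_hex(data: bytes) -> str:
    """Checksum used to prove that an input file is byte-for-byte identical."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(
    inputs: Dict[str, bytes],
    tools: List[Tuple[str, str]],
    database_version: str,
) -> dict:
    """Record everything that must match for two runs to give the same results."""
    return {
        # File name -> checksum of its content.
        "inputs": {name: sha256_hex(content) for name, content in inputs.items()},
        # Ordered list: both the versions and the order of the steps matter.
        "tools": [{"name": name, "version": version} for name, version in tools],
        "database_version": database_version,
    }


def fingerprint(manifest: dict) -> str:
    """Stable digest of the manifest; equal fingerprints mean equal setups."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))


if __name__ == "__main__":
    run_a = build_manifest(
        {"sample.raw": b"spectra"},
        [("search-engine", "1.0"), ("fdr-filter", "2.3")],  # hypothetical tools
        "2018_05",
    )
    run_b = build_manifest(
        {"sample.raw": b"spectra"},
        [("search-engine", "1.1"), ("fdr-filter", "2.3")],
        "2018_05",
    )
    print(fingerprint(run_a) == fingerprint(run_b))
```

If two runs share a fingerprint but still produce different results, that points to something outside the recorded setup, which is the "something has gone wrong" case from the slide.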
31. Developing pipelines in the cloud -> DDA data
Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI "Embassy Cloud":
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
-> Connected to public proteomics data from
33. Open analysis pipelines: DIA and proteogenomics
• Pipelines for DIA approaches.
• In collaboration with the Stoller Center (Manchester) (co-PIs Graham,
Hubbard & Townsend)
• Pipelines for proteogenomics approaches (project just started).
• In collaboration with J. Choudhary (Institute of Cancer Research, London)
• Additional DDA pipelines (ELIXIR Proteomics Community).
34. Vision: total transparency and reproducibility
[Diagram: Input data -> Data analysis (analysis pipelines) -> Results]
35. Summary
• Public proteomics datasets are on the rise! Reliable (widely used)
infrastructure now exists: PRIDE and ProteomeXchange.
• A lot of possibilities open for reuse of this data.
• New purposes: proteogenomics, novel PTMs,...
• New infrastructure to integrate proteomics and genomics data
• Developing open and reproducible analysis pipelines.
• Supporting reproducible science
• Aim: to make them available to everyone in the community in the future
36. Acknowledgements: People
Yasset Perez-Riverol
Johannes Griss
Suresh Hewapathirana
Tobias Ternent
Jingwen Bai
Attila Csordas
Deepti Jaiswal
Andrew Jarnuczak
Mathias Walzer
Gerhard Mayer (de.NBI)
Former team members, especially Manuel Bernal-Linares & Henning Hermjakob
All data submitters!
@pride_ebi
@proteomexchange