Reuse of public data in proteomics

Exploring the potential of public
proteomics data
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Datasets are being reused more and more…
Vaudel et al., Proteomics, 2016
Data download volume for
PRIDE Archive in 2015: 198 TB
[Figure: bar chart of PRIDE Archive downloads in TB per year, 2013 to 2016; y-axis from 0 to 250 TB]

Data sharing in Proteomics
Vaudel et al., Proteomics, 2016

• Data as they are.
• Protein knowledge bases: UniProt, neXtProt.
• Contributing to the Protein Evidence Code.

Protein Evidence codes in UniProt/neXtProt
http://www.uniprot.org/help/protein_existence

Use of MS data in UniProt

Use of MS data in neXtProt

Reuse
• Information is not only extracted, but reused in new
experiments with the potential of generating new
knowledge.
• Transitions used in SRM approaches.
• Meta-analysis approaches.
• Spectral libraries.
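
To make the idea of SRM transitions concrete: a transition pairs a precursor m/z with a fragment m/z for a chosen peptide. The sketch below is illustrative only. It assumes unmodified peptides, standard monoisotopic residue masses, singly charged y-ions and a doubly charged precursor by default; the function names are hypothetical and are not taken from SRMAtlas or PeptidePicker.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def precursor_mz(peptide, charge=2):
    """m/z of the intact peptide at the given charge state."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def y_ion_mz(peptide, n, charge=1):
    """m/z of the y-ion formed by the last n residues of the peptide."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER
    return (mass + charge * PROTON) / charge


def transitions(peptide, precursor_charge=2):
    """List (precursor m/z, fragment label, fragment m/z) for y1..y(n-1)."""
    q1 = precursor_mz(peptide, precursor_charge)
    return [(q1, f"y{n}", y_ion_mz(peptide, n)) for n in range(1, len(peptide))]


if __name__ == "__main__":
    for q1, label, q3 in transitions("PEPTIDEK"):
        print(f"{q1:.4f} -> {label} {q3:.4f}")
```

In practice, transitions are chosen from those observed most reliably in existing data, which is where public repositories come in.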

SRMAtlas
http://www.srmatlas.org/

PeptidePicker
http://mrmpeptidepicker.proteincentre.com/

Meta-analysis approaches
• Putting data coming from a lot of experiments
together, to extract new knowledge. Examples:
• Study the cleavage mechanism and performance of
trypsin.
• Fragmentation patterns.
• Retention time prediction.
• Which is the most suitable reference DB for long-term
proteomics data storage?
• Data integration of experiments done at different time
points.
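
As an illustration of the trypsin example above, a meta-analysis of cleavage performance could start by counting missed cleavages in identified peptides. The sketch below assumes the common simplified rule (cut after K or R, but not before P); real studies look at more subtle sequence effects, and the function names here are hypothetical.

```python
import re


def tryptic_sites(protein):
    """Positions after which trypsin is expected to cut (after K/R, not before P)."""
    return [m.end() for m in re.finditer(r"[KR](?!P)", protein)]


def digest(protein, max_missed=0):
    """In silico digestion allowing up to max_missed missed cleavages."""
    inner = [s for s in tryptic_sites(protein) if s < len(protein)]
    cuts = [0] + inner + [len(protein)]
    peptides = []
    for i in range(len(cuts) - 1):
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            peptides.append(protein[cuts[i]:cuts[j]])
    return peptides


def missed_cleavages(peptide):
    """Count internal K/R residues followed by a residue other than P."""
    return len(re.findall(r"[KR](?=[^P])", peptide))


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    for pep in digest("MAAKGGRPLLKDE", max_missed=1):
        print(pep, missed_cleavages(pep))
```

Applied across many public datasets, the distribution of missed cleavages per peptide gives a rough view of digestion efficiency.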

Spectral searching
• Concept: To compare experimental spectra to other
experimental spectra.
• There are many spectral libraries publicly available (for
instance, from NIST, PeptideAtlas and PRIDE)
• Custom ‘search engines’ have been developed:
• SpectraST (TPP)
• X!Hunter (GPM)
• Bibliospec
• It has been claimed that spectral searches are more
sensitive than sequence database approaches.
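
The concept of comparing experimental spectra to other experimental spectra can be sketched with a binned, normalised dot product (cosine similarity). This is a generic illustration, not the scoring function of SpectraST, X!Hunter or Bibliospec; the 1 Da bin width, the square-root intensity scaling and all function names are assumptions made for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum square-root-scaled intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += math.sqrt(intensity)
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalised dot product between two binned spectra (0 to 1)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_library_match(query, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    scored = ((name, cosine_similarity(query, peaks)) for name, peaks in library.items())
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy (m/z, intensity) peak lists, invented for illustration.
    library = {
        "peptide_a": [(100.2, 50.0), (200.4, 100.0)],
        "peptide_b": [(300.1, 10.0)],
    }
    print(best_library_match([(100.3, 40.0), (200.1, 90.0)], library))
```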

Spectral searching (2)
http://peptide.nist.gov/

PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API

Reprocess
• Data are reprocessed with the intention of obtaining
new knowledge or to provide an updated view on the
results.
• It mainly serves the same purpose as the original
experiment.
• For instance, a shot-gun dataset can be reprocessed
with a different algorithm or an updated sequence
database.

Reprocessing repositories
• These resources collect MS raw data and reprocess it using
one given analysis pipeline and an up-to-date protein
sequence database.
• Main resources: GPMDB and PeptideAtlas (ISB, Seattle).

PeptideAtlas and GPMDB

Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous
experiments, most of them stored in proteomics repositories such as
PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.

Reprocessing for the validation of controversial data
• Analysis of Tyrannosaurus rex fossils: controversial presence of
collagen (is it a contamination of the sample? Did the sample contain
any T. rex proteins at all?)
Asara et al. (2007) Science 316: 280-5.
Asara et al. (2007) Science 317: 1324-5.
Bern et al. (2009) JPR 8: 4328-32
PRIDE Archive assay accession: 8633

Reprocessing for the validation of controversial data (2)
Bromenshenk et al. (2010) PLOS One 5: e13181
Info from R. Chalkley

Experimental Protocol
1. Collected samples from healthy, collapsing and collapsed bee colonies.
2. Homogenised the bees.
3. Digested with trypsin.
4. Analysed by LC-MS/MS on an LTQ.
5. Searched using Sequest.
6. Filtered the results using PeptideProphet and ProteinProphet.
7. Performed further analysis to determine which species were statistically
more common in collapsing/collapsed colony samples.
Bromenshenk et al. (2010) PLOS One 5: e13181

• Big pitfall: the search database contained only viral
proteins. No bee proteins at all!
• After re-searching the data, there is no evidence for viral
peptides/proteins in any of their data. The matches come instead
from honey bee, fruit fly, wasp, moth, human keratin, bacteria
that like sugary environments, …
• “We believe that there is currently insufficient evidence to
conclude that bees are a natural host for IIV-6, let alone that
the virus is linked to CCD”.
Knudsen & Chalkley (2011) PLOS One 6: e20873
Foster (2011), MCP 10: M110.006387
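
The pitfall above suggests a simple sanity check before any search: list which organisms are actually present in the sequence database. The sketch below assumes UniProt-style FASTA headers, where the organism follows `OS=`; other header formats would need a different pattern. The function names and the example headers are hypothetical.

```python
import re
from collections import Counter

# Organism name in a UniProt-style header: text after "OS=" up to the next "XX=" tag.
OS_PATTERN = re.compile(r"OS=(.+?)(?= [A-Z]{2}=|$)")


def organisms_in_fasta(lines):
    """Count FASTA entries per organism, based on the OS= field of each header."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            match = OS_PATTERN.search(line.strip())
            counts[match.group(1) if match else "unknown"] += 1
    return counts


def missing_expected(counts, expected):
    """Return the expected organisms that have no entry in the database."""
    return [
        name for name in expected
        if not any(name.lower() in organism.lower() for organism in counts)
    ]


if __name__ == "__main__":
    # Invented headers for illustration; accessions are not real.
    fasta = [
        ">tr|X00001|FAKE1 Hypothetical protein OS=Invertebrate iridescent virus 6 OX=0 PE=4 SV=1",
        "MAAKGGRPLLKDE",
        ">tr|X00002|FAKE2 Hypothetical protein OS=Invertebrate iridescent virus 6 OX=0 PE=4 SV=1",
        "MKLLDE",
    ]
    counts = organisms_in_fasta(fasta)
    print(counts)
    print("Missing:", missing_expected(counts, ["Apis mellifera", "Homo sapiens"]))
```

A database lacking the host organism and common contaminants forces every spectrum to match something else, which inflates false identifications.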

Reprocessing for the validation of controversial data
Datasets PXD000561 and PXD000865 in PRIDE Archive

Various reanalyses of these datasets have been performed…
Reanalysis of the Pandey dataset (Nature, 2014) by J. Choudhary’s group at the
Sanger Institute
Wright et al., Nat Commun, 2016
Dataset PXD000561
http://www.ebi.ac.uk/gxa

Repurposing
• Data are considered in light of a question or a context
that is different from the original study.
• Proteogenomics studies
• Discovery of novel PTMs.

Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes

Repurposing: new PTMs found
• Individual authors can reprocess raw data with new
hypotheses in mind (not taken into account by the original
authors).
• Recent examples (using phosphoproteomics data sets):
• O-GlcNAc-6-phosphate [1]
• Phosphoglyceryl [2]
• ADP-ribosylation [3]
[1] Hahne & Kuster, Mol Cell Proteomics (2012) 11(10): 1063-9
[2] Moellering & Cravatt, Science (2013) 341: 549-553
[3] Matic et al., Nat Methods (2012) 9: 771-2
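
One way to picture a search with new hypotheses in mind is a mass-shift lookup: compare the observed peptide mass with the unmodified mass and ask which candidate modification explains the difference. The sketch below is a simplification and does not reproduce the methods of the cited papers. The monoisotopic shifts are computed here from elemental compositions (an assumption to check against a curated resource such as Unimod), and the function name and tolerance are hypothetical.

```python
# Candidate monoisotopic mass shifts in Da, computed from elemental compositions
# (assumed values for illustration; verify against a curated modification database).
CANDIDATE_SHIFTS = {
    "Phospho": 79.96633,                # HPO3
    "O-GlcNAc": 203.07937,              # C8H13NO5
    "O-GlcNAc-6-phosphate": 283.04570,  # O-GlcNAc + HPO3
    "Phosphoglyceryl": 167.98238,       # C3H5O6P
    "ADP-ribosylation": 541.06111,      # C15H21N5O13P2
}


def explain_mass_shift(observed_mass, unmodified_mass, tol_da=0.01):
    """Return the candidate modifications whose shift matches the mass difference."""
    delta = observed_mass - unmodified_mass
    return [
        name for name, shift in CANDIDATE_SHIFTS.items()
        if abs(delta - shift) <= tol_da
    ]


if __name__ == "__main__":
    # Toy masses: an unmodified peptide of 1000.0 Da observed 283.0457 Da heavier.
    print(explain_mass_shift(1283.0457, 1000.0))
```

A phosphoproteomics dataset searched only for phosphorylation would leave such a spectrum unexplained, which is why reprocessing with additional candidate shifts can reveal modifications the original authors did not consider.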

CompOmics Open Source Analysis Pipeline
SearchGUI: Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L:
Proteomics 2011;11(5):996-9.
https://github.com/compomics/searchgui
PeptideShaker: Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L,
Barsnes H: Nature Biotechnology 2015; 33(1):22-4.
https://github.com/compomics/peptide-shaker

Reshake PRIDE data!
Find the desired PRIDE project …
… inspect the project details …
… and start re-analyzing the data!

Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, in press

OmicsDI: Portal for omics datasets

Acknowledgements
http://www.ncbi.nlm.nih.gov/pubmed/26449181
http://onlinelibrary.wiley.com/doi/10.1002/pmic.201500295/epdf

Questions?