Demand is growing for approaches that integrate proteomics data with other ‘omics’ data types, especially genomics. Current resources for placing proteomics results in a genome context are insufficient for large-scale studies. To bring genomics and proteomics results into coherence, new resources are needed that link previously acquired datasets with minimal preprocessing.
PRIDE (http://www.ebi.ac.uk/pride) and Ensembl (http://www.ensembl.org) are world-leading resources for proteomics and genomics data, respectively. We have developed a new resource and framework for the systematic integration of mass spectrometry (MS)-based proteomics evidence into genome browsers. Every high-quality MS-based peptidoform reported in the PRIDE database is automatically mapped to genome coordinates and made available through Ensembl and other genome browsers.
Systematic integration of millions of peptidoform evidences into Ensembl and other genome browsers
1. Yasset Perez-Riverol Ph.D
PRIDE Team
2. Meeting Sanger-EBI team
EMBL-EBI, November 2017
PRIDE Proteogenomics
• Provides a TrackHub in ENSEMBL for every ProteomeXchange COMPLETE submission.
• Provides a global TrackHub in ENSEMBL for all PRIDE peptide evidences.
3. PX Complete Trackhub in ENSEMBL
[Workflow diagram with four numbered components: (1) PX Submission Tool, (2) PRIDE submission pipelines, (3) PRIDE Archive web and API, (4) TrackHub Registry]
(1) PX submission can be Partial or Complete:
• Partial submission: RAW data, search results and peak lists.
• Complete submission: RAW data, result files, peak lists and search results.
(3) Each PX submission can be searched by:
• Title, metadata, description, tissue, taxonomy, PTMs.
• Peptide sequence or protein identifier.
(4) The TrackHub Registry can search tracks by:
• ShortLabel, LongLabel.
• OmicsType: Proteomics, Genomics, Transcriptomics.
(2) PRIDE submission pipelines
[Pipeline diagram, steps 5 and 6: PX Complete submission assays (mzid + peak lists) -> PX Complete submission assays (mztab + mgf) -> assay peptide PoGo file -> tracks per taxonomy -> TrackHub generation -> TrackHub Registry]
(5) Convert to mztab/mgf and filter out evidences that do not pass the reported mzid threshold.
(6) Store the project metadata, peptide sequences and protein identifiers in Solr and MongoDB.
4. Generating reliable peptide tables
Current filter options:
• 1% FDR at PSM level (combined results)
• 1% FDR at peptide level (combined results)
Possible filters (HPP):
• Peptides longer than 8 AA
• 1% FDR at transcript level (inference needed)
Combined PSM score:
- Same spectrum and peptide
- Different search engines
Combined peptide score:
- Same peptide
- Different PSMs
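The 1% FDR filters listed above can be illustrated with a target-decoy estimate. The sketch below is not the PRIDE pipeline code. It assumes each PSM carries a score where higher is better and a decoy flag from a target-decoy search; the names `PSM` and `filter_by_fdr` are hypothetical.

```python
from typing import List, NamedTuple


class PSM(NamedTuple):
    peptide: str
    score: float     # assumption: higher score means a better match
    is_decoy: bool   # assumption: target/decoy labels are available


def filter_by_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Keep target PSMs from the longest score-ranked prefix whose
    estimated FDR (decoys / targets) does not exceed max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = 0
    decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p.is_decoy]
```

The same function could be applied at peptide level by first collapsing PSMs to one best-scoring entry per peptide, which is one possible reading of the "combined peptide score" above.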
Experiment Peptide PSMs Quant
<a href="http://www.ebi.ac.uk/pride/archive/assays/34642">Assay 34642</a> APPLLEGAPFR 1 1.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34644">Assay 34644</a> APPLLEGAPFR 1 1.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34642">Assay 34642</a> THTQDAVPLTLGQEFSGYVQQVQYAM(oxidation)VR 1 1.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34645">Assay 34645</a> KKQVM(oxidation)EK 1 1.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34645">Assay 34645</a> VGSGDTNNFPYLEK 2 2.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34645">Assay 34645</a> SLTYLSILR 3 3.000000
<a href="http://www.ebi.ac.uk/pride/archive/assays/34642">Assay 34642</a> LPFTPLSYIQGLSHR 8 8.000000
[Figure: Protein Inference Toolkit diagram showing how peptides are assigned to protein groups; Audain et al., J Proteomics 2017]
5. Mapping peptides to ENSEMBL: PoGo
Schlaffner CN, Pirklbauer G, Bender A, Choudhary JS. PoGo: Jumping from Peptides to Genomic Loci. bioRxiv (2016)
For each .pogo file:
• PTMs are standardized to a common representation using the PRIDE-Mod library.
• Each peptide references an Assay URL in PRIDE.
• Each PoGo file is generated automatically by the PRIDE pipeline.
chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0
chr1 1454464 1454488 ITVLEALR 1000 + 1454464 1454464 128,128,128 1 24 0
chr1 1456317 1456344 LFDWANTSR 1000 + 1456317 1456317 128,128,128 1 27 0
chr1 1459184 1459211 ATLNAFLYR 1000 + 1459184 1459184 128,128,128 1 27 0
chr1 1462609 1462633 LAQFDYGR 1000 + 1462609 1462609 128,128,128 1 24 0
chr1 1485135 1485159 ITVLEALR 1000 + 1485135 1485135 128,128,128 1 24 0
chr1 1486636 1486663 LFDWANTSR 1000 + 1486636 1486636 128,128,128 1 27 0
chr1 1490572 1490596 LAQFDYGR 1000 + 1490572 1490572 128,128,128 1 24 0
chr1 1522863 1522887 ITVLEALR 1000 + 1522863 1522863 128,128,128 1 24 0
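The BED lines above can be read with a few lines of code. The sketch below is a minimal example, assuming whitespace-separated BED12 records whose name column holds the unmodified peptide sequence, as in the rows shown. `BedPeptide`, `parse_bed12` and `covers_peptide` are hypothetical names, not part of PoGo. The consistency check relies on each amino acid being encoded by three nucleotides, so the block sizes should sum to three times the peptide length.

```python
from typing import List, NamedTuple


class BedPeptide(NamedTuple):
    chrom: str
    start: int
    end: int
    peptide: str
    strand: str
    item_rgb: str
    block_sizes: List[int]


def parse_bed12(line: str) -> BedPeptide:
    """Parse one whitespace-separated BED12 line into a BedPeptide."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    # blockSizes is comma-separated and may carry a trailing comma
    sizes = [int(s) for s in fields[10].split(",") if s]
    return BedPeptide(fields[0], int(fields[1]), int(fields[2]),
                      fields[3], fields[5], fields[8], sizes)


def covers_peptide(feature: BedPeptide) -> bool:
    """True if the mapped blocks span exactly 3 nucleotides per residue."""
    return sum(feature.block_sizes) == 3 * len(feature.peptide)


line = "chr1 1314335 1314365 VLIPVFALGR 1000 - 1314335 1314335 0,0,0 1 30 0"
feature = parse_bed12(line)
print(feature.peptide, feature.strand, covers_peptide(feature))
```

Summing block sizes rather than using end minus start keeps the check valid for peptides that span a splice junction and are therefore split into several blocks.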
Future challenges:
• BED information can be extended with more information about transcript reliability.
• Peptide uniqueness.
• Reliability score.
• Native bigBed output should be provided to avoid customizing new pipelines.
• What to do with the unmapped peptides (which are long lists).
• Maintainability.
6. Getting there..
Taxonomy                   Complete submissions in PRIDE   TrackHubs created
Human                      876                             74
Mouse                      252
Arabidopsis thaliana       127
Rat                        68                              4
Yeast                      62
E. coli                    38
Bos taurus                 32
Drosophila melanogaster    21
Zebrafish                  20
Zea mays                   15
Candida albicans           12
Gallus gallus              12
Caenorhabditis elegans     10
Horse                      10
Synechocystis sp. PCC      9
Glycine max                9
7. TrackHub creation and publication
[Workflow diagram, steps 1 to 4: PoGo files and taxonomies, with trackhub-parameters.json -> TrackHub creation -> TrackHub -> TrackHub Registry]
TrackHub Creator (https://github.com/Proteogenomics/trackhub-creator):
Python framework.
(1) Interacts with the ENSEMBL API to retrieve the latest ENSEMBL release and assembly version for all species supported by PoGo.
(2) Downloads the corresponding GTF and FASTA (protein file) for each species supported by PoGo.
(3) Compiles PoGo from its source code and generates the bed/bigBed files.
(4) Creates or updates the TrackHub for a specific PX submission. A new library has been developed to interact with the ENSEMBL TrackHub Registry (https://github.com/PRIDE-Utilities/track-hub-registrator).
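As an illustration of the download step, the sketch below builds the GTF and protein FASTA locations for one species. It is not taken from trackhub-creator. The directory layout and file naming are assumptions based on the public Ensembl FTP conventions, `ensembl_ftp_urls` is a hypothetical helper, and the release number and assembly name are expected to come from the ENSEMBL API as described above.

```python
from typing import Dict

# Assumption: public Ensembl FTP root; adjust if a mirror is used.
ENSEMBL_FTP = "ftp://ftp.ensembl.org/pub"


def ensembl_ftp_urls(species: str, assembly: str, release: int,
                     base: str = ENSEMBL_FTP) -> Dict[str, str]:
    """Return assumed GTF and protein FASTA URLs for one species."""
    name = species.strip().lower().replace(" ", "_")   # e.g. homo_sapiens
    prefix = name.capitalize()                         # e.g. Homo_sapiens
    return {
        "gtf": (f"{base}/release-{release}/gtf/{name}/"
                f"{prefix}.{assembly}.{release}.gtf.gz"),
        "pep": (f"{base}/release-{release}/fasta/{name}/pep/"
                f"{prefix}.{assembly}.pep.all.fa.gz"),
    }


urls = ensembl_ftp_urls("Homo sapiens", "GRCh38", 90)
print(urls["gtf"])
print(urls["pep"])
```

Keeping URL construction separate from the download itself makes it easy to check the generated paths for all supported species before any file transfer starts.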
8. Trackhub creator
• Download the ENSEMBL protein FASTA file and GTF for a specific taxonomy: detect the taxonomy in PRIDE, then automatically query the ENSEMBL API and download the data.
• Run PoGo and generate the mapping files: bed files for peptides and modified peptides (ongoing: we will automatically generate the 1 and 2 gaps files).
• Generate the TrackHub with a specific structure (next session).
• Publish the TrackHub in the ENSEMBL Registry (https://www.trackhubregistry.org/search).
Note: we have faced many problems:
1- ENSEMBL and UCSC support only UCSC notation for almost everything:
genome hg38
trackDb hg38/trackDb.txt
Chromosomes must be named chr1, chr2, etc.
Chromosome sizes are programmatically available.
2- Some taxonomies were not supported by PoGo. We have now moved to a better list with 19 taxonomies.
3- Reconciling modifications between UNIMOD and PoGo has been difficult.
http://ftp.pride.ebi.ac.uk/pride/data/proteogenomics/latest/archive/
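The UCSC notation issues listed above can be made concrete with a short sketch. It assumes ENSEMBL-style chromosome names ("1", "X", "MT") as input and writes the `genome` / `trackDb` stanza in the form quoted in the note. The helper names and the example track values are hypothetical, and scaffolds are deliberately not handled because they cannot be renamed by a simple prefix.

```python
from typing import Iterable


def to_ucsc_chrom(name: str) -> str:
    """Convert an ENSEMBL-style chromosome name to UCSC notation.
    Assumption: only whole chromosomes are passed in, not scaffolds."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"  # UCSC names the mitochondrial genome chrM
    return "chr" + name


def genomes_txt(assemblies: Iterable[str]) -> str:
    """Build genomes.txt content with one stanza per UCSC assembly."""
    stanzas = [f"genome {a}\ntrackDb {a}/trackDb.txt" for a in assemblies]
    return "\n\n".join(stanzas) + "\n"


def trackdb_stanza(track: str, big_data_url: str,
                   short_label: str, long_label: str) -> str:
    """Build one trackDb.txt stanza for a bigBed peptide track."""
    return (f"track {track}\n"
            f"bigDataUrl {big_data_url}\n"
            f"shortLabel {short_label}\n"
            f"longLabel {long_label}\n"
            "type bigBed 12\n"
            "itemRgb on\n")


print(to_ucsc_chrom("1"), to_ucsc_chrom("MT"))
print(genomes_txt(["hg38", "mm10"]))
# Hypothetical track name, file name and labels, for illustration only.
print(trackdb_stanza("peptides", "peptides.bb", "Peptides", "All peptides"))
```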
9. Getting there..
http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr12%3A20021505-100107519&hgsid=644445905_BScrPMQymPnGtl9O7jeSv1jS5Rm0
13. PRIDE Cluster Approach
"When merging a large number of datasets coming from different groups and bioinformatics workflows, consensus spectra should be generated to evaluate the accuracy of each independent dataset."
We know results depend on the database, search engine settings, bioinformatics pipeline, etc.
Assay-24261
Assay-3394
Peptide Mascot Score: 54.06
Peptide Mascot Score: 32.86
GYTFSTTAER: 3 PSMs
GYSFTTTAER: 13918 (143)
14. PRIDE Cluster Pipeline
[Pipeline diagram: PX Complete submissions (1..n) -> PRIDE Archive import -> mgf (annotated spectra) -> clustering on a Hadoop cluster -> clustering files -> peptide export -> peptide tables (PoGo file) -> tracks per taxonomy -> TrackHub generation -> TrackHub Registry]
QC at PRIDE Archive import:
• PX successfully converted
• New peptides/PTMs
• Number of identified and non-identified spectra
QC at clustering:
• Number of new clusters
• PRIDE Cluster score distribution
• Number of clusters by modification
QC at peptide export:
• Number of peptides
• Number of new peptides
• Number of PTMs
• Number of new PTMs
15. PRIDE Cluster Peptide Evidences
[Figure panels: Human, Mouse, Arabidopsis]
The Peptide Evidence Files from PRIDE Cluster collapse all the peptide evidences from reliable clusters:
ftp://ftp.pride.ebi.ac.uk/pride/data/cluster/peptide-results/2015-04/
16. All trackhubs already published
Note: some challenges ahead:
1- Taxonomies like Saccharomyces cerevisiae are not supported in PoGo.
2- Scaffolds are not supported in PoGo with the UCSC notation. The renaming used for chromosomes (1, 2, ... to chr1, chr2, etc.) cannot be applied to scaffolds.
3- More evidences are needed.
17. Getting there … (PepBed package)
• Black (all identified peptides)
• Cyan (oxidation)
• Orange (acetyl)
• Red (phospho)
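The colour scheme above maps naturally onto the itemRgb column of a BED file. The sketch below is only an illustration, not PepBed code. The RGB values, the priority order when a peptide carries several modifications, and the `PEPTIDE(modification)` notation (borrowed from the peptide table shown earlier) are all assumptions.

```python
import re

# Assumed RGB values for the colours named on the slide, checked in
# this (assumed) priority order when several modifications are present.
MOD_COLOURS = [
    ("phospho", "255,0,0"),      # red
    ("acetyl", "255,165,0"),     # orange
    ("oxidation", "0,255,255"),  # cyan
]
DEFAULT_COLOUR = "0,0,0"         # black: any identified peptide


def item_rgb(peptide: str) -> str:
    """Pick a BED itemRgb value from modifications written in brackets,
    e.g. 'KKQVM(oxidation)EK'."""
    mods = {m.lower() for m in re.findall(r"\(([^)]+)\)", peptide)}
    for name, rgb in MOD_COLOURS:
        if name in mods:
            return rgb
    return DEFAULT_COLOUR


print(item_rgb("KKQVM(oxidation)EK"))
print(item_rgb("APPLLEGAPFR"))
```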
18. Getting there … (Comparison with APPRIS Peptidome)
APPRIS Peptidome:
• The eight studies covered a huge range of tissues and cell types:
• The peptides from the PeptideAtlas database cover 51 different tissues, cell types, and developmental stages (2016).
• The Geiger study interrogated 11 different cell types.
• NIST database.
• The peptides in the Kim and Wilhelm analyses were generated from 30 and 35 distinct tissue types (51 tissues in total).
19. PRIDE team:
Manuel Bernal-Llinares (track-hub creator)
Tobias Ternent (pride pipelines)
Johannes Griss (pride cluster pipelines)
Juan A. Vizcaino (PI)
Sanger team:
Christoph Schlaffner (pogo tool)
Jyoti Choudhary (PI)
ENSEMBL team:
Alessandro Vullo (trackhub registry)