Scaling up Spectral Clustering in the
PRIDE database
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
13th HUPO World Congress
Madrid, 8 October 2014
Overview
• Spectral clustering in PRIDE, what for?
• PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• Data repository focused on MS/MS approaches
• Main submission point of MS/MS data in the ProteomeXchange consortium
• Data is stored as originally analysed by the authors
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
Why is PRIDE Cluster needed?
• Data in PRIDE come from very heterogeneous sources: different instrumentation, search engines, analysis pipelines, etc.
• Data are of heterogeneous quality.
• We want to do quality control (QC) on top, using the information from the originally submitted results.
• Approach: spectrum clustering.
Overview
• Spectral clustering in PRIDE, what for?
• PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
PRIDE Cluster
• Quality assessment method that takes advantage of the wealth of data in PRIDE.
• Assumption: the same peptide will generate similar MS/MS spectra across many experiments.
• No data pre-filtering was done: all species and all instruments together.
• Done in June 2012.
• Clustered all identified spectra (20.7 M) in PRIDE, using a modified version of the MS-Cluster algorithm [1]. API available (http://pride-spectra-clustering.googlecode.com).
• Clusters that contain only or mainly one peptide are considered reliable.
• 601 CPU days (~2 weeks).
• The more data, the more stable the consensus spectrum.
Griss et al., Nat Methods, 2013
1. Frank et al., Nat Methods 8, 587-591 (2011)
PRIDE Cluster
[Figure: originally submitted identified spectra (labelled with the peptides NMMAACDPR and PPECPDFDPPR) are clustered and merged into a consensus spectrum.]
Threshold: at least 10 spectra in a cluster and ratio > 70%.
Griss et al., Nat Methods, 2013
1. Frank et al., Nat Methods 8, 587-591 (2011)
PRIDE Cluster – Web interface
VNPTVFFDIAVDGEPLGR
http://www.ebi.ac.uk/pride/cluster/
PRIDE Spectra clustering API is open source
https://code.google.com/p/pride-spectra-clustering/
Overview
• Spectral clustering in PRIDE, what for?
• Original PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
However…
• The amount of data in PRIDE has increased, especially at the “most popular” m/z values.
• We tried a new run of the original algorithm at the beginning of 2013.
• The original algorithm did not work for these “popular” m/z windows.
• For the other windows it took far too much time.
• We also wanted to extend the approach to unidentified spectra.
Parallelizing Spectral Clustering: Hadoop
• The key to speeding up clustering is to split up the job and run sections in parallel on different machines.
• Hadoop is a framework for parallel processing based on the Map-Reduce programming model introduced by Google.
• It is an open source implementation of Map-Reduce.
• It solves many general issues of large parallel jobs:
o Scheduling
o Inter-job communication
o Failure handling
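To make the Map-Reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not the PRIDE implementation; the function `map_reduce` and the example records are hypothetical and only mimic the map, shuffle (group by key) and reduce phases that a framework such as Hadoop distributes across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy map -> shuffle -> reduce pipeline in one process.

    mapper(record) yields (key, value) pairs.
    reducer(key, values) returns one result per key.
    """
    groups = defaultdict(list)
    # Map phase: every record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: values with the same key end up together.
            groups[key].append(value)
    # Reduce phase: each key is processed independently, which is what
    # allows a real framework to run the reducers in parallel.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def mz_bin_mapper(record):
    """Hypothetical mapper: key each spectrum by its integer m/z bin."""
    spectrum_id, precursor_mz = record
    yield int(precursor_mz), spectrum_id


def count_reducer(key, values):
    """Hypothetical reducer: count the spectra that share a key."""
    return len(values)
```

Because each key is reduced on its own, spectra that cannot belong to the same cluster never need to be held on the same machine.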
Handling Spectra first
● Filter
● Normalize
● Choose highest peaks
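The three pre-processing steps could look roughly like the Python sketch below. The slide does not say which filter or normalization is used, so the intensity cut-off (`min_intensity`), the scaling to the most intense peak, and the function name `preprocess` are all assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def preprocess(peaks: List[Peak],
               min_intensity: float = 0.0,
               top_n: int = 6) -> Tuple[List[Peak], List[Peak]]:
    """Filter, normalize, and pick the highest peaks of one spectrum.

    Returns (normalized_peaks, highest_peaks), both sorted by m/z.
    """
    # Filter: assumed here to be a simple intensity cut-off.
    kept = [(mz, inten) for mz, inten in peaks if inten > min_intensity]
    if not kept:
        return [], []
    # Normalize: assumed here to scale the most intense peak to 1.0.
    max_intensity = max(inten for _, inten in kept)
    normalized = sorted((mz, inten / max_intensity) for mz, inten in kept)
    # Choose highest peaks: keep the top_n most intense ones.
    highest = sorted(normalized, key=lambda p: p[1], reverse=True)[:top_n]
    return normalized, sorted(highest)
```

The default `top_n=6` follows the six highest peaks used as keys in the clustering steps.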
Clustering Steps
1 - Highest peak clustering
- Key: each of the six highest peaks
- Assumption: clustered spectra will share at least one of their six most intense peaks
- Sort by m/z value
- Find the best fit by dot product; add to the cluster if the similarity is large enough
- Note: every spectrum has 6 copies (one per key)
2 - Combine clusters
- Key: m/z value
- Sort by m/z value
- Find the best fit by dot product; combine if the similarity is large enough
3 - Repeat step 2
4 - Generate output
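Two building blocks of these steps can be sketched in Python: emitting one key per high-intensity peak (step 1) and scoring a pair of spectra by dot product. The bin width, the cosine normalization, and the names `map_by_top_peaks` and `normalized_dot` are assumptions for illustration; the slides do not specify how the dot product is computed.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def map_by_top_peaks(spectrum_id: str, peaks: List[Peak],
                     top_n: int = 6, bin_width: float = 1.0):
    """Emit one (peak-bin key, spectrum id) pair per high-intensity peak.

    With top_n=6 each spectrum is emitted six times, so two spectra that
    share any of their six most intense peaks meet under the same key.
    """
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return [(int(mz / bin_width), spectrum_id) for mz, _ in top]


def bin_peaks(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum intensities into fixed-width m/z bins (an assumed binning)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def normalized_dot(a: List[Peak], b: List[Peak],
                   bin_width: float = 1.0) -> float:
    """Cosine-normalized dot product of two binned spectra, in [0, 1]."""
    va, vb = bin_peaks(a, bin_width), bin_peaks(b, bin_width)
    dot = sum(value * vb.get(key, 0.0) for key, value in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A spectrum would be added to the best-scoring cluster only if `normalized_dot` exceeds a chosen threshold; the threshold itself is one of the parameters the talk tunes on test datasets.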
[Figure: sliding m/z window. Spectra are processed in order of m/z. State before processing: in-memory spectra, the latest spectrum, and out-of-window spectra. Processing the next spectrum: drop clusters that are out of the window, then compare and either merge or add.]
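The sliding-window procedure in the figure can be sketched as follows. This is a simplified illustration under stated assumptions: the input is already sorted by m/z, a cluster keeps the m/z of its first member, and the first member stands in for the consensus spectrum when comparing. The function name `cluster_sorted` and its parameters are hypothetical.

```python
def cluster_sorted(spectra, window, threshold, similarity):
    """Greedy clustering over spectra sorted by m/z.

    spectra:    iterable of (mz, spectrum) pairs, sorted by mz
    window:     maximum m/z distance for a cluster to stay in memory
    threshold:  minimum similarity required to merge
    similarity: function(spectrum_a, spectrum_b) -> float
    Returns a list of clusters, each a dict with "mz" and "members".
    """
    active = []    # clusters still inside the m/z window
    finished = []  # clusters that dropped out of the window
    for mz, spectrum in spectra:
        # Drop clusters that are out of the window.
        still_active = []
        for cluster in active:
            if mz - cluster["mz"] <= window:
                still_active.append(cluster)
            else:
                finished.append(cluster)
        active = still_active
        # Compare the new spectrum with every in-memory cluster.
        best, best_score = None, threshold
        for cluster in active:
            score = similarity(cluster["members"][0], spectrum)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            # Add: no cluster is similar enough, so start a new one.
            active.append({"mz": mz, "members": [spectrum]})
        else:
            # Merge: join the best-fitting cluster.
            best["members"].append(spectrum)
    return finished + active
```

Only clusters inside the window are held in memory, so memory use depends on the density of spectra around the current m/z rather than on the total number of spectra.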
Assessing the new PRIDE Cluster-H algorithm
Test datasets:
• ‘HUPO’ dataset (HUPO PPP and BPP, several publications).
• ‘COPD’ dataset (PMID: 19357784).
• ‘CPTAC’ dataset, study 6 (PMID: 19837981).
Used to find the best parameters for the spectrum clustering:
- Dot product threshold to merge clusters.
- Number of iterations.
Run done in September 2014
• It contains 54.2 million identified spectra in PRIDE (public data).
• In total there are ~95 million identified spectra in PRIDE.
• Approx. 760 CPU days -> ~2.5 days run time.
• Results available in the PRIDE Cluster web interface.
• The results will be available in PRIDE Archive very soon.
Result from the clustering of the whole of PRIDE
Rating definitions:
3 - size >= 10 and ratio >= 0.7 => reliable PSM
2 - size < 10 and ratio >= 0.7 => potentially reliable PSM
1 - ratio < 0.7 => unreliable PSM

Rating   Total number
1        11,051,042
2        8,925,445
3        21,156,575

[Figure: precursor m/z range in clusters]
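The rating rules above translate directly into a small function. One assumption is made here: "ratio" is taken to be the fraction of spectra in the cluster that were identified as the cluster's most common peptide, which matches the earlier statement that clusters containing only or mainly one peptide are considered reliable. The function name `cluster_rating` is hypothetical.

```python
from collections import Counter
from typing import List


def cluster_rating(peptides: List[str],
                   min_size: int = 10,
                   min_ratio: float = 0.7) -> int:
    """Rate a cluster from the peptide assigned to each of its spectra.

    3 = reliable, 2 = potentially reliable, 1 = unreliable.
    """
    size = len(peptides)
    if size == 0:
        raise ValueError("a cluster must contain at least one spectrum")
    # Assumed definition: share of the most frequent peptide in the cluster.
    most_common_count = Counter(peptides).most_common(1)[0][1]
    ratio = most_common_count / size
    if ratio < min_ratio:
        return 1
    return 3 if size >= min_size else 2
```

For example, a cluster of eight NMMAACDPR spectra and two PPECPDFDPPR spectra has size 10 and ratio 0.8, so it is rated 3.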
Detection of annotation errors
Cluster with 255 PSMs
In 11 PSMs, one PTM was not reported
PRIDE Cluster web
https://www.ebi.ac.uk/pride/cluster/
Search by peptide sequence or provide a peak list
PRIDE Cluster web
PRIDE Cluster – Web interface
[Figure: web interface results for peptide VNPTVFFDIAVDGEPLGR, with panels labelled 2012 and 2014.]
Information soon available in PRIDE Archive web
• Information to be added in the new PRIDE web interface at the PSM level
• List of reliable PSMs to be downloadable
PRIDE Cluster – New spectral libraries
http://www.ebi.ac.uk/pride/cluster/libraries

Species         Spectra (2012)   Spectra (2014)   2014 as % of 2012
Homo sapiens    81428            265495           326.05%
Mouse           53376            120524           225.80%
Rat             17624            42512            241.22%
Arabidopsis     69242            115260           166.46%
Drosophila      4686             28621            610.78%
Saccharomyces   71487            14153            19.80%
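The percentage column can be reproduced from the two count columns: each value is the 2014 count divided by the 2012 count, expressed as a percentage, so values below 100% (Saccharomyces) indicate a smaller library. The short Python check below recomputes the column from the counts in the table; the names `LIBRARY_SIZES` and `percent_of_2012` are introduced only for this illustration.

```python
# Spectral library sizes copied from the table: species -> (2012, 2014).
LIBRARY_SIZES = {
    "Homo sapiens": (81428, 265495),
    "Mouse": (53376, 120524),
    "Rat": (17624, 42512),
    "Arabidopsis": (69242, 115260),
    "Drosophila": (4686, 28621),
    "Saccharomyces": (71487, 14153),
}


def percent_of_2012(count_2012: int, count_2014: int) -> float:
    """Return the 2014 count as a percentage of the 2012 count."""
    return round(100.0 * count_2014 / count_2012, 2)


def percent_table() -> dict:
    """Recompute the percentage column for every species."""
    return {species: percent_of_2012(*counts)
            for species, counts in LIBRARY_SIZES.items()}
```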
For the near future
• Very soon: scores will be visible in the PRIDE Archive page, associated with the specific PSMs.
• A lot of data mining to do…
• Clustering of unidentified spectra as well.
• Around 450 M total spectra at present in PRIDE (and this is only considering PX ‘complete’ submissions).
• Transfer annotation from identified to unidentified spectra.
• Find clusters of spectra that could not be originally identified: exotic PTMs? SNPs?...
• Plans to move to Hadoop 2.0 (YARN).
Conclusions
• Spectral clustering is used in PRIDE as a way to assess the quality of PSMs.
• Original PRIDE Cluster algorithm:
• It worked very well, but was not able to cope with the data increase in PRIDE.
• PRIDE Cluster-H algorithm:
• Based on the Hadoop open source framework (parallelization).
• It maintains accuracy and has been used for 54.2 M spectra (~2.5 days).
Acknowledgements
PRIDE Team
Johannes Griss
Rui Wang
Steve Lewis
Henning Hermjakob
juan@ebi.ac.uk
@juan_vizcaino
Questions?

More Related Content

Similar to Scaling Spectral Clustering in PRIDE Database

The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
MS Imaging data in ProteomeXchange (HUPO 2014)
MS Imaging data in ProteomeXchange (HUPO 2014)MS Imaging data in ProteomeXchange (HUPO 2014)
MS Imaging data in ProteomeXchange (HUPO 2014)Juan Antonio Vizcaino
 
Interoperability is the key: repositories networks promoting the quality and ...
Interoperability is the key: repositories networks promoting the quality and ...Interoperability is the key: repositories networks promoting the quality and ...
Interoperability is the key: repositories networks promoting the quality and ...Pedro Príncipe
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
Towards an e-infrastructure in agriculture?
Towards an e-infrastructure in agriculture?Towards an e-infrastructure in agriculture?
Towards an e-infrastructure in agriculture?Blue BRIDGE
 
Virtual Research Environments as-a-serive
Virtual Research Environments as-a-seriveVirtual Research Environments as-a-serive
Virtual Research Environments as-a-seriveBlue BRIDGE
 
Smart opendata results of pilots & user groups final
Smart opendata results of pilots & user groups finalSmart opendata results of pilots & user groups final
Smart opendata results of pilots & user groups finalMartin Tuchyna
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)OpenAIRE
 
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...OpenAIRE
 
The benefits of the PILAR federation for engineering education
The benefits of the PILAR federation for engineering educationThe benefits of the PILAR federation for engineering education
The benefits of the PILAR federation for engineering educationManuel Castro
 
Benchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office DatasetBenchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office DatasetGhislain Atemezing
 
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...Pedro Príncipe
 
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...OpenAIRE
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateJuan Antonio Vizcaino
 
Open air eplus@portal_webinar
Open air eplus@portal_webinarOpen air eplus@portal_webinar
Open air eplus@portal_webinarOpenAIRE
 
A user journey in OpenAIRE services through the lens of repository managers -...
A user journey in OpenAIRE services through the lens of repository managers -...A user journey in OpenAIRE services through the lens of repository managers -...
A user journey in OpenAIRE services through the lens of repository managers -...OpenAIRE
 

Similar to Scaling Spectral Clustering in PRIDE Database (20)

The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
MS Imaging data in ProteomeXchange (HUPO 2014)
MS Imaging data in ProteomeXchange (HUPO 2014)MS Imaging data in ProteomeXchange (HUPO 2014)
MS Imaging data in ProteomeXchange (HUPO 2014)
 
Interoperability is the key: repositories networks promoting the quality and ...
Interoperability is the key: repositories networks promoting the quality and ...Interoperability is the key: repositories networks promoting the quality and ...
Interoperability is the key: repositories networks promoting the quality and ...
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
Towards an e-infrastructure in agriculture?
Towards an e-infrastructure in agriculture?Towards an e-infrastructure in agriculture?
Towards an e-infrastructure in agriculture?
 
GBIF Work Programme 2016 Update
GBIF Work Programme 2016 UpdateGBIF Work Programme 2016 Update
GBIF Work Programme 2016 Update
 
Virtual Research Environments as-a-serive
Virtual Research Environments as-a-seriveVirtual Research Environments as-a-serive
Virtual Research Environments as-a-serive
 
Smart opendata results of pilots & user groups final
Smart opendata results of pilots & user groups finalSmart opendata results of pilots & user groups final
Smart opendata results of pilots & user groups final
 
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
Open Research Gateway for the ELIXIR-GR Infrastructure (Part 2)
 
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...
OpenAIRE - Implementing Open Science (presentation by Natalia Manola at Food ...
 
The benefits of the PILAR federation for engineering education
The benefits of the PILAR federation for engineering educationThe benefits of the PILAR federation for engineering education
The benefits of the PILAR federation for engineering education
 
Pride Cluster 062016 Update
Pride Cluster 062016 UpdatePride Cluster 062016 Update
Pride Cluster 062016 Update
 
Benchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office DatasetBenchmarking Commercial RDF Stores with Publications Office Dataset
Benchmarking Commercial RDF Stores with Publications Office Dataset
 
The European Open Science Cloud
The European Open Science CloudThe European Open Science Cloud
The European Open Science Cloud
 
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
 
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
OpenAIRE services and tools for researchers/authors and projects (FOSTER work...
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 update
 
Euro lipids 2014_graz
Euro lipids 2014_grazEuro lipids 2014_graz
Euro lipids 2014_graz
 
Open air eplus@portal_webinar
Open air eplus@portal_webinarOpen air eplus@portal_webinar
Open air eplus@portal_webinar
 
A user journey in OpenAIRE services through the lens of repository managers -...
A user journey in OpenAIRE services through the lens of repository managers -...A user journey in OpenAIRE services through the lens of repository managers -...
A user journey in OpenAIRE services through the lens of repository managers -...
 

More from Juan Antonio Vizcaino

Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Juan Antonio Vizcaino
 
Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formatsJuan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Juan Antonio Vizcaino
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Juan Antonio Vizcaino
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Juan Antonio Vizcaino
 

More from Juan Antonio Vizcaino (20)

Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...
 
Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formats
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
PRIDE and ProteomeXchange
PRIDE and ProteomeXchangePRIDE and ProteomeXchange
PRIDE and ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017
 

Recently uploaded

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 

Recently uploaded (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Is RISC-V ready for HPC workload? Maybe?

Scaling Spectral Clustering in PRIDE Database

  • 1. Scaling up Spectral Clustering in the PRIDE database Dr. Juan Antonio Vizcaíno PRIDE Group Coordinator Proteomics Services Team EMBL-EBI Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk 13th HUPO World Congress Madrid, 8 October 2014 Overview • Spectral clustering in PRIDE, what for? • PRIDE Cluster algorithm • PRIDE Cluster H algorithm
  • 3. PRIDE (PRoteomics IDEntifications) database http://www.ebi.ac.uk/pride • Data repository focused on MS/MS approaches • Main submission point of MS/MS data in the ProteomeXchange consortium • Data is stored as originally analysed by the authors (Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2013)
  • 4. Why is PRIDE Cluster needed? • Data in PRIDE come from very heterogeneous sources: different instrumentation, search engines, analysis pipelines, etc. • Data are of heterogeneous quality • We want to do QC on top, using the information from the originally submitted results. • Spectrum clustering approach
  • 5. Overview • Spectral clustering in PRIDE, what for? • PRIDE Cluster algorithm • PRIDE Cluster H algorithm
  • 6. PRIDE Cluster • Quality assessment method that takes advantage of the wealth of data in PRIDE. • Assumption: the same peptide will generate similar MS/MS spectra across many experiments. • No data pre-filtering done. All species, all instruments together. • Done in June 2012. • Cluster all identified spectra (20.7 M) in PRIDE (modified version of the MS-Cluster algorithm [1]). API available (http://pride-spectra-clustering.googlecode.com). • Those clusters which contain only/mainly one peptide are considered reliable. • 601 CPU days (~2 weeks). • The more data, the more stable the consensus spectrum. Griss et al., Nat Methods, 2013. [1] Frank et al., Nat Methods 8, 587-591 (2011)
  • 7. PRIDE Cluster [Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) grouped into clusters, each with a consensus spectrum] • Threshold: at least 10 spectra in a cluster and ratio >70%. Griss et al., Nat Methods, 2013. [1] Frank et al., Nat Methods 8, 587-591 (2011)
  • 8. PRIDE Cluster – Web interface VNPTVFFDIAVDGEPLGR http://www.ebi.ac.uk/pride/cluster/
  • 9. PRIDE Spectra clustering API is open source https://code.google.com/p/pride-spectra-clustering/
  • 10. Overview • Spectral clustering in PRIDE, what for? • Original PRIDE Cluster algorithm • PRIDE Cluster H algorithm
  • 11. However… • The amount of data in PRIDE has increased, especially at the “most popular” m/z values • We tried a new run of the original algorithm at the beginning of 2013 • The original algorithm did not work for these “popular” m/z windows • For the other windows it took far too much time. • We also wanted to extend the approach to unidentified spectra.
  • 12. Parallelizing Spectral Clustering: Hadoop • The key to speeding up clustering is to split up the job and run sections in parallel on different machines. • Hadoop is a framework for parallelism based on the MapReduce model introduced by Google. • It is an open source implementation of MapReduce • It solves many general issues of large parallel jobs: scheduling, inter-job communication, failure handling
  • 13. Handling Spectra first • Filter • Normalize • Choose highest peaks
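The three preprocessing steps named on this slide can be sketched as follows. This is only an illustration: the slide does not say what is filtered, how intensities are normalized, or how many peaks are kept, so the positive-intensity filter, the base-peak normalization, and the `top_n` parameter are all assumptions, and `preprocess` is a hypothetical name rather than a function from the PRIDE clustering API.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def preprocess(peaks: List[Peak], top_n: int = 50) -> List[Peak]:
    """Filter, normalize, and keep the highest peaks of one spectrum.

    All concrete choices here are illustrative assumptions.
    """
    # Filter: drop peaks without a positive intensity (assumed criterion).
    kept = [(mz, inten) for mz, inten in peaks if inten > 0]
    if not kept:
        return []

    # Normalize: scale so that the most intense peak has intensity 1.0.
    base = max(inten for _, inten in kept)
    kept = [(mz, inten / base) for mz, inten in kept]

    # Choose highest peaks: keep the top_n most intense ones.
    kept.sort(key=lambda p: p[1], reverse=True)
    kept = kept[:top_n]

    # Return in m/z order, which is convenient for later comparisons.
    kept.sort(key=lambda p: p[0])
    return kept
```

For example, `preprocess([(100.0, 10.0), (200.0, 0.0), (300.0, 5.0)], top_n=2)` drops the zero-intensity peak and rescales the remaining two.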
  • 14. Clustering steps • Step 1, highest-peak clustering: the key is each of the six highest peaks (assumption: clustered spectra will share at least one of their 6 most intense peaks); sort by m/z value; find the best fit by dot product and cluster if it is large enough. • Step 2, combine clusters: the key is the m/z value; sort by m/z value; find the best fit by dot product and combine if it is large enough. • Step 3: repeat step 2. • Step 4: generate output. • Note: every spectrum has 6 copies.
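Step 1 of this slide can be sketched in plain Python as a map phase (one copy of each spectrum per top-peak key) followed by a reduce phase (greedy grouping by dot product). This is a simplified illustration, not the Hadoop implementation: the 1 m/z binning, the 0.7 threshold, the use of the first member as cluster representative instead of a consensus spectrum, and all function names are assumptions.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Collapse (m/z, intensity) peaks into a sparse vector of m/z bins."""
    vec = defaultdict(float)
    for mz, inten in peaks:
        vec[int(mz / bin_width)] += inten
    return vec


def dot_similarity(a, b, bin_width=1.0):
    """Normalized dot product (cosine) between two peak lists."""
    va, vb = binned(a, bin_width), binned(b, bin_width)
    num = sum(v * vb.get(k, 0.0) for k, v in va.items())
    den = math.sqrt(sum(v * v for v in va.values())) * \
        math.sqrt(sum(v * v for v in vb.values()))
    return num / den if den else 0.0


def highest_peak_keys(peaks, n=6):
    """Keys for the map phase: rounded m/z of the n most intense peaks."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted({int(round(mz)) for mz, _ in top})


def map_step(spectra):
    """Emit each spectrum id once per key (up to 6 copies per spectrum)."""
    groups = defaultdict(list)
    for spec_id, peaks in spectra.items():
        for key in highest_peak_keys(peaks):
            groups[key].append(spec_id)
    return groups


def reduce_step(spec_ids, spectra, threshold=0.7):
    """Greedy clustering of the spectra that share one key."""
    clusters = []  # each cluster is a list of ids; first id is representative
    for sid in spec_ids:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = dot_similarity(spectra[sid], spectra[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([sid])   # nothing similar enough: new cluster
        else:
            best.append(sid)         # join the best-fitting cluster
    return clusters
```

Because each spectrum is emitted under several keys, the same spectrum can end up in clusters from different reducers; merging those overlapping clusters is what the later "combine clusters" steps would have to handle.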
  • 15. [Figure: m/z window. State before processing: in-memory spectra, out-of-window spectra and the latest spectrum along the m/z axis. Processing the next spectrum: drop clusters out of window, then add, or compare and merge.]
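The sliding m/z window from this slide can be sketched as a single pass over spectra sorted by precursor m/z, keeping only clusters inside the window in memory. The window width, the threshold, the cluster representation, and the `similarity` callback are all assumptions made for illustration; the slide does not give these details.

```python
def windowed_clustering(spectra, similarity, window=2.0, threshold=0.7):
    """Cluster spectra in one pass using a sliding precursor m/z window.

    spectra: iterable of (precursor_mz, spectrum), sorted by precursor_mz.
    similarity: function(spectrum, spectrum) -> float, e.g. a dot product.
    """
    active = []    # clusters still inside the window (kept in memory)
    finished = []  # clusters that have left the window

    for mz, spec in spectra:
        # Drop clusters out of window: they can no longer gain members.
        still_active = []
        for cluster in active:
            if mz - cluster["mz"] <= window:
                still_active.append(cluster)
            else:
                finished.append(cluster)
        active = still_active

        # Compare the latest spectrum with every in-memory cluster.
        best, best_sim = None, threshold
        for cluster in active:
            sim = similarity(spec, cluster["members"][0])
            if sim >= best_sim:
                best, best_sim = cluster, sim

        if best is None:
            active.append({"mz": mz, "members": [spec]})  # add
        else:
            best["members"].append(spec)                  # merge

    finished.extend(active)
    return finished
```

The point of the design is that memory use depends on how many spectra fall inside one window rather than on the size of the whole dataset.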
  • 16. Assessing the new PRIDE-H algorithm • ‘HUPO’ dataset (HUPO PPP and BPP, several publications). • ‘COPD’ dataset (PMID: 19357784). • ‘CPTAC’ dataset study 6 (PMID: 19837981). • Used to find the best parameters for the spectrum clustering: dot product threshold to merge clusters; number of iterations.
  • 17. Run done in September 2014 • It contains 54.2 million identified spectra in PRIDE (public data) • In total there are ~95 million identified spectra in PRIDE. • Approx. 760 CPU days -> ~2.5 days run time. • Results available in the PRIDE Cluster web interface • The results will be available in PRIDE Archive very soon.
  • 18. Result from the clustering of the whole of PRIDE • Rating 3: size >= 10 & ratio >= 0.7 => reliable PSM • Rating 2: ratio >= 0.7 & size < 10 => potentially reliable • Rating 1: ratio < 0.7 => unreliable PSM • Total number per rating: rating 1: 11,051,042; rating 2: 8,925,445; rating 3: 21,156,575 • [Figure: precursor m/z range in clusters]
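The three rating rules on this slide translate directly into a small function. In the sketch below, `ratio` is assumed to be the fraction of spectra in a cluster that carry the cluster's most common peptide identification, in line with the "only/mainly one peptide" criterion mentioned earlier in the talk; the function names are hypothetical.

```python
from collections import Counter


def cluster_ratio(peptides):
    """Fraction of spectra identified as the cluster's most common peptide."""
    if not peptides:
        return 0.0
    most_common_count = Counter(peptides).most_common(1)[0][1]
    return most_common_count / len(peptides)


def rate_cluster(size, ratio):
    """Rating from the slide: 3 reliable, 2 potentially reliable, 1 unreliable."""
    if ratio < 0.7:
        return 1
    if size >= 10:
        return 3
    return 2
```

A cluster of ten spectra in which eight share one peptide would have `cluster_ratio` 0.8 and therefore rating 3; the same ratio in a cluster of five spectra would give rating 2.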
  • 19. Detection of annotation errors • Cluster with 255 PSMs • In 11 PSMs, one PTM was not reported
  • 20. PRIDE Cluster web https://www.ebi.ac.uk/pride/cluster/ • Search by peptide sequence or provide a peak list
  • 21. PRIDE Cluster web
  • 22. PRIDE Cluster – Web interface [Figure: VNPTVFFDIAVDGEPLGR, 2014 and 2012]
  • 23. Information soon available in PRIDE Archive web • Information to be added in the new PRIDE web interface at the PSM level • List of reliable PSMs to be downloadable
  • 24. PRIDE Cluster – Examples: new spectral libraries http://www.ebi.ac.uk/pride/cluster/libraries • Number of spectra per species (2012 / 2014 / increase): Homo sapiens 81428 / 265495 / 326.05%; Mouse 53376 / 120524 / 225.80%; Rat 17624 / 42512 / 241.22%; Arabidopsis 69242 / 115260 / 166.46%; Drosophila 4686 / 28621 / 610.78%; Saccharomyces 71487 / 14153 / 19.80%
  • 25. For the near future • Very soon: scores will be visible in the PRIDE Archive page, associated with the specific PSMs. • A lot of data mining to do… • Clustering of unidentified spectra as well. • Around 450 M total spectra at present in PRIDE (and this is only considering PX ‘complete’ submissions). • Transfer annotation from identified to unidentified spectra. • Find clusters of spectra that could not be originally identified: exotic PTMs? SNPs?... • Plans to move to Hadoop 2.0 (YARN).
  • 26. Conclusions • Spectral clustering is used in PRIDE as a way to assess the quality of PSMs • Original PRIDE Cluster algorithm: it worked very well, but was not able to cope with the data increase in PRIDE. • PRIDE Cluster-H algorithm: based on the Hadoop open source framework (parallelization). It maintains accuracy and it has been used for 54.2 M spectra (~2.5 days).
  • 27. Acknowledgements • PRIDE Team • Johannes Griss • Rui Wang • Steve Lewis • Henning Hermjakob • juan@ebi.ac.uk • @juan_vizcaino
  • 28. Questions?