Scaling Spectral Clustering in PRIDE Database

Scaling up Spectral Clustering in the PRIDE database
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
13th HUPO World Congress
Madrid, 8 October 2014
Overview
• Spectral clustering in PRIDE, what for?
• PRIDE Cluster algorithm
• PRIDE Cluster H algorithm

PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• Data repository focused on MS/MS approaches
• Main submission point of MS/MS data in the ProteomeXchange consortium
• Data is stored as originally analysed by the authors
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013

Why is PRIDE Cluster needed?
• Data in PRIDE come from very heterogeneous sources: different instrumentation, search engines, analysis pipelines, etc.
• Data are of heterogeneous quality.
• We want to do QC on top, using the information from the originally submitted results.
• Approach: spectrum clustering.

PRIDE Cluster
• Quality assessment method that takes advantage of the wealth of data in PRIDE.
• Assumption: the same peptide will generate similar MS/MS spectra across many experiments.
• No data pre-filtering done. All species, all instruments together.
• Done in June 2012.
• All identified spectra in PRIDE (20.7 M) were clustered with a modified version of the MS-Cluster algorithm [1]. API available (http://pride-spectra-clustering.googlecode.com).
• Clusters that contain only or mainly one peptide are considered reliable.
• 601 CPU days (~2 weeks).
• The more data, the more stable the consensus spectrum.
Griss et al., Nat Methods, 2013
[1] Frank et al., Nat Methods 8, 587-591 (2011)

PRIDE Cluster
[Figure: originally submitted identified spectra (labelled NMMAACDPR five times and PPECPDFDPPR twice) grouped into one cluster with its consensus spectrum]
Threshold: at least 10 spectra in a cluster and ratio >70%.
Griss et al., Nat Methods, 2013
[1] Frank et al., Nat Methods 8, 587-591 (2011)
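The "ratio" in the threshold can be read as the share of spectra in a cluster that carry the most common peptide identification. A minimal sketch under that assumption follows; the function name is hypothetical, and the example list reuses the sequence labels as they appear in the slide text (five and two).

```python
from collections import Counter

def cluster_ratio(peptides):
    """Return (most common peptide, fraction of spectra identified as it).

    `peptides` holds one peptide identification per spectrum in the cluster.
    """
    counts = Counter(peptides)
    peptide, n = counts.most_common(1)[0]
    return peptide, n / len(peptides)

# Sequence labels from the slide's example cluster (assumed to be one per spectrum)
example = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"] * 2
top_peptide, ratio = cluster_ratio(example)
```

With these seven labels the ratio is 5/7, just above 70%, but the cluster is smaller than the 10-spectrum minimum quoted on the slide.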

PRIDE Cluster – Web interface
VNPTVFFDIAVDGEPLGR
http://www.ebi.ac.uk/pride/cluster/

PRIDE Spectra clustering API is open source
https://code.google.com/p/pride-spectra-clustering/

Overview
• Spectral clustering in PRIDE, what for?
• Original PRIDE Cluster algorithm
• PRIDE Cluster H algorithm

However…
• The amount of data in PRIDE keeps increasing, especially at the “most popular” m/z values.
• We tried a new run of the original algorithm at the beginning of 2013.
• The original algorithm did not work for these “popular” m/z windows.
• For the other windows it took far too much time.
• We also wanted to extend the approach to unidentified spectra.

Parallelizing Spectral Clustering: Hadoop
• The key to speeding up clustering is to split up the job and
run sections in parallel on different machines.
• Hadoop is a framework for parallel processing based on the MapReduce programming model introduced by Google.
• It is an open source implementation of MapReduce.
• It solves many general issues of large parallel jobs:
o Scheduling
o Inter-job communication
o Failure handling
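To illustrate the map, shuffle and reduce phases mentioned above, here is a single-process Python sketch. It is not Hadoop code: the `map_reduce` helper and the example job (counting spectra per integer m/z bin) are hypothetical, and Hadoop would run the same phases distributed across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group by key
    # reduce phase, one call per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def bin_mapper(precursor_mz):
    # Hypothetical key: the integer part of the m/z value
    yield int(precursor_mz), 1

def count_reducer(key, values):
    return sum(values)

precursors = [400.2, 400.7, 401.1, 523.3]
counts = map_reduce(precursors, bin_mapper, count_reducer)
```

Because each key is reduced independently, crowded m/z bins can be handled by different workers.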

Handling Spectra first
● Filter
● Normalize
● Choose highest peaks
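The slide does not say which filter or normalization is used, so the following is only a sketch of the three steps under assumed choices: an intensity floor as the filter, scaling to unit Euclidean length as the normalization, and a hypothetical `top_n` parameter for peak picking.

```python
import math

def preprocess(peaks, min_intensity=0.0, top_n=6):
    """peaks: list of (mz, intensity). Returns (normalized_peaks, top_peaks)."""
    # Filter: drop peaks at or below the (assumed) intensity floor
    kept = [(mz, inten) for mz, inten in peaks if inten > min_intensity]
    if not kept:
        return [], []
    # Normalize: scale intensities so the spectrum has unit Euclidean length
    norm = math.sqrt(sum(inten * inten for _, inten in kept))
    normalized = [(mz, inten / norm) for mz, inten in kept]
    # Choose highest peaks: the top_n most intense, returned in m/z order
    by_intensity = sorted(normalized, key=lambda p: p[1], reverse=True)
    top = sorted(by_intensity[:top_n])
    return normalized, top

normalized, top = preprocess([(100.0, 3.0), (200.0, 4.0), (300.0, 0.0)], top_n=1)
```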

Clustering Steps
1 - Highest peak clustering
- Key: each of the six highest peaks
- Assumption: clustered spectra will share at least one of their six most intense peaks
- Note: every spectrum therefore has 6 copies
- Sort by m/z value
- Find the best fit by dot product; cluster if the score is large enough
2 - Combine clusters
- Key: m/z value
- Sort by m/z value
- Find the best fit by dot product; combine if the score is large enough
3 - Repeat step 2
4 - Generate output
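Step 1 and the dot-product comparison can be sketched as follows. This is an illustration, not the PRIDE Cluster code: the integer m/z binning, the `bin_width` parameter and both function names are assumptions.

```python
import math

def top_peak_keys(peaks, n=6, bin_width=1.0):
    """Return one key per top-n peak (binned m/z).

    A spectrum is emitted once per key, so it has up to n copies.
    """
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted({int(mz / bin_width) for mz, _ in top})

def dot_product(a, b, bin_width=1.0):
    """Normalized dot product of two peak lists after binning on m/z."""
    def binned(peaks):
        vec = {}
        for mz, inten in peaks:
            k = int(mz / bin_width)
            vec[k] = vec.get(k, 0.0) + inten
        return vec
    va, vb = binned(a), binned(b)
    num = sum(v * vb.get(k, 0.0) for k, v in va.items())
    den = (math.sqrt(sum(v * v for v in va.values()))
           * math.sqrt(sum(v * v for v in vb.values())))
    return num / den if den else 0.0

spectrum = [(100.1, 1), (150.2, 9), (200.3, 8), (250.4, 7),
            (300.5, 6), (350.6, 5), (400.7, 4), (450.8, 2)]
keys = top_peak_keys(spectrum)
```

Two spectra are compared only if they share a key, which is what the assumption about the six most intense peaks buys.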

[Figure: sliding m/z window over spectra sorted by m/z. State before process: in-memory spectra, latest spectrum, out-of-window spectrum. Process next spectrum: drop clusters out of the window, then compare and either merge or add.]
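The windowed pass shown in the figure can be sketched like this. It is a simplified illustration: the `window` and `threshold` defaults are made up, each cluster is anchored at the m/z of its first member, and the first member stands in for the consensus spectrum.

```python
def window_cluster(spectra, similarity, window=2.0, threshold=0.7):
    """spectra: iterable of (mz, payload) sorted by mz.

    Returns a list of clusters, each a list of payloads.
    """
    active = []     # clusters still inside the m/z window: (anchor_mz, members)
    finished = []   # clusters that have left the window
    for mz, payload in spectra:
        # Drop clusters that are out of the window
        still_active = []
        for anchor_mz, members in active:
            if mz - anchor_mz > window:
                finished.append(members)
            else:
                still_active.append((anchor_mz, members))
        active = still_active
        # Compare with the in-memory clusters and pick the best match
        best, best_score = None, threshold
        for idx, (_, members) in enumerate(active):
            score = similarity(members[0], payload)
            if score >= best_score:
                best, best_score = idx, score
        if best is None:
            active.append((mz, [payload]))      # add as a new cluster
        else:
            active[best][1].append(payload)     # merge into the best cluster
    finished.extend(members for _, members in active)
    return finished

same = lambda a, b: 1.0 if a == b else 0.0
clusters = window_cluster(
    [(400.0, "A"), (400.5, "A"), (401.0, "B"), (410.0, "A")], same)
```

Only the clusters inside the window are held in memory, so memory use depends on how crowded an m/z region is rather than on the total number of spectra.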

Assessing the new PRIDE-H algorithm
‘HUPO’ dataset (HUPO PPP and BPP, several publications).
‘COPD’ dataset (PMID: 19357784).
‘CPTAC’ dataset study 6 (PMID: 19837981).
Used to find the best parameters for the spectrum clustering:
- Dot product threshold to merge clusters.
- Number of iterations.

Run done in September 2014
• It contains 54.2 million identified spectra in PRIDE (public data).
• In total there are ~95 million identified spectra in PRIDE.
• Approx. 760 CPU days -> ~2.5 days run time.
• Results available in the PRIDE Cluster web interface.
• The results will be available in PRIDE Archive very soon.
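As a rough reading of these figures, dividing CPU time by run time gives the average number of cores kept busy. This assumes the CPU-day totals cover the whole job and that "~2 weeks" for the 2012 run means about 14 days of wall-clock time; the function name is hypothetical.

```python
def implied_cores(cpu_days, wall_days):
    """Average number of busy cores implied by CPU time and wall-clock time."""
    return cpu_days / wall_days

cores_2014 = implied_cores(760, 2.5)   # September 2014 Hadoop run
cores_2012 = implied_cores(601, 14)    # June 2012 run, "~2 weeks" taken as 14 days
```

On these assumptions the 2014 run averaged about 304 busy cores against about 43 for the 2012 run.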

Result from the clustering of the whole of PRIDE
Rating  Criteria                    Meaning                Total number
3       size >= 10 & ratio >= 0.7   reliable PSM           21,156,575
2       size < 10 & ratio >= 0.7    potentially reliable    8,925,445
1       ratio < 0.7                 unreliable PSM         11,051,042
[Figure: precursor m/z range in clusters]
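The rating rules translate directly into a small function, sketched below. The function name is hypothetical; the counts are the totals listed on the slide, and the shares are computed relative to their sum.

```python
def rate(size, ratio):
    """Rating of a cluster from its size and the ratio of its main peptide."""
    if ratio < 0.7:
        return 1                      # unreliable
    return 3 if size >= 10 else 2     # reliable / potentially reliable

# Totals per rating as listed on the slide
totals = {1: 11_051_042, 2: 8_925_445, 3: 21_156_575}
grand_total = sum(totals.values())
shares = {rating: n / grand_total for rating, n in totals.items()}
```

Roughly half of the rated entries fall in the top rating.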

Detection of annotation errors
Cluster with 255 PSMs
In 11 PSMs, one PTM was not reported

PRIDE Cluster web
https://www.ebi.ac.uk/pride/cluster/
Search by peptide sequence or provide a peak list

PRIDE Cluster – Web interface
VNPTVFFDIAVDGEPLGR
[Figure: panels labelled 2014 and 2012]

Information soon available in PRIDE Archive web
Information to be added in the new PRIDE web interface at the PSM level
List of reliable PSMs to be downloadable

PRIDE Cluster – Examples
PRIDE Cluster – New spectral libraries
http://www.ebi.ac.uk/pride/cluster/libraries
Number of spectra per species:
Species         2012     2014     2014 as % of 2012
Homo sapiens    81428    265495   326.05%
Mouse           53376    120524   225.80%
Rat             17624    42512    241.22%
Arabidopsis     69242    115260   166.46%
Drosophila      4686     28621    610.78%
Saccharomyces   71487    14153    19.80%

For the near future
• Very soon: Scores will be visible in PRIDE Archive page,
associated with the specific PSMs.
• A lot of data mining to do…
• Clustering of unidentified spectra as well.
• Around 450 M total spectra at present in PRIDE (and this is only
considering PX ‘complete’ submissions).
• Transfer annotation from identified to unidentified spectra.
• Find clusters of spectra that could not be originally identified:
exotic PTMs? SNPs?...
• Plans to move to Hadoop 2.0 (YARN).

Conclusions
• Spectral clustering is used in PRIDE as a way to assess the
quality of PSMs
• Original PRIDE Cluster algorithm:
• It worked very well, but was not able to cope with data increase in
PRIDE.
• PRIDE Cluster-H algorithm:
• Based on the Hadoop open source framework (parallelization).
• It maintains accuracy and it has been used for 54.2 M spectra
(~2.5 days).

Acknowledgements
PRIDE Team
Johannes Griss
Rui Wang
Steve Lewis
Henning Hermjakob
juan@ebi.ac.uk
@juan_vizcaino

Questions?
