Dr. Juan Antonio Vizcaíno presents two algorithms for spectral clustering in the PRIDE proteomics database: PRIDE Cluster and PRIDE Cluster H. PRIDE Cluster originally clustered over 20 million spectra but did not scale well with increasing data. PRIDE Cluster H was developed to address this using the Hadoop parallel processing framework. It clustered over 54 million spectra from PRIDE in around 2.5 days. The algorithms aim to assess identification quality and find consensus spectra across experiments. Cluster results will be made available in the PRIDE Archive and used for annotation transfer and identification of previously unidentified spectra.
Scaling Spectral Clustering in PRIDE Database
1. Scaling up Spectral Clustering in the PRIDE database
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
13th HUPO World Congress
Madrid, 8 October 2014
Overview
• Spectral clustering in PRIDE, what for?
• PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
3. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• Data repository focused on MS/MS approaches
• Main submission point of MS/MS data in the ProteomeXchange consortium
• Data is stored as originally analysed by the authors
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
4. Why is PRIDE Cluster needed?
• Data in PRIDE come from very heterogeneous sources: different instrumentation, search engines, analysis pipelines, etc.
• Data quality is heterogeneous as well.
• We want to run quality control (QC) on top, using the information from the originally submitted results.
• Approach: spectrum clustering.
5. Overview
• Spectral clustering in PRIDE, what for?
• PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
6. PRIDE Cluster
• Quality assessment method that takes advantage of the wealth of data in PRIDE.
• Assumption: the same peptide will generate similar MS/MS spectra across many experiments.
• No data pre-filtering was done: all species and all instruments were clustered together.
• Run in June 2012.
• All identified spectra in PRIDE (20.7 M) were clustered with a modified version of the MS-Cluster algorithm [1]. API available (http://pride-spectra-clustering.googlecode.com).
• Clusters that contain only or mainly one peptide are considered reliable.
• 601 CPU days (~2 weeks).
• The more data, the more stable the consensus spectrum.
Griss et al., Nat Methods, 2013
[1] Frank et al., Nat Methods 8, 587-591 (2011)
7. PRIDE Cluster
Griss et al., Nat Methods, 2013
[1] Frank et al., Nat Methods 8, 587-591 (2011)
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR in the example) are grouped into clusters, each with a consensus spectrum.]
Threshold: at least 10 spectra in a cluster and ratio >70%.
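The threshold on this slide can be sketched in a few lines. The slide does not define "ratio"; the sketch assumes it is the share of spectra in a cluster that were identified as the cluster's most frequent peptide. The function names and the example cluster are hypothetical, not part of the PRIDE Cluster API.

```python
from collections import Counter


def cluster_ratio(peptides):
    """Return (most frequent peptide, its share of the cluster).

    Assumption: 'ratio' means the fraction of spectra in the cluster
    that carry the most frequent peptide identification.
    """
    if not peptides:
        raise ValueError("cluster has no identified spectra")
    top_peptide, count = Counter(peptides).most_common(1)[0]
    return top_peptide, count / len(peptides)


def is_reliable(peptides, min_size=10, min_ratio=0.7):
    """Apply the slide's threshold: at least 10 spectra and ratio > 70%."""
    _, ratio = cluster_ratio(peptides)
    return len(peptides) >= min_size and ratio > min_ratio


# Hypothetical cluster: 12 spectra, 9 of them identified as NMMAACDPR.
cluster = ["NMMAACDPR"] * 9 + ["PPECPDFDPPR"] * 3
print(cluster_ratio(cluster), is_reliable(cluster))
```

A cluster of only five spectra fails the size test even when every spectrum agrees on the peptide.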
8. PRIDE Cluster – Web interface
VNPTVFFDIAVDGEPLGR
http://www.ebi.ac.uk/pride/cluster/
9. PRIDE Spectra clustering API is open source
https://code.google.com/p/pride-spectra-clustering/
10. Overview
• Spectral clustering in PRIDE, what for?
• Original PRIDE Cluster algorithm
• PRIDE Cluster H algorithm
11. However…
• The amount of data in PRIDE has increased, especially at the “most popular” m/z values.
• We tried a new run of the original algorithm at the beginning of 2013.
• The original algorithm did not work for these “popular” m/z windows.
• For the other windows it took far too much time.
• We also wanted to extend the approach to unidentified spectra.
12. Parallelizing spectral clustering: Hadoop
• The key to speeding up clustering is to split up the job and run sections in parallel on different machines.
• Hadoop is a framework for parallelism built on the MapReduce model introduced by Google.
• It is an open source implementation of MapReduce.
• It solves many general issues of large parallel jobs:
o Scheduling
o Inter-job communication
o Failure handling
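The MapReduce idea can be illustrated without Hadoop itself. The sketch below imitates the three phases (map, shuffle by key, reduce) in a single Python process; it is not the Hadoop API, and the spectra, the integer m/z binning and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Run the reducer once per key; each key could go to a different machine."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical input: (spectrum id, precursor m/z).
spectra = [("s1", 400.2), ("s2", 400.7), ("s3", 523.3), ("s4", 523.9), ("s5", 610.1)]


def bin_mapper(spectrum):
    """Key each spectrum by its integer m/z bin (an assumed partitioning)."""
    spectrum_id, precursor_mz = spectrum
    yield int(precursor_mz), spectrum_id


def collect_reducer(key, spectrum_ids):
    """Placeholder reducer: a real job would cluster the spectra in this bin."""
    return sorted(spectrum_ids)


result = reduce_phase(shuffle(map_phase(spectra, bin_mapper)), collect_reducer)
print(result)
```

Because each key is reduced independently, the bins can be processed in parallel, which is the property the slide relies on.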
13. Handling spectra first
● Filter
● Normalize
● Choose highest peaks
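A minimal sketch of these three steps follows. The slide gives no parameters, so the noise cutoff, the normalization to the most intense peak and the number of peaks kept are all assumptions, and `preprocess` is a hypothetical name.

```python
def preprocess(peaks, min_intensity=1.0, n_peaks=3):
    """Filter, normalize and keep the highest peaks of one spectrum.

    peaks: list of (m/z, intensity) tuples.
    All thresholds here are illustrative assumptions.
    """
    # 1. Filter: drop peaks below an assumed noise cutoff.
    kept = [(mz, inten) for mz, inten in peaks if inten >= min_intensity]
    if not kept:
        return []
    # 2. Normalize: scale so the most intense peak has intensity 1.0.
    base = max(inten for _, inten in kept)
    normalized = [(mz, inten / base) for mz, inten in kept]
    # 3. Choose the highest peaks, then return them in m/z order.
    highest = sorted(normalized, key=lambda p: p[1], reverse=True)[:n_peaks]
    return sorted(highest)


raw = [(100.0, 0.5), (150.0, 20.0), (200.0, 80.0), (250.0, 40.0), (300.0, 10.0)]
print(preprocess(raw))
```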
14. Clustering steps
1 - Highest peak clustering
- Key: each of the six highest peaks
- Assumption: clustered spectra will share at least one of their 6 most intense peaks
- Sort by m/z value
- Find the best fit by dot product; cluster if it is large enough
2 - Combine clusters
- Key: m/z value
- Sort by m/z value
- Find the best fit by dot product; combine if it is large enough
3 - Repeat step 2
4 - Generate output
Note: every spectrum has 6 copies.
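Step 1 can be sketched as follows: each spectrum is emitted once per key, where the keys come from its six most intense peaks, and candidate matches are scored with a dot product. The integer m/z binning and the cosine-style normalization are assumptions; the slide only says "dot product". Function names are hypothetical.

```python
import math
from collections import defaultdict


def top_peak_keys(peaks, n=6):
    """Integer m/z bins of the n most intense peaks.

    One copy of the spectrum is emitted per key, which is why every
    spectrum ends up with (up to) six copies.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted({int(round(mz)) for mz, _ in strongest})


def binned(peaks):
    """Sum intensities per integer m/z bin (assumed bin width of 1)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(round(mz))] += intensity
    return bins


def normalized_dot(peaks_a, peaks_b):
    """Dot product of two binned spectra, scaled to the range 0..1."""
    a, b = binned(peaks_a), binned(peaks_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


spectrum = [(101.1, 5), (202.2, 50), (303.3, 30), (404.4, 80),
            (505.5, 10), (606.6, 60), (707.7, 20), (808.8, 40)]
print(top_peak_keys(spectrum))
print(normalized_dot(spectrum, spectrum))
```

Two spectra that share none of these keys are never compared, which is what keeps the number of pairwise dot products manageable.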
15. [Figure: sliding m/z window over spectra sorted by m/z. State before processing: the spectra held in memory, the latest spectrum, and spectra that have fallen out of the window. Processing the next spectrum: drop clusters that are out of the m/z window, then compare the spectrum with the clusters in memory and either merge it into one of them or add it as a new cluster.]
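The windowed pass shown in the figure can be sketched as a greedy loop over spectra sorted by m/z. This is a simplified illustration under stated assumptions: the first member stands in for the consensus spectrum, the window width and threshold are arbitrary, and a toy Jaccard score replaces the dot product. None of the names come from the PRIDE code base.

```python
def cluster_sorted_spectra(spectra, similarity, window=2.0, threshold=0.7):
    """Greedy clustering of (m/z, payload) pairs sorted by ascending m/z."""
    active = []    # clusters still inside the m/z window
    finished = []  # clusters that have left the window
    for mz, payload in spectra:
        # Drop clusters that are out of the window for this spectrum.
        still_active = []
        for cluster in active:
            if mz - cluster["mz"] <= window:
                still_active.append(cluster)
            else:
                finished.append(cluster)
        active = still_active
        # Compare with every cluster in memory and keep the best fit.
        best, best_score = None, threshold
        for cluster in active:
            score = similarity(cluster["representative"], payload)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            # Add: no cluster is similar enough, so start a new one.
            active.append({"mz": mz, "representative": payload, "members": [payload]})
        else:
            # Merge: simplification, the representative is not updated.
            best["members"].append(payload)
    return finished + active


def jaccard(a, b):
    """Toy similarity on sets of peak bins, standing in for the dot product."""
    return len(a & b) / len(a | b)


spectra = [(400.0, {1, 2, 3}), (400.5, {1, 2, 3}), (401.0, {7, 8, 9}), (405.0, {1, 2, 3})]
clusters = cluster_sorted_spectra(spectra, jaccard)
print([len(c["members"]) for c in clusters])
```

The spectrum at m/z 405.0 matches the first cluster's peaks but starts a new cluster, because that cluster has already left the window. Only the clusters inside the window need to stay in memory.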
16. Assessing the new PRIDE Cluster H algorithm
Test datasets:
• ‘HUPO’ dataset (HUPO PPP and BPP, several publications).
• ‘COPD’ dataset (PMID: 19357784).
• ‘CPTAC’ dataset, study 6 (PMID: 19837981).
Used to find the best parameters for the spectrum clustering:
- Dot product threshold to merge clusters.
- Number of iterations.
17. Run done in September 2014
• It covers 54.2 million identified spectra in PRIDE (public data).
• In total there are ~95 million identified spectra in PRIDE.
• Approx. 760 CPU days -> ~2.5 days run time.
• Results available in the PRIDE Cluster web interface.
• The results will be available in PRIDE Archive very soon.
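The run figures on this slide imply a rough degree of parallelism. The arithmetic below only combines the approximate numbers quoted above, so the results are order-of-magnitude estimates; it assumes the 760 CPU days were all spent within the 2.5-day run.

```python
cpu_days = 760        # approximate, from the slide
wall_days = 2.5       # approximate, from the slide
spectra = 54.2e6      # identified spectra clustered

# Average number of CPUs kept busy during the run.
avg_parallelism = cpu_days / wall_days

# Spectra processed per hour of run time.
spectra_per_wall_hour = spectra / (wall_days * 24)

# CPU time spent per spectrum, in seconds.
cpu_seconds_per_spectrum = cpu_days * 86400 / spectra

print(avg_parallelism, round(spectra_per_wall_hour), round(cpu_seconds_per_spectrum, 2))
```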
18. Result from the clustering of the whole of PRIDE
Rating  Criteria                    Meaning                Total number
3       size >= 10 & ratio >= 0.7   reliable PSM           21,156,575
2       size < 10 & ratio >= 0.7    potentially reliable   8,925,445
1       ratio < 0.7                 unreliable PSM         11,051,042
[Figure: precursor m/z range in clusters]
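The three rating rules on this slide translate directly into a small function. The sketch also sums the totals reported on the slide to express each rating as a share of the rated total; `rate_cluster` is a hypothetical name.

```python
def rate_cluster(size, ratio):
    """Rating as defined on the slide: 3 reliable, 2 potentially reliable, 1 unreliable."""
    if ratio < 0.7:
        return 1
    return 3 if size >= 10 else 2


# Totals per rating, copied from the slide.
totals = {1: 11_051_042, 2: 8_925_445, 3: 21_156_575}
rated = sum(totals.values())
shares = {rating: count / rated for rating, count in totals.items()}

for rating in (3, 2, 1):
    print(rating, f"{totals[rating]:,}", f"{shares[rating]:.1%}")
```

For example, `rate_cluster(255, 0.95)` returns 3, while a cluster of five spectra with the same ratio is rated 2.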
19. Detection of annotation errors
Cluster with 255 PSMs
In 11 PSMs, one PTM was not reported
20. PRIDE Cluster web
https://www.ebi.ac.uk/pride/cluster/
Search by peptide sequence or provide a peak list
23. Information soon available in PRIDE Archive web
Information to be added in the new PRIDE web interface at the PSM level
List of reliable PSMs to be downloadable
24. PRIDE Cluster – Examples: new spectral libraries
http://www.ebi.ac.uk/pride/cluster/libraries
Species         Spectra 2012   Spectra 2014   2014 as % of 2012
Homo sapiens    81428          265495         326.05%
Mouse           53376          120524         225.80%
Rat             17624          42512          241.22%
Arabidopsis     69242          115260         166.46%
Drosophila      4686           28621          610.78%
Saccharomyces   71487          14153          19.80%
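The percentage column is reproduced by dividing the 2014 count by the 2012 count, so a value below 100% (Saccharomyces) means the 2014 library is smaller than the 2012 one. A short check, using only the counts from the table:

```python
# (spectra in 2012 library, spectra in 2014 library), copied from the table.
libraries = {
    "Homo sapiens": (81428, 265495),
    "Mouse": (53376, 120524),
    "Rat": (17624, 42512),
    "Arabidopsis": (69242, 115260),
    "Drosophila": (4686, 28621),
    "Saccharomyces": (71487, 14153),
}

# 2014 count expressed as a percentage of the 2012 count.
percent_of_2012 = {
    species: 100 * n2014 / n2012 for species, (n2012, n2014) in libraries.items()
}

for species, value in percent_of_2012.items():
    print(f"{species:15s}{value:8.2f}%")
```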
25. For the near future
• Very soon: scores will be visible on the PRIDE Archive page, associated with the specific PSMs.
• A lot of data mining to do…
• Clustering of unidentified spectra as well.
• Around 450 M total spectra at present in PRIDE (and this only counts PX ‘complete’ submissions).
• Transfer annotation from identified to unidentified spectra.
• Find clusters of spectra that could not be originally identified: exotic PTMs? SNPs?...
• Plans to move to Hadoop 2.0 (YARN).
26. Conclusions
• Spectral clustering is used in PRIDE as a way to assess the quality of PSMs.
• Original PRIDE Cluster algorithm:
• It worked very well, but could not cope with the data increase in PRIDE.
• PRIDE Cluster H algorithm:
• Based on the Hadoop open source framework (parallelization).
• It maintains accuracy and has been used to cluster 54.2 M spectra (~2.5 days).
27. Acknowledgements
PRIDE Team
Johannes Griss
Rui Wang
Steve Lewis
Henning Hermjakob
juan@ebi.ac.uk
@juan_vizcaino