Pride Cluster 062016 Update

Presentation given as a seminar for the group of Kathryn Lilley on 20 June 2016. Some slides are borrowed from Johannes Griss and Henning Hermjakob.

Published in: Science

  1. Spectrum clustering of PRIDE MS/MS data
     Dr. Juan Antonio Vizcaíno, Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK
  2. Juan A. Vizcaíno juan@ebi.ac.uk Seminar 20 June 2016
     Overview
     • Brief introduction to PRIDE
     • PRIDE Cluster: motivation and concept, first implementation
     • PRIDE Cluster second implementation
  4. PRIDE (PRoteomics IDEntifications) database
     http://www.ebi.ac.uk/pride
     • PRIDE stores mass spectrometry (MS)-based proteomics data:
       • Peptide and protein expression data (identification and quantification)
       • Post-translational modifications
       • Mass spectra (raw data and peak lists)
       • Technical and biological metadata
       • Any other related information
     • Full support for tandem MS approaches
     Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  5. ProteomeXchange: a global, distributed proteomics database
     • PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data)
     • [Figure labels: Raw, ID/Q, Meta]
     • 155 datasets/month since July 2015
     • Mandatory raw data deposition since July 2015
  6. Overview
     • Brief introduction to PRIDE
     • PRIDE Cluster: motivation and concept, first implementation
     • PRIDE Cluster second implementation
  7. Motivation
     • Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)
     • Heterogeneous quality, difficult to make the data comparable
     • Enable assessment of (published) proteomics data
     • Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)
  8. PRIDE Cluster - Concept (Griss et al., Nat Methods, 2013)
     [Figure labels: Originally submitted identified spectra (NMMAACDPR, PPECPDFDPPR); Consensus spectrum]
     • Threshold: at least 10 spectra in a cluster and ratio >70%.
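The threshold on this slide can be read as a simple rule over the peptide sequences assigned to the spectra in one cluster. The sketch below is illustrative only: the slide does not define "ratio", so it is assumed here to be the share of the most frequent sequence among the identified spectra of the cluster, and the function names `dominant_sequence` and `is_reliable` are hypothetical, not part of the PRIDE Cluster code base.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def dominant_sequence(peptides: Sequence[str]) -> Tuple[Optional[str], float]:
    """Return the most frequent peptide sequence and its share of the cluster.

    `peptides` holds one submitted sequence per identified spectrum.
    Assumption: "ratio" on the slide means this share.
    """
    if not peptides:
        return None, 0.0
    sequence, count = Counter(peptides).most_common(1)[0]
    return sequence, count / len(peptides)


def is_reliable(
    peptides: Sequence[str],
    min_spectra: int = 10,    # slide: "at least 10 spectra in a cluster"
    min_ratio: float = 0.70,  # slide: "ratio >70%"
) -> bool:
    """Apply the size and ratio thresholds quoted on the slide."""
    _, ratio = dominant_sequence(peptides)
    return len(peptides) >= min_spectra and ratio > min_ratio


if __name__ == "__main__":
    # Toy cluster using the two sequences shown in the slide figure.
    cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
    print(dominant_sequence(cluster), is_reliable(cluster))
```

Keeping `min_spectra` as a parameter makes it easy to try other cluster-size cut-offs without changing the rule itself.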
  9. PRIDE Cluster - Concept
  10. PRIDE Cluster: Implementation
      • Griss et al., Nat. Methods 2013
      • Clustered all public, identified spectra in PRIDE
      • EBI compute farm, LSF
      • 20.7 M identified spectra
      • 610 CPU days, two calendar weeks
      • Validation, calibration
      • Feedback into PRIDE datasets
  11. Overview
      • Brief introduction to PRIDE
      • PRIDE Cluster: motivation and concept, first implementation
      • PRIDE Cluster second implementation
  12. PRIDE Archive
      • World-leading repository for MS/MS-based proteomics data
      • Founding member of ProteomeXchange
  13. PRIDE Cluster
      [Figure labels: Sequence-based search engines; Spectrum clustering; Incorrectly or unidentified spectra]
  14. PRIDE Cluster: Second Implementation
      First implementation (Griss et al., Nat. Methods 2013):
      • Clustered all public, identified spectra in PRIDE
      • EBI compute farm, LSF
      • 20.7 M identified spectra
      • 610 CPU days, two calendar weeks
      • Validation, calibration
      • Feedback into PRIDE datasets
      Second implementation (Griss et al., Nat. Methods 2016, in press):
      • Clustered all public spectra in PRIDE by April 2015
      • Apache Hadoop
      • Starting with 256 M spectra:
        • 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
        • 66 M identified spectra
      • Result: 28 M clusters
      • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
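The figures quoted for the second implementation can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a result from the talk: it assumes that the clustering input was the 111 M filtered unidentified spectra plus the 66 M identified spectra, and that all 340 cores could have been busy for the full 5 days, which gives only an upper bound on the compute used.

```python
# Figures quoted on the slide (M = million).
TOTAL_SPECTRA = 256_000_000
UNIDENTIFIED = 190_000_000
UNIDENTIFIED_FILTERED = 111_000_000
IDENTIFIED = 66_000_000
CLUSTERS = 28_000_000
CALENDAR_DAYS = 5
CPU_CORES = 340
FIRST_IMPLEMENTATION_CPU_DAYS = 610

# The unidentified and identified counts add up to the starting total.
assert UNIDENTIFIED + IDENTIFIED == TOTAL_SPECTRA

# Assumption: only filtered unidentified plus identified spectra were clustered.
clustered_input = UNIDENTIFIED_FILTERED + IDENTIFIED

# Average cluster size under that assumption.
spectra_per_cluster = clustered_input / CLUSTERS

# Upper bound on core-days, assuming every core was busy for all 5 days.
core_days_upper_bound = CALENDAR_DAYS * CPU_CORES

if __name__ == "__main__":
    print(f"clustered input: {clustered_input / 1e6:.0f} M spectra")
    print(f"average cluster size: {spectra_per_cluster:.1f} spectra")
    print(f"core-days (upper bound): {core_days_upper_bound}")
    print(f"first implementation: {FIRST_IMPLEMENTATION_CPU_DAYS} CPU days")
```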
  15. PRIDE Cluster - Concept (Griss et al., Nat Methods, 2016)
      [Figure labels: Originally submitted identified spectra (NMMAACDPR, PPECPDFDPPR); Consensus spectrum]
      • Threshold: at least 3 spectra in a cluster and ratio >70%.
  16. PRIDE Cluster - Concept
  19. Output of the analysis
      1. Inconsistent spectrum clusters
      2. Clusters including identified and unidentified spectra
      3. Clusters containing only unidentified spectra
  20. (1) Re-analysis of inconsistent clusters
      [Figure labels: Originally submitted identified spectra (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK); Consensus spectrum]
      • No sequence has a proportion in the cluster >50%
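The three output categories, together with the 50% rule for inconsistent clusters, can be sketched as a small classifier. This is a minimal illustration under stated assumptions: proportions are computed over the identified spectra only, the inconsistency check takes priority over the identified/unidentified split, and the function name `categorize` and the category labels are hypothetical. The slides do not say how clusters with a dominant share between 50% and 70% are treated, so the sketch does not try to.

```python
from collections import Counter
from typing import List, Optional


def categorize(
    identifications: List[Optional[str]],
    inconsistent_max_ratio: float = 0.50,  # slide: no sequence has a proportion >50%
) -> str:
    """Assign one cluster to an output category.

    `identifications` has one entry per spectrum; None marks an
    unidentified spectrum.
    """
    identified = [seq for seq in identifications if seq is not None]
    if not identified:
        return "unidentified_only"
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= inconsistent_max_ratio:
        return "inconsistent"
    if len(identified) < len(identifications):
        return "identified_and_unidentified"
    return "identified_only"


if __name__ == "__main__":
    # Toy cluster loosely based on the sequences shown on the slide:
    # the most frequent sequence reaches only 4 of 8 spectra (50%).
    toy = [
        "NMMAACDPR", "NMMAACDPR", "IGGIGTVPVGR", "NMMAACDPR",
        "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK", "NMMAACDPR", "IGGIGTVPVGR",
    ]
    print(categorize(toy))
```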
  21. (1) Re-analysis of inconsistent clusters
      • Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
      • 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
      • In these cases, it is likely that a contaminants database was not used in the search.
  22. Validation
  28. (2) Inferring identifications for originally unidentified spectra
      • 9.1 M unidentified spectra were contained in clusters with a reliable identification.
      • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
      • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
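The inference step described above, where unidentified spectra inherit a candidate identification from a reliable cluster, might look like the following sketch. It is an illustration only: the `Spectrum` class and `candidate_identifications` function are hypothetical names, the thresholds are taken from the second-implementation concept slide (at least 3 spectra, ratio >70%), and it is assumed that cluster size counts all spectra while the ratio is computed over identified spectra. The results are candidates that still need confirmation, as the slide stresses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Spectrum:
    spectrum_id: str
    peptide: Optional[str]  # None marks an unidentified spectrum


def candidate_identifications(
    cluster: List[Spectrum],
    min_spectra: int = 3,     # assumed to count all spectra in the cluster
    min_ratio: float = 0.70,  # assumed to be computed over identified spectra
) -> Dict[str, str]:
    """Map unidentified spectrum IDs to the cluster's dominant sequence.

    Returns an empty dict when the cluster does not pass the thresholds.
    """
    identified = [s.peptide for s in cluster if s.peptide is not None]
    if len(cluster) < min_spectra or not identified:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {s.spectrum_id: sequence for s in cluster if s.peptide is None}


if __name__ == "__main__":
    cluster = [
        Spectrum("s1", "NMMAACDPR"),
        Spectrum("s2", "NMMAACDPR"),
        Spectrum("s3", "NMMAACDPR"),
        Spectrum("s4", None),
    ]
    print(candidate_identifications(cluster))
```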
  29. (3) Consistently unidentified clusters
      • 19 M clusters contain only unidentified spectra.
      • 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
      • Most are likely to be derived from peptides.
      • They could correspond to PTMs or variant peptides.
      • With various methods, we found likely identifications for about 20%.
      • A vast amount of data mining remains to be done.
  30. (3) Consistently unidentified clusters
  31. PRIDE Cluster as a Public Data Mining Resource
      • http://www.ebi.ac.uk/pride/cluster
      • Spectral libraries for 16 species.
      • All clustering results, as well as specific subsets of interest, are available.
      • Source code (open source) and Java API
  33. Applications of spectrum clustering…
      • In individual datasets or small groups of “similar” datasets:
        • Can be used to target spectra that are “consistently” unidentified.
        • Unidentified spectra could represent PTMs or sequence variants.
        • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
      • When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
      • For quantification purposes, could the alignment of different runs be improved by clustering the spectra first?
  34. Acknowledgements
      Johannes Griss, Rui Wang, Yasset Perez-Riverol, Steve Lewis, Henning Hermjakob, Open MS team (led by O. Kohlbacher), David Tabb, and the rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes
  35. Questions?
