Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals

The Royal Society of Chemistry has hosted the ChemSpider database and associated platforms for over five years. Technology has made significant progress over that period but, more importantly, community needs have grown, both in the variety of data types and in search performance. The preprocessing of chemicals for improved similarity searching and compound database navigation is seen as one crucial component of major development efforts to architect a new data repository. This component is being engineered and implemented in collaboration with the group of Professor Oliver Kohlbacher at the University of Tübingen. They have developed an approach for clustering large chemical libraries based on a fast, parallel, and purely CPU-based algorithm for 2D binary fingerprint similarity calculation. Using this method, the complete similarity network of our seed set of tens of millions of chemicals has been analyzed at a Tanimoto threshold of 0.6, and all similarity links were fed into our database. Storing these links is highly beneficial: it will allow us to create more complex and enriching visualizations of similar compounds, with associated bioactivity data and physicochemical properties, for users of the RSC chemical repository. This presentation will provide an overview of our experiences in applying clustering to our compound data and how it will be used to enrich data navigation on the RSC data repository.
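
As a rough illustration of the idea in the abstract, the sketch below computes Tanimoto similarity links above a threshold and groups compounds into connected components. It is a naive, single-threaded O(n²) toy, not the fast parallel algorithm developed by the Tübingen group. The function names, the `>=` comparison at the 0.6 threshold, and the fingerprints `Z` and `W` are assumptions made for illustration; `X` and `Y` reuse the two 8-bit example vectors shown on slide 19 of the transcript.

```python
from itertools import combinations


def to_feature_list(bits):
    """Return the indices of the 1-bits (the 'feature list' representation)."""
    return [i for i, b in enumerate(bits) if b]


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_links(fingerprints, threshold):
    """Naive all-pairs scan: keep (i, j, similarity) for pairs at or above threshold."""
    links = []
    for (i, a), (j, b) in combinations(enumerate(fingerprints), 2):
        s = tanimoto(a, b)
        if s >= threshold:
            links.append((i, j, s))
    return links


def connected_components(n, links):
    """Cluster nodes 0..n-1 with union-find over the similarity links."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Toy 8-bit fingerprints. X and Y are the example vectors from slide 19;
# Z and W are hypothetical near-duplicates added so that links exist.
X = [0, 1, 1, 0, 1, 1, 0, 1]
Y = [0, 1, 0, 1, 0, 1, 1, 0]
Z = [0, 1, 1, 0, 1, 1, 0, 0]  # hypothetical: X with bit 7 cleared
W = [0, 1, 0, 1, 0, 1, 1, 1]  # hypothetical: Y with bit 7 set

fps = [set(to_feature_list(v)) for v in (X, Y, Z, W)]
links = similarity_links(fps, threshold=0.6)
clusters = connected_components(len(fps), links)
print(links)
print(clusters)
```

In this toy set, X and Y share 2 of 7 distinct on-bits (Tanimoto 2/7), so they are not linked at 0.6, while each is linked to its hypothetical near-duplicate. At the scale described in the abstract, an all-pairs scan like this would be impractical, which is why a specialised parallel method is needed.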

Published in: Science

  • Transcript

    • 1. Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals. Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor. ACS 248th National Meeting, San Francisco, CA, August 14th 2014
    • 2. Chemical space - 10^60
    • 3. Navigation in chemical space
    • 4. Clustering
    • 5. Science dimensions
    • 6. ~30 million chemicals and growing; data sourced from >500 different sources; crowdsourced curation and annotation; ongoing deposition of data from our journals and our collaborators; a structure-centric hub for web searching
    • 7. ChemSpider
    • 8. Properties
    • 9. Classification
    • 10. ChemSpider Data Slices
    • 11. Tagging in ChemSpider
    • 12. RSC Archive – since 1841
    • 13. DERA - Digitally Enabling RSC Archive
    • 14. Twelve broad categories
    • 15. Twelve broad categories. Largest category is 30 times the size of the smallest
    • 16. 200 subcategories
    • 17. How does it work? Latent Semantic Analysis to build feature sets for (1) articles (2) categories. Features: words, citations and pairs of words. Domain experts (Journal Development staff) build a category vector. All articles with a cosine similarity greater than an adjustable threshold go into the category.
    • 18. RSC Data Repository
    • 19. Structures similarity. Molecule similarity: a suitable in silico representation is the 2D binary fingerprint. Example 8-bit fingerprints (bit positions 0 to 7): X: 0 1 1 0 1 1 0 1; Y: 0 1 0 1 0 1 1 0
    • 20. Structures similarity. Molecule similarity
      • Important fingerprint properties: (1) Length: length of the binary vector, e.g. 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0; (2) Density: fraction of 1-bits, e.g. sparse fingerprint (sFP) 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 versus dense fingerprint (dFP) 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
      • Various fingerprint types exist: different atom typing and generation procedure; different properties (length, density, ...)
      • Alternative representation: feature list. Store only the index numbers of vector positions for memory-efficient storage, e.g. 0 1 0 1 0 1 1 0 is stored as 1,3,5,6
    • 21. Structures similarity. Molecule similarity
      • Molecules as binary vectors
      • Various chemoinformatics dis-/similarity measures: Euclidean distance; cosine similarity (inner product)
      • Most frequently used: Tanimoto coefficient [2, 3]: corresponds to the Jaccard index; metric; range [0.0, 1.0] (dissimilar → similar)
      • References: 2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579. 3. Tanimoto T.T., IBM Internal Report (1957)
    • 22. Full Similarity Matrix Clustering. Results: Clustering the Available Chemspace
      • ZINC all purchasable set: ~17 × 10^6 compounds (sFP)
      • Tanimoto cutoff analysis: 0.76
      • Opteron, 64 threads, 100 GB main memory
      • Total run-time: 64 hours; CCs decomposition: 12 hours
    • 23. Federated linked system
    • 24. Thank you Email: tkachenkov@rsc.org Slides: http://www.slideshare.net/valerytkachenko16
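
The category assignment described on slide 17 (articles whose cosine similarity to a category vector exceeds an adjustable threshold go into that category) can be sketched as follows. This is a minimal illustration only: the Latent Semantic Analysis step that produces the vectors is omitted, and the function names, the three-dimensional toy vectors, and the article identifiers are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category exceeds the threshold."""
    return [
        article_id
        for article_id, vector in article_vectors.items()
        if cosine_similarity(vector, category_vector) > threshold
    ]


# Hypothetical feature vectors, assumed to be already projected by LSA.
category = [1.0, 0.0, 1.0]
articles = {
    "a1": [2.0, 0.0, 2.0],  # same direction as the category vector
    "a2": [0.0, 3.0, 0.0],  # orthogonal to the category vector
    "a3": [1.0, 1.0, 0.0],  # partial overlap, cosine about 0.5
}

strict = assign_to_category(articles, category, threshold=0.8)
loose = assign_to_category(articles, category, threshold=0.4)
print(strict, loose)
```

Lowering the threshold admits more loosely related articles, which is presumably why the slide describes it as adjustable.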