Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals
The Royal Society of Chemistry has hosted the ChemSpider database and associated platforms for over five years. Technology has progressed significantly over that period but, more importantly, the community's needs have grown, both in the variety of data types and in search performance. Preprocessing chemicals for improved similarity searching and compound database navigation is one crucial component of the major development effort to architect a new data repository. This component is being engineered and implemented in collaboration with the group of Professor Oliver Kohlbacher at the University of Tübingen, who have developed an approach for clustering large chemical libraries based on a fast, parallel, and purely CPU-based algorithm for 2D binary fingerprint similarity calculation. Using this method, the complete similarity network of our seed set of tens of millions of chemicals has been analyzed at a Tanimoto threshold of 0.6, and all similarity links were fed into our database. Storing these links will allow us to create richer visualizations of similar compounds, with associated bioactivity data and physicochemical properties, for users of the RSC chemical repository. This presentation gives an overview of our experience in applying clustering to our compound data and of how it will be used to enrich data navigation on the RSC data repository.


    1. 1. Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals. Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor. ACS 248th National Meeting, San Francisco, CA, August 14th 2014
    2. 2. Chemical space - 10^60
    3. 3. Navigation in chemical space
    4. 4. Clustering
    5. 5. Science dimensions
    6. 6. • ~30 million chemicals and growing
          • Data sourced from >500 different sources
          • Crowdsourced curation and annotation
          • Ongoing deposition of data from our journals and our collaborators
          • A structure-centric hub for web-searching
    7. 7. ChemSpider
    8. 8. Properties
    9. 9. Classification
    10. 10. ChemSpider Data Slices
    11. 11. Tagging in ChemSpider
    12. 12. RSC Archive – since 1841
    13. 13. DERA - Digitally Enabling RSC Archive
    14. 14. Twelve broad categories
    15. 15. Twelve broad categories: the largest category is 30 times the size of the smallest
    16. 16. 200 subcategories
    17. 17. How does it work? Latent Semantic Analysis is used to build feature sets for (1) articles and (2) categories. Features: words, citations and pairs of words. Domain experts (Journal Development staff) build a category vector. All articles with a cosine similarity greater than an adjustable threshold go into the category.
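The thresholding step described on slide 17 can be sketched as follows. This is a minimal illustration, not the DERA implementation: it assumes the article and category vectors have already been projected into a common feature space (the Latent Semantic Analysis step itself is omitted), and the function names, the three-dimensional vectors, and the threshold value are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def assign_categories(article_vec, category_vecs, threshold):
    """Return every category whose vector exceeds the similarity threshold."""
    return [
        name
        for name, vec in category_vecs.items()
        if cosine_similarity(article_vec, vec) > threshold
    ]


# Hypothetical feature vectors, invented purely for illustration.
categories = {
    "analytical": [1.0, 0.0, 0.0],
    "organic": [0.0, 1.0, 1.0],
}
article = [0.1, 0.9, 0.8]

print(assign_categories(article, categories, threshold=0.7))
```

Because each category is tested independently, an article can land in several categories or in none, and lowering the threshold widens every category at once.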
    18. 18. RSC Data Repository
    19. 19. Structure similarity: Molecule Similarity. Similarity? Suitable in silico representation: 2D binary fingerprints
            Bit index: 0 1 2 3 4 5 6 7
            X:         0 1 1 0 1 1 0 1
            Y:         0 1 0 1 0 1 1 0
    20. 20. Structure similarity: Molecule Similarity
            • Important fingerprint properties:
              1. Length: length of the binary vector
              2. Density: fraction of 1-bits
            • Various fingerprint types exist
              – Different atom typing and generation procedure
              – Different properties (length, density, ...)
            • Alternative representation: Feature list
              – Store only index numbers of vector positions
              – Memory-efficient storage
            Length: 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0
            Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
            Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
            Feature list: 0 1 0 1 0 1 1 0 → 1,3,5,6
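The fingerprint properties on slide 20 can be made concrete with a short sketch. It assumes a fingerprint is held as a plain Python list of 0/1 values with zero-based bit positions (which matches the slide's 1,3,5,6 example); the helper names are hypothetical and are not taken from any fingerprint library.

```python
def to_feature_list(bits):
    """Store only the index numbers of the positions that are set to 1."""
    return [i for i, bit in enumerate(bits) if bit]


def from_feature_list(features, length):
    """Rebuild the full binary vector from a feature list and its length."""
    on_bits = set(features)
    return [1 if i in on_bits else 0 for i in range(length)]


def density(bits):
    """Fraction of 1-bits in the fingerprint."""
    return sum(bits) / len(bits)


# Bit strings copied from the slide.
example = [0, 1, 0, 1, 0, 1, 1, 0]
sparse = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
dense = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]

print(to_feature_list(example))         # [1, 3, 5, 6]
print(density(sparse), density(dense))  # 0.1875 0.6875
```

The feature list pays off for sparse fingerprints: the 16-bit sparse example needs only three stored indices, whereas the dense one needs eleven.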
    21. 21. Structure similarity: Molecule Similarity
            • Molecules as binary vectors
            • Various chemoinformatics dis-/similarity measures:
              – Euclidean distance
              – Cosine similarity (inner product)
            • Most frequently used: Tanimoto coefficient [2,3]
              – Corresponds to Jaccard index
              – Metric
              – [0.0, 1.0] (dissimilar → similar)
            References:
            2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
            3. Tanimoto T.T., IBM Internal Report (1957)
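The three measures named on slide 21 can be compared on the two 8-bit example fingerprints from slide 19. This is a minimal sketch that assumes each fingerprint is given as a feature list of its set bit positions; the function and variable names are hypothetical. For binary vectors the Tanimoto coefficient is c / (a + b - c), where a and b are the numbers of set bits in each fingerprint and c is the number they share.

```python
import math


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by on-bits in either fingerprint (Jaccard index)."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / union


def cosine(fp_a, fp_b):
    """Cosine similarity of two binary vectors given as feature lists."""
    a, b = set(fp_a), set(fp_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def euclidean(fp_a, fp_b):
    """Euclidean distance: square root of the number of differing bits."""
    return math.sqrt(len(set(fp_a) ^ set(fp_b)))


# Feature lists of the bit strings 0 1 1 0 1 1 0 1 and 0 1 0 1 0 1 1 0.
fp_a = [1, 2, 4, 5, 7]
fp_b = [1, 3, 5, 6]

print(round(tanimoto(fp_a, fp_b), 4))   # 0.2857  (2 shared bits / 7 bits in the union)
print(round(cosine(fp_a, fp_b), 4))     # 0.4472
print(round(euclidean(fp_a, fp_b), 4))  # 2.2361
```

Working on feature lists keeps the cost proportional to the number of set bits rather than to the fingerprint length, which favours sparse fingerprints.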
    22. 22. Full Similarity Matrix Clustering. Results: Clustering the Available Chemspace
            • ZINC all purchasable set: ~17×10^6 compounds (sFP)
            • Tanimoto cutoff analysis: 0.76
            • Opteron, 64 threads, 100 GB main memory
            • Total run-time: 64 hours
            • CCs decomposition: 12 hours
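Assuming "CCs" on slide 22 stands for connected components, the idea can be sketched on a toy scale: link every pair of compounds whose Tanimoto coefficient reaches the cutoff, then read clusters off as the connected components of that similarity network. The code below is a naive all-pairs version with a union-find structure. It is not the parallel algorithm used in the work, and the five toy fingerprints are invented for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def connected_components(fingerprints, cutoff):
    """Cluster fingerprint indices by linking pairs with Tanimoto >= cutoff."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive O(n^2) scan of the full similarity matrix.
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy fingerprints (sets of on-bit positions).
toy = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 5, 6}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 12, 13}),
]

print(connected_components(toy, cutoff=0.6))   # [[0, 1, 2], [3, 4]]
print(connected_components(toy, cutoff=0.76))  # [[0], [1], [2], [3], [4]]
```

Compounds 0 and 2 end up in the same cluster at cutoff 0.6 even though their own similarity is only about 0.33, because both are linked to compound 1. This chaining is inherent to connected-component clustering, and it is why the choice of cutoff has such a strong effect on cluster sizes.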
    23. 23. Federated linked system
    24. 24. Thank you
            Email: tkachenkov@rsc.org
            Slides: http://www.slideshare.net/valerytkachenko16
