Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals

The Royal Society of Chemistry has hosted the ChemSpider database and associated platforms for over five years. Technology has progressed significantly over that period but, more importantly, community needs have grown, both in the variety of data types and in search performance. Preprocessing chemicals for improved similarity searching and compound database navigation is seen as one crucial component of the major development effort to architect a new data repository. This component is being engineered and implemented in collaboration with the group of Professor Oliver Kohlbacher at the University of Tübingen, who have developed an approach for clustering large chemical libraries based on a fast, parallel, and purely CPU-based algorithm for 2D binary fingerprint similarity calculation. Using this method, the complete similarity network of our seed set of tens of millions of chemicals has been analyzed at a Tanimoto threshold of 0.6, and all similarity links were fed into our database. Storing these links is highly beneficial: it will allow us to create more complex and enriching visualizations of similar compounds, with associated bioactivity data and physicochemical properties, for users of the RSC chemical repository. This presentation provides an overview of our experience in applying clustering to our compound data and how it will be used to enrich data navigation on the RSC data repository.

Clustering the Royal Society of Chemistry
chemical repository to enable enhanced
navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams,
Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS, 248th National Meeting
San Francisco, CA
August 14th
2014

• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web-searching

Twelve broad categories
Largest category is 30 times the size of the smallest

How does it work?
Latent Semantic Analysis to build feature sets
for (1) articles (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff)
build a category vector.
All articles with a cosine similarity greater than
an adjustable threshold go into the category.
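
The final thresholding step described above can be sketched in a few lines. This is only an illustration, not the RSC implementation: the Latent Semantic Analysis step that produces the feature vectors is not shown, and the function names, article ids, three-feature vectors and the 0.5 threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: an empty vector matches nothing
    return dot / (norm_u * norm_v)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category vector exceeds the threshold."""
    return [
        article_id
        for article_id, vector in article_vectors.items()
        if cosine_similarity(vector, category_vector) > threshold
    ]


# Hypothetical feature vectors (in practice these would come out of the LSA step).
articles = {
    "article_a": [1.0, 0.0, 0.0],
    "article_b": [1.0, 1.0, 0.0],
    "article_c": [0.0, 0.0, 1.0],
}
category = [1.0, 0.0, 0.0]  # hypothetical expert-built category vector

selected = assign_to_category(articles, category, threshold=0.5)
print(selected)
```

Raising or lowering `threshold` trades precision against recall for a category, which is presumably why the slide calls it adjustable.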

Structures similarity: Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
X: 0 1 1 0 1 1 0 1
Y: 0 1 0 1 0 1 1 0
Bit index: 0 1 2 3 4 5 6 7

Structures similarity: Molecule Similarity
• Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
• Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
• Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Length: 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0
Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
Feature list: 0 1 0 1 0 1 1 0 → 1,3,5,6
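
The feature-list representation and the density property from this slide can be illustrated with a short sketch. The function names are hypothetical; only the example fingerprint 0 1 0 1 0 1 1 0 and its feature list 1,3,5,6 come from the slide.

```python
def to_feature_list(bits):
    """Return the 0-based positions of the 1-bits in a binary fingerprint."""
    return [index for index, bit in enumerate(bits) if bit == 1]


def from_feature_list(indices, length):
    """Rebuild the full binary vector from a feature list and the vector length."""
    on_bits = set(indices)
    return [1 if index in on_bits else 0 for index in range(length)]


def density(bits):
    """Fraction of 1-bits in the fingerprint."""
    return sum(bits) / len(bits)


fingerprint = [0, 1, 0, 1, 0, 1, 1, 0]   # example vector from the slide
features = to_feature_list(fingerprint)  # [1, 3, 5, 6]
print(features, density(fingerprint))
```

For a sparse fingerprint the feature list is much shorter than the bit vector, which is where the memory saving comes from; for a dense fingerprint the advantage shrinks.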

Structures similarity: Molecule Similarity
• Molecules as binary vectors
• Various chemoinformatics dis-/similarity measures:
– Euclidean distance
– Cosine similarity (inner product)
• Most frequently used: Tanimoto coefficient (refs. 2, 3)
– Corresponds to Jaccard index
– Metric (the corresponding distance 1 − T satisfies the triangle inequality)
– [0.0, 1.0] (dissimilar → similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
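
As a worked example of the Tanimoto coefficient, the sketch below uses the standard definition for binary vectors, T = c / (a + b − c), where a and b are the numbers of 1-bits in each fingerprint and c is the number of 1-bits they share. The function name is hypothetical, and the two vectors are the 8-bit X and Y examples from the earlier fingerprint slide.

```python
def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient of two equal-length binary vectors."""
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # c
    total = sum(x) + sum(y) - shared                            # a + b - c
    if total == 0:
        return 0.0  # convention chosen for this sketch when both vectors are empty
    return shared / total


x = [0, 1, 1, 0, 1, 1, 0, 1]  # 1-bits at positions 1, 2, 4, 5, 7
y = [0, 1, 0, 1, 0, 1, 1, 0]  # 1-bits at positions 1, 3, 5, 6

# Shared 1-bits: positions 1 and 5, so T = 2 / (5 + 4 - 2) = 2/7.
similarity = tanimoto(x, y)
print(round(similarity, 4))
```

At roughly 0.29, this pair would fall well below the 0.6 threshold mentioned in the abstract and would not be linked in the similarity network.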

Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
• ZINC all purchasable set: ~17 × 10^6 compounds (sFP)
• Tanimoto cutoff analysis: 0.76
• Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
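
The idea of cutting the similarity network at a Tanimoto cutoff and decomposing it into clusters can be sketched on toy data. Several assumptions apply: "CCs" is read here as connected components, a link is drawn when the similarity is greater than or equal to the cutoff, and the brute-force all-pairs loop below stands in for the fast parallel algorithm of the talk, which is not reproduced. The function names and the toy feature lists are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as sets of 1-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def connected_components(fingerprints, cutoff):
    """Link every pair with similarity >= cutoff and return the resulting clusters."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; fine for a toy set, not for millions of compounds.
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical feature-list fingerprints.
toy = [
    {1, 3, 5, 6},
    {1, 3, 5, 6, 7},
    {1, 3, 5, 7},
    {10, 11, 12, 13},
    {10, 11, 12, 13, 14},
    {20, 21},
]
clusters = connected_components(toy, cutoff=0.76)
print(clusters)
```

Compounds 0 and 2 have a similarity of only 0.6, yet they land in the same cluster because both are linked to compound 1. This chaining is a general property of connected-component clustering.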

Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16


Slide outline

1. Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals
2. Chemical space - 10^60
3. Navigation in chemical space
4. Clustering
5. Science dimensions
6. ~30 million chemicals and growing
7. ChemSpider
8. Properties
9. Classification
10. ChemSpider Data Slices
11. Tagging in ChemSpider
12. RSC Archive – since 1841
13. DERA - Digitally Enabling RSC Archive
14. Twelve broad categories
15. Twelve broad categories: largest category is 30 times the size of the smallest
16. 200 subcategories
17. How does it work?
18. RSC Data Repository
25. Structures similarity: Molecule Similarity
26. Structures similarity: important fingerprint properties
27. Structures similarity: chemoinformatics dis-/similarity measures
28. Full Similarity Matrix Clustering
29. Federated linked system
30. Thank you

Editor's Notes

Change to add more database, rearrange
