Clustering Made Human: US UGM 2008

Clustering chemical structures alleviates the tedious task of browsing a large set of compounds by grouping individual structures into generic categories. ChemAxon's JKlustor product offers clustering solutions ranging from a similarity-based non-hierarchical method to a purely graph-based technique. The latter has some clear advantages over the more conventional approaches: its clusters are more likely to meet human expectations, and it also produces a tangible explanation of why certain compounds are grouped together. It is even faster. If you 'farm your classes' then it's time to 'MCS your library'!
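
The similarity-based approach mentioned above compares molecules through a numeric similarity measure; the slides name Tanimoto similarity for this. As a minimal sketch, assuming each fingerprint is represented as a Python set of "on" bit positions (a representation chosen here for illustration, not JKlustor's own data format), the measure can be computed like this:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Defined as |A & B| / |A | B|. Two empty fingerprints are treated as
    identical (similarity 1.0); that convention is an assumption of this sketch.
    """
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    if union == 0:
        return 1.0
    return shared / union


# Hypothetical toy fingerprints: 2 shared bits out of 5 distinct bits.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
similarity = tanimoto(fp1, fp2)
dissimilarity = 1.0 - similarity
print(f"similarity={similarity:.2f} dissimilarity={dissimilarity:.2f}")
```

The dissimilarity (1 minus similarity) is the quantity that threshold-style clustering parameters are usually expressed in.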

Latest developments are here: http://www.chemaxon.com/product/jklustor.html

Published in: Technology

  1. Clustering made human. Miklos Vargyas • Solutions for Cheminformatics
  2. Cluster in computing: computer cluster
  3. Cluster in chemistry: transition metal carbonyl clusters; dimanganese-decacarbonyl; di-tungsten tetra(hpp); transition metal halide clusters; boron hydrides; gas-phase clusters and fullerenes
  4. Cluster in chemistry/physics: nanoscale particles • Fullerenes • Nano machines. Images produced by MarvinSpace
  5. Star cluster: gravitationally bound groups of stars. Image from Wikipedia, the free encyclopedia
  6. Clustering cars (live demonstration)
     Group by property • Shape, size, type, brand, colour • Many possible arrangements, multiple aspects
     Group by similarity • Categorical perception
  7. Why is clustering stars easy? God did the job for us! • Stars have an apparent spatial arrangement • Distance between stars defines clusters
  8. Why is clustering cars hard?
     Lack of innate spatial arrangement • Artificial arrangement • Various approaches, no superior one • “Cars come in all shapes and sizes”
     Problem of dimensionality • Why 2?!
  9. So what about molecules? Are they like stars or rather like cars? • They come in all shapes and sizes • Vast number of properties
     Chemical spaces • Select molecular properties • Estimate or measure them • Use them as coordinates • Place your molecules as points in this abstract space • Group those that are close to each other to form clusters
  10. Example in 2D
  11. Further attempts in 2D [scatter plots; axis labels: logP, mass, tpsa]
  12. Molecule clusters by similarity: Jarvis-Patrick clustering
     • Fast • Tanimoto similarity • Globular clusters • Tendency to create a large number of singletons • Molecular properties & fingerprint
     Command: jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -o SC1000.jarp.t0.6.c0.1 -g -y -z
     Number of objects = 999
     Number of clusters (without singletons) = 2
     Number of singletons = 8
     Average dissimilarity = 0.66208726
     Minimum dissimilarity = 0.0
     Maximum dissimilarity = 0.9411765
  13. Parameter tuning
        t     c    Clusters  Singletons
       0.6   0.1       2          8
       0.3   0.1     179        248
       0.5   0.1       7         36
  14. The most populated cluster
  15. Parameter tuning
        t     c    Clusters  Singletons
       0.6   0.1       2          8
       0.3   0.1     179        248
       0.5   0.1       7         36
       0.5   0.5      10         37
       0.5   0.8      81        115
  16. Another cluster
  17. So what’s wrong with that?
     Problems: 1. manual tuning 2. lack of interpretability
     Need: • automated (unsupervised) techniques • easy-to-grasp, simple-to-understand “explanations”
     One possible solution: MCS-based clustering
  18. Maximum Common Substructure (MCS): the largest substructure shared by two molecules. Simple concept! More human, visual. Yet hard (= expensive (= slow)) to compute.
  19. MCS of a structure set
  20. Hierarchical star clusters: star
  21. Hierarchical star clusters: star cluster > star
  22. Hierarchical star clusters: galaxy > star cluster > star
  23. Hierarchical star clusters: local group > galaxy > star cluster > star
  24. Hierarchical star clusters: supercluster > cluster > local group > galaxy > star cluster
  25. Visualisation of hierarchy: dendrogram
  26. Hierarchical MCS
  27. Intuitive visualisation
  28. SAR table view
  29. R-group deconvolution
  30. Speed-up achieved last year [plot: running time (sec) against structure count, 0 to 35000; series: 2006, 2007, Linear (2007)]. Presented at UGM’07
  31. Speed-up achieved this year [plot: running time (sec) against structure count, 0 to 35000; series: 2006, 2007, 2008]
  32. Speed-up this year [plot: running time (sec), axis from 0.1 to 10000, against structure count, 0 to 35000; series: 2006, 2007, 2008]
  33. Clustering performance comparison [plot: running time (min) against structure count, 0 to 120000; series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]
  34. Find out more
     Product descriptions & links: www.chemaxon.com/products.html
     Forum: www.chemaxon.com/forum
     Presentations and posters: www.chemaxon.com/conf
     Download: www.chemaxon.com/download.html
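
Slide 12 describes Jarvis-Patrick clustering over Tanimoto similarity. The following is a minimal sketch of the textbook fixed-k variant of that algorithm, not ChemAxon's implementation: the parameters `k` and `k_min`, the set-of-bits fingerprint representation, and the toy data are all assumptions made for illustration, and they do not correspond to the `-t` and `-c` options shown on the slide. Two objects join the same cluster when each appears in the other's nearest-neighbour list and the two lists share at least `k_min` members.

```python
from itertools import combinations


def tanimoto_dissimilarity(fp_a: set, fp_b: set) -> float:
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return 1.0 - (shared / union if union else 1.0)


def jarvis_patrick(fps: list, k: int = 3, k_min: int = 2) -> list:
    """Cluster fingerprints with the classic fixed-k Jarvis-Patrick rule."""
    n = len(fps)
    # Step 1: the k nearest neighbours of every object (excluding itself).
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: tanimoto_dissimilarity(fps[i], fps[j]),
        )
        neighbours.append(set(others[:k]))

    # Step 2: merge pairs that are mutual neighbours and share >= k_min
    # neighbours, using a small union-find structure.
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = i in neighbours[j] and j in neighbours[i]
        if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
            parent[find(i)] = find(j)

    # Step 3: collect members per cluster root; singletons stay alone.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy data: two tight groups of four plus one outlier.
toy = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {1, 3, 4, 5},
    {10, 11, 12, 13}, {10, 11, 12, 14}, {10, 11, 13, 14}, {10, 12, 13, 14},
    {20, 21},
]
print(jarvis_patrick(toy, k=3, k_min=2))
```

Changing `k` and `k_min` shifts the balance between a few large clusters and many singletons, which is the kind of manual tuning that slides 13 to 17 identify as a drawback.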
