Talk catching the_drift

Catching the Drift –
Indexing Implicit Knowledge in
Chemical Digital Libraries

Benjamin Köhncke, Sascha Tönnies, & Wolf-Tilo Balke
L3S Research Center
TPDL, Sep. 23 – 27, 2012, Cyprus

Outline
 Introducing the problem
 Describing our way to handle this problem
 Showing that this is a valid way
 Is there still room for improvement?

Sascha Tönnies 15/02/13

The Problem
 In chemistry, search is entity-centered
 Chemical entities can occur in different ways
 Structures / Images / String representations
 Synonyms available
 Already a tough task for indexing and retrieval (JCDL 2010)
 The field of drug design is even more complex
 Not only searching for a specific entity or similar entities
 But for entities having the same or similar
characteristic (chemical reaction)
 Chemists have to use their implicit
knowledge
 No database available


The Question

How can we reflect the chemist’s perception
of chemical entities belonging to the same
chemical class to support him
during his search task?


Use Case: Anti-Tuberculosis Drugs

Comics taken from GiZGRAPHICS@fotolia.com


The Idea

Identify the functional groups of all
occurring chemical entities and cluster the entities
according to their set of functional groups.


The Workflow…

[Figure: reaction scheme with Reactant(s), Reagent(s), Reaction Conditions, and Product(s)]


First Questions to answer…
 How to build meaningful clusters?

Experimental Set Up
 Dump of the PubChem database containing 31.5 million entities
 Calculation of functional groups
 Extending the standard tool checkmol by “dimensions”


Meaningful Clusters by Functional Groups?
 Clustering by set of functional groups
 Cluster Name = MD5(set(functional group names))
 Clusters up to 100 entities are reasonable
 Evaluated by domain experts
 97.84% of the clusters are already usable, but they contain only around 30% of all entities

# Contained Entities        # Clusters
1                              773,092
1 < x ≤ 10                     816,817
10 < x ≤ 100                   226,147
100 < x ≤ 1,000                 36,535
1,000 < x ≤ 10,000               3,615
10,000 < x ≤ 100,000               143
100,000 < x                          0

[Chart: percentage of all clusters and percentage of all entities (0% to 100%) by number of entities per cluster]
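
The naming rule "Cluster Name = MD5(set(functional group names))" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the canonical form (sorted names joined with "|"), the function names and the toy entities are all assumptions.

```python
import hashlib
from collections import defaultdict


def cluster_name(functional_groups):
    """Return an MD5 hex digest that identifies a set of functional group names."""
    # Assumption: de-duplicate and sort the names, then join them with "|",
    # so that the same set always produces the same key regardless of order.
    canonical = "|".join(sorted(set(functional_groups)))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def cluster_entities(entities):
    """Group entity ids by the hash of their functional group set."""
    clusters = defaultdict(list)
    for entity_id, groups in entities.items():
        clusters[cluster_name(groups)].append(entity_id)
    return dict(clusters)


# Hypothetical toy input: entity id -> functional group names.
toy_entities = {
    "ethanol": ["hydroxy", "primary alcohol"],
    "1-propanol": ["primary alcohol", "hydroxy"],
    "acetone": ["ketone", "carbonyl"],
}
toy_clusters = cluster_entities(toy_entities)
```

Here the two alcohols share one cluster because their functional group sets are identical, while acetone lands in a cluster of its own.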


Dividing Big Clusters into Sub-Clusters
 Sub-clustering of clusters containing more than 100 entities
 We have to find suitable similarity measures
 Using similarity functions based on fingerprints
 Only uncorrelated combinations chosen (JCDL 2011)
 100 clusters with more than 1,000 entities chosen at random
 10 queries chosen at random
 Similarity calculation between the query and all other cluster entities
 For which measures are the top-X ranked entities in the same functional group cluster?
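
A rough sketch of this evaluation step, assuming fingerprints are equal-length lists of 0/1 bits. The Manhattan distance and the Russell-Rao coefficient follow their standard definitions; the helper names and the toy data are hypothetical, and the ranking here uses Manhattan distance only.

```python
def manhattan_distance(fp_a, fp_b):
    """Sum of absolute bit differences between two fingerprints."""
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))


def russell_rao(fp_a, fp_b):
    """Russell-Rao similarity: bits set in both fingerprints / total number of bits."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    return both / len(fp_a)


def top_x_same_cluster(query_id, fingerprints, cluster_of, x):
    """Fraction of the x nearest entities that share the query's cluster."""
    others = [e for e in fingerprints if e != query_id]
    # Smaller distance means more similar; the entity id breaks ties deterministically.
    ranked = sorted(
        others,
        key=lambda e: (manhattan_distance(fingerprints[query_id], fingerprints[e]), e),
    )
    top = ranked[:x]
    hits = sum(1 for e in top if cluster_of[e] == cluster_of[query_id])
    return hits / len(top) if top else 0.0


# Hypothetical toy data: 4-bit fingerprints and functional group cluster labels.
toy_fps = {"q": [1, 1, 0, 0], "a": [1, 1, 0, 1], "b": [1, 0, 0, 0], "c": [0, 0, 1, 1]}
toy_cluster_of = {"q": "A", "a": "A", "b": "A", "c": "B"}
top2 = top_x_same_cluster("q", toy_fps, toy_cluster_of, x=2)
```

Repeating this for each fingerprint and measure combination gives the kind of per-combination comparison the slide asks about.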


Results
 Top 100 Candidates
 E-State Fingerprint with Russell-Rao, Yule, Manhattan or Simpson
 Substructure Fingerprint with Russell-Rao or Manhattan
 Top 1000 Candidates
 Substructure Fingerprint with Manhattan
 Overall: Substructure Fingerprint with Manhattan
[Two bar charts (y-axes 0 to 100 and 0 to 800): results per fingerprint (Extended FP, E-State FP, FP, Graphonly FP, MACCS FP, Substructure FP) and similarity measure (Forbes, Russell-Rao, Yule, Manhattan, Simpson)]


In Search of the K (1)
 We are using k-means clustering (WEKA implementation)
 Each group must contain at least one object
 Each object must belong to exactly one group
 The aim: each entity in a sub-cluster has the same chemical class
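
For readers unfamiliar with k-means, here is a minimal pure-Python sketch of Lloyd's algorithm on fingerprint vectors. It is not the WEKA implementation used in the talk: the deterministic initialisation (first k distinct vectors), the squared Euclidean distance and the toy vectors are assumptions, and the result depends on that initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iter=100):
    """Plain k-means; returns exactly one cluster index per input vector."""
    # Assumption: the first k distinct vectors serve as initial centroids.
    centroids = []
    for v in vectors:
        if list(v) not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct vectors")

    assignment = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed its group
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a group would become empty
                centroids[c] = mean_vector(members)
    return assignment


# Hypothetical 4-bit fingerprints: two entities of one kind, two of another.
toy_vectors = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
toy_assignment = k_means(toy_vectors, k=2)
```

Because every vector receives exactly one index, the "each object must belong to exactly one group" constraint holds by construction.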


In Search of the K (2)
 We took the domain-specific ontology ChEBI as ground truth
 We randomly selected 2,000 clusters (5%)
 Only clusters containing entities that are also included in ChEBI
 Idea: taking the ontology class as cluster label
 Only nodes that are at least 3 steps away from the entry node (CIKM 2010)
 We manually built the respective sub-clusters
 The evaluation algorithm stops once k-means has found the optimal solution
 Here it is k = 4
 Remark: ChEBI contains 20,000 chemical classes for our dataset; we found 150,000 (implicit) classes
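
One possible reading of the stopping criterion, sketched under assumptions the slides do not confirm: increase k until every sub-cluster contains entities of a single ground-truth class. The function names, the toy labels and the stand-in clustering function are hypothetical.

```python
from collections import defaultdict
from math import ceil


def is_pure(assignment, labels):
    """True if every cluster contains entities of exactly one ground-truth class."""
    classes_per_cluster = defaultdict(set)
    for cluster_id, label in zip(assignment, labels):
        classes_per_cluster[cluster_id].add(label)
    return all(len(classes) == 1 for classes in classes_per_cluster.values())


def smallest_pure_k(cluster_fn, labels, max_k):
    """Try k = 1, 2, ... and return the first k whose clustering is pure, else None."""
    for k in range(1, max_k + 1):
        if is_pure(cluster_fn(k), labels):
            return k
    return None


# Hypothetical ground-truth classes for six entities.
toy_labels = ["alcohol", "alcohol", "ketone", "ketone", "amine", "amine"]


def toy_cluster_fn(k):
    """Stand-in for a real clustering call: at most k contiguous chunks."""
    chunk = ceil(len(toy_labels) / k)
    return [i // chunk for i in range(len(toy_labels))]


best_k = smallest_pure_k(toy_cluster_fn, toy_labels, max_k=6)
```

In practice `cluster_fn` would wrap the actual k-means run; the toy version only makes the loop testable.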


Second Question to answer…
 Are these clusters usable for a document retrieval task?

Experimental Set Up
 Collection of 2588 chemical documents from
Archive of Organic Chemistry (ARKIVOC)
 Each document is associated with its functional group clusters based on the entities it contains
 Precision/recall analysis by domain experts
 Representative subset of 10% of the entire collection
 Only entities occurring in more than 20 but fewer than 100 documents were used
 From these documents, we randomly selected around 5% (18) as query terms
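
The document-to-cluster association and the frequency filter for query candidates could look roughly like this. It is a sketch only: the data structures, function names and toy documents are assumptions, and the thresholds are parameters so that the toy example stays small.

```python
from collections import defaultdict


def build_cluster_index(doc_entities, entity_cluster):
    """Map each cluster id to the set of documents containing one of its entities."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            cluster = entity_cluster.get(entity)
            if cluster is not None:  # skip entities without a known cluster
                index[cluster].add(doc_id)
    return dict(index)


def candidate_query_entities(doc_entities, min_docs=20, max_docs=100):
    """Entities occurring in more than min_docs but fewer than max_docs documents."""
    doc_freq = defaultdict(int)
    for entities in doc_entities.values():
        for entity in set(entities):  # count each entity once per document
            doc_freq[entity] += 1
    return sorted(e for e, n in doc_freq.items() if min_docs < n < max_docs)


# Hypothetical toy collection.
toy_docs = {"d1": ["benzene", "phenol"], "d2": ["phenol"], "d3": ["acetone"]}
toy_entity_cluster = {"benzene": "c1", "phenol": "c1", "acetone": "c2"}
toy_index = build_cluster_index(toy_docs, toy_entity_cluster)
toy_candidates = candidate_query_entities(toy_docs, min_docs=1, max_docs=3)
```

A query for an entity then returns all documents stored under that entity's cluster id.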


Is the sub-cluster decomposition sensible?
 Recall around 93%
 Some entities from other sub-clusters are also relevant
 Precision on average up to 53% for k = 12
 Recall-oriented F2: 68%

[Chart: Recall, Precision, F1 and F2 (0% to 100%) over the x-axis values 1 to 14]
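
For reference, precision, recall and the F-measures follow the standard definitions, with F2 weighting recall more heavily than precision. The sketch below uses hypothetical document ids. Note that F-values averaged per query generally differ from the F-value computed from averaged precision and recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical query: four retrieved documents, two of them relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2"])
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
```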


Is it possible to increase that?
 Not just deliver all documents within the cluster
 Using a similarity function to rank the documents
 Based on Wikipedia categories (CIKM 2010)
 $s_{wc}(q_i, d_j) = \frac{c_{q_i d_j}}{c_{q_i}} \times \frac{c_{d_j}}{e_{d_j}}$
 The evaluation shows a Mean Average Precision of up to 72%
 It is enough to retrieve only documents within the same sub-cluster
[Chart: Mean Average Precision (65% to 73%) as a function of k, for k = 1 to 14]
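
Mean Average Precision (MAP) rewards rankings that place relevant documents early. A minimal sketch with the standard definition; the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Average of the precision values at the rank of each relevant document."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


# Hypothetical rankings for two queries.
toy_runs = [
    (["d1", "d3", "d2"], {"d1", "d2"}),
    (["d4", "d5"], {"d5"}),
]
toy_map = mean_average_precision(toy_runs)
```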


Even more findings (1)
 Comparison of number of entities for k = 1 and k = 12
 Averaged over all queries, the number of entities decreases by around 90%
 Recall does not decrease, thus high cluster quality

Query   #Entities k = 1   #Entities k = 12
1                  4638                733
2                   164                 23
3                 14657                372
4                  1699                158
5                 10139               1401
6                  1187                143
7                  1296                131
8                  7200                699
9                  5381               1078
10                  506                 46
11                 4624                112
12                  539                 88
13                  465                 37
14                19885               1234
15                27423               6043
16                10139               1012
17                15347                689
18                  293                 21

[Chart: #Entities for k = 12 and k = 1 per query (queries 1 to 18), y-axis 0% to 100%]
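
The "around 90%" figure can be checked directly from the per-query entity counts shown on this slide. The sketch below computes the reduction per query and then averages it; averaging the per-query percentages, rather than comparing totals, is an assumption about how the figure was obtained.

```python
# Entity counts per query, copied from the slide (18 queries).
entities_k1 = [4638, 164, 14657, 1699, 10139, 1187, 1296, 7200, 5381,
               506, 4624, 539, 465, 19885, 27423, 10139, 15347, 293]
entities_k12 = [733, 23, 372, 158, 1401, 143, 131, 699, 1078,
                46, 112, 88, 37, 1234, 6043, 1012, 689, 21]


def reduction_percent(before, after):
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before


per_query = [reduction_percent(b, a) for b, a in zip(entities_k1, entities_k12)]
mean_reduction = sum(per_query) / len(per_query)
print(f"mean reduction over {len(per_query)} queries: {mean_reduction:.1f}%")
```

The mean comes out at roughly 89%, consistent with the "around 90%" stated on the slide.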


Even more findings (2)
 Number of clusters including a certain percentage of all entities
 3,500 sub-clusters have been reduced to 3% of the number of entities for k = 1
 For a faceted search scenario, this is quite important

[Chart: number of clusters (0 to 4,000) versus entity reduction factor in percent (1 to 100)]


Take Aways
 Simple clustering based on functional groups is not enough!
 Most clusters are too unspecific
 Sub-clustering with k-means and the Substructure fingerprint with Manhattan worked fine
 A group of domain experts confirmed that almost all relevant documents (recall of 93%) are located in the respective sub-cluster
 Instead of just delivering all documents from the respective cluster, we also introduced a ranking measure based on Wikipedia categories to further enhance precision (MAP 72%).
 The number of entities in the sub-clusters decreased dramatically, by about 90%, compared to the original functional group clusters.


www.L3S.de/~toennies

Thank You!


Backup
PRODUCT (?i).* Formation of \s [CHEMICAL]
PRODUCT (?i).* One-pot synthesis of \s [CHEMICAL]
PRODUCT (?i).* Preparation of (\s+[-\w|\p{InGreek}]*\s*){0,2} [CHEMICAL]

Phenanthrene is a (3/1)
Dicumarol is a (2/2)

