1. Catching the Drift –
Indexing Implicit Knowledge in
Chemical Digital Libraries
Benjamin Köhncke, Sascha Tönnies, & Wolf-Tilo Balke
L3S Research Center
TPDL, Sep. 23 – 27, 2012, Cyprus
2. Outline
Introducing the problem
Describing our way to handle this problem
Showing that this is a valid way
Is there still room for improvement?
Sascha Tönnies 15/02/13 2
3. The Problem
In chemistry, search is entity-centered
Chemical entities can occur in different ways
Structures / Images / String representations
Synonyms available
Already a tough task for indexing and retrieval (JCDL 2010)
The field of drug design is even more complex
Not only searching for a specific entity or similar entities
But for entities having the same or similar
characteristic (chemical reaction)
Chemists have to use their implicit
knowledge
No database available
4. The Question
How can we reflect the chemist’s perception
of chemical entities belonging to the same
chemical class to support him
during his search task?
6. The Idea
Identifying the functional groups of all
occurring chemical entities and clustering them
according to their sets of functional groups.
8. First Questions to answer…
How to build meaningful clusters?
Experimental Set Up
Dump of the PubChem database containing 31.5 million entities
Calculation of functional groups
Extending the standard tool checkmol by “dimensions”
9. Meaningful Clusters by Functional Groups?
Clustering by set of functional groups
Cluster Name = MD5(set(functional group names))
Clusters up to 100 entities are reasonable
Evaluated by domain experts
97.84% of the clusters are already usable, but they contain only around 30% of all entities
# Contained Entities        # Clusters
1                           773,092
1 < x ≤ 10                  816,817
10 < x ≤ 100                226,147
100 < x ≤ 1,000             36,535
1,000 < x ≤ 10,000          3,615
10,000 < x ≤ 100,000        143
100,000 < x                 0
[Figure: percentage of all clusters and percentage of all entities (0% to 100%) by number of entities per cluster]
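The cluster naming rule above, MD5 over the set of functional group names, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sorting step, the "|" separator, and the toy entity data are assumptions made so that the hash does not depend on input order.

```python
import hashlib
from collections import defaultdict


def cluster_name(functional_groups):
    """Return an MD5 hex digest for an order-independent set of group names."""
    # Assumption: de-duplicate, sort, and join with "|" to get a canonical string.
    canonical = "|".join(sorted(set(functional_groups)))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def cluster_entities(entity_groups):
    """Group entity ids by the hash of their functional group set."""
    clusters = defaultdict(list)
    for entity_id, groups in entity_groups.items():
        clusters[cluster_name(groups)].append(entity_id)
    return dict(clusters)


# Hypothetical toy input: entity id -> functional group names.
toy = {
    "e1": ["hydroxy", "carboxylic acid"],
    "e2": ["carboxylic acid", "hydroxy"],
    "e3": ["primary amine"],
}
clusters = cluster_entities(toy)
```

Entities e1 and e2 end up in the same cluster because their group sets are identical, regardless of the order in which the groups were listed.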
10. Dividing Big Clusters into Sub-Clusters
Sub-clustering of clusters containing more than 100 entities
We have to find suitable similarity measures
Using similarity functions based on fingerprints
Only uncorrelated combinations chosen (JCDL2011)
100 clusters with more than 1,000 entities chosen at random
10 queries chosen at random
Similarity calculation between query and all other cluster entities
For which of the measures are the top-X ranked entities in the
same functional group cluster?
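The evaluation idea on this slide can be sketched as follows: rank all other entities by fingerprint distance to a query and check how many of the top-X share the query's cluster. This is only a sketch under assumptions: Manhattan distance is used because the take-aways name it, the fingerprints are toy bit vectors, and the function names and the `cluster_of` mapping are hypothetical.

```python
def manhattan(fp_a, fp_b):
    """Manhattan (L1) distance between two equal-length fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))


def top_x(query_id, fingerprints, x):
    """Return the ids of the x entities closest to the query (ties broken by id)."""
    query_fp = fingerprints[query_id]
    scored = [
        (manhattan(query_fp, fp), entity_id)
        for entity_id, fp in fingerprints.items()
        if entity_id != query_id
    ]
    scored.sort()
    return [entity_id for _, entity_id in scored[:x]]


def share_in_same_cluster(query_id, ranked_ids, cluster_of):
    """Fraction of ranked entities that share the query's cluster."""
    if not ranked_ids:
        return 0.0
    hits = sum(1 for e in ranked_ids if cluster_of[e] == cluster_of[query_id])
    return hits / len(ranked_ids)


# Hypothetical toy data: 4-bit fingerprints and cluster labels.
fingerprints = {"q": [1, 1, 0, 0], "a": [1, 1, 0, 1], "b": [0, 0, 1, 1], "c": [1, 0, 0, 0]}
cluster_of = {"q": "A", "a": "A", "b": "B", "c": "B"}
ranked = top_x("q", fingerprints, 2)
share = share_in_same_cluster("q", ranked, cluster_of)
```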
12. In Search of the K (1)
We are using k-means clustering (WEKA implementation)
Each group must contain at least one object
Each object must belong to exactly one group
The aim: each entity in a sub-cluster has the same chemical class
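For readers unfamiliar with k-means, the loop below shows the basic idea in plain Python. It is an illustrative reimplementation, not the WEKA code used in the talk: it uses squared Euclidean distance and takes the first k points as initial centroids, both of which are assumptions made to keep the example short and deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Return (assignment, centroids); every point gets exactly one group index."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: deterministic initialisation from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids will not move either
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # A group that lost all members keeps its previous centroid;
            # this sketch does not re-seed empty groups.
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical toy data: two well-separated groups in 2D.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, centroids = kmeans(points, k=2)
```

Each object belongs to exactly one group by construction; the non-empty-group condition is only guaranteed here when the initial points are distinct and no group empties out during the updates.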
13. In Search of the K (2)
We took the domain-specific ontology ChEBI as ground truth
We randomly took 2,000 clusters (5%)
Only clusters containing entities also included in CheBI
Idea: Taking the ontology class as cluster label
Only nodes that are at least 3 steps away from the entry node
(CIKM2010)
We manually built the respective sub-clusters
Evaluation algorithm stops once k-means has found the optimal solution
Here it is k = 4
Remark: ChEBI contains 20,000 chemical classes; for our
dataset, we found 150,000 (implicit) classes
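One common way to compare a clustering against ground-truth labels is cluster purity. The sketch below is an assumption about how such a check could look, not the evaluation algorithm from the talk; the class labels are hypothetical stand-ins for ontology classes.

```python
from collections import Counter


def purity(assignment, labels):
    """Fraction of entities whose label is the majority label of their sub-cluster."""
    if len(assignment) != len(labels) or not labels:
        raise ValueError("assignment and labels must be non-empty and equally long")
    by_cluster = {}
    for cluster_id, label in zip(assignment, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical labels for five entities.
labels = ["alcohol", "alcohol", "amine", "amine", "ketone"]
mixed = purity([0, 0, 1, 1, 1], labels)   # the ketone shares a sub-cluster with amines
pure = purity([0, 0, 1, 1, 2], labels)    # every sub-cluster holds a single class
```

A purity of 1.0 corresponds to the stated aim that every entity in a sub-cluster has the same chemical class.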
14. Second Question to answer…
Are these clusters usable for a document retrieval task?
Experimental Set Up
Collection of 2588 chemical documents from
Archive of Organic Chemistry (ARKIVOC)
Each document is associated with its functional group clusters
based on the entities it contains
Precision/Recall analysis by domain experts
Representative sub-set of 10% of the entire collection
Only entities occurring in > 20 but < 100 documents were taken
From these documents we randomly selected around 5% (18) as
query terms
15. Is the sub-cluster decomposition sensible?
Recall around 93%
Some entities from other sub-clusters are also relevant
Precision on average up to 53% for k = 12
Recall-oriented F2: 68%
[Figure: Recall, Precision, F1 and F2 (0% to 100%) over x-axis values 1 to 14]
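For reference, precision, recall, and the F-beta family (F1 for beta = 1, the recall-oriented F2 for beta = 2) can be computed as below. The document ids are hypothetical toy data and are not taken from the evaluation in the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy query: four documents retrieved, three relevant overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
```

When recall is higher than precision, as in this toy query, F2 is larger than F1 because it rewards recall more strongly.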
16. Is it possible to increase that?
Not just deliver all documents within the cluster
Using similarity function to rank the documents
Based on Wikipedia categories (CIKM2010)
$s_{wc}(q_i, d_j) = \frac{c_{q_i d_j}}{c_{q_i}} \times \frac{c_{d_j}}{e_{d_j}}$
Evaluation of Mean Average Precision up to 72%
It is enough to retrieve only documents within the same sub-cluster
[Figure: Mean Average Precision (65% to 73%) plotted against k = 1 to 14]
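Mean Average Precision (MAP) averages, over all queries, the precision measured at each rank where a relevant document appears. The sketch below uses the standard definition; the rankings and relevance sets are hypothetical toy data.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where a relevant document occurs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy runs for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_score = mean_average_precision(runs)
```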
17. Even more findings (1)
Comparison of number of entities for k = 1 and k = 12
Averaged over all queries, the number of entities decreases by around 90%
Recall does not decrease, thus high cluster quality
[Figure: per-query chart (0% to 100%) for queries 1 to 18, series #Entities K = 12 and #Entities K = 1]
Query    #Entities K = 12    #Entities K = 1
1        733                 4638
2        23                  164
3        372                 14657
4        158                 1699
5        1401                10139
6        143                 1187
7        131                 1296
8        699                 7200
9        1078                5381
10       46                  506
11       112                 4624
12       88                  539
13       37                  465
14       1234                19885
15       6043                27423
16       1012                10139
17       689                 15347
18       21                  293
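The per-query reduction can be recomputed directly from the entity counts on this slide. The snippet defines the reduction as 1 minus the ratio of the K = 12 count to the K = 1 count, which is an assumption about how the figure was derived.

```python
# Entity counts per query, copied from the slide (queries 1 to 18).
entities_k12 = [733, 23, 372, 158, 1401, 143, 131, 699, 1078,
                46, 112, 88, 37, 1234, 6043, 1012, 689, 21]
entities_k1 = [4638, 164, 14657, 1699, 10139, 1187, 1296, 7200, 5381,
               506, 4624, 539, 465, 19885, 27423, 10139, 15347, 293]

# Assumed definition: share of entities removed when going from K = 1 to K = 12.
reductions = [1 - small / large for small, large in zip(entities_k12, entities_k1)]
mean_reduction = sum(reductions) / len(reductions)

for query, reduction in enumerate(reductions, start=1):
    print(f"query {query:2d}: {reduction:.1%}")
print(f"mean reduction: {mean_reduction:.1%}")
```

With these counts the mean comes out just under 90%, in line with the reduction quoted on the slide.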
18. Even more findings (2)
Number of clusters including a certain percentage of all entities
3,500 sub-clusters have been reduced to 3% of the number of entities for k = 1
In a faceted search scenario, this is quite important
[Figure: number of clusters (0 to 4,000) against entity reduction factor in percent (1 to 100)]
19. Take Aways
Simple clustering based on functional groups is not enough!
Most clusters are too unspecific
Sub-clustering with k-means and substructure fingerprints with
Manhattan distance worked fine
Group of domain experts evaluated that almost all relevant
documents (recall of 93%) are located in the respective sub-cluster
Instead of just delivering all documents from the respective cluster,
we also introduced a ranking measure based on Wikipedia
categories to further enhance the precision (MAP 72%).
The number of entities in the sub-clusters
decreased dramatically, by about 90%, compared
to the original functional group clusters.
21. Backup
PRODUCT (?i).* Formation of \s [CHEMICAL]
PRODUCT (?i).* One-pot synthesis of \s [CHEMICAL]
PRODUCT (?i).* Preparation of (\s+[-\w|\p{InGreek}]*\s*){0,2} [CHEMICAL]
Phenanthrene is a (3/1)
Dicumarol is a (2/2)