Catching the Drift –
            Indexing Implicit Knowledge in
              Chemical Digital Libraries




Benjamin Köhncke, Sascha Tönnies, & Wolf-Tilo Balke
L3S Research Center
                                             TPDL, Sep. 23 – 27, 2012, Cyprus
Outline
   Introducing the problem
   Describing our way to handle this problem
   Showing that this is a valid way
   Is there still room for improvement?




                                                Sascha Tönnies   15/02/13   2
The Problem
 In chemistry, search is entity-centered
 Chemical entities can occur in different ways
    Structures / Images / String representations
    Synonyms available
    Already a tough task for indexing and retrieval (JCDL 2010)
 The field of drug design is even more complex
    Not only searching for a specific entity or similar entities
    But for entities having the same or similar
     characteristics (chemical reaction)
 Chemists have to use their implicit
  knowledge
    No database available


The Question

How can we reflect the chemist’s perception
 of chemical entities belonging to the same
       chemical class to support him
          during his search task?




Use Case: Anti-Tuberculosis Drugs




                                Comics taken from GiZGRAPHICS@fotolia.com

The Idea

   Identify the functional groups of all
occurring chemical entities and cluster them
 according to their sets of functional groups.




The Workflow…



[Figure: reaction scheme leading from Reactant(s) to Product(s), annotated with Reagent(s) and Reaction Conditions]




First Questions to answer…
 How to build meaningful clusters?

Experimental Set Up
 Dump of the PubChem database containing 31.5 million entities
    Calculation of functional groups
    Extending the standard tool checkmol by “dimensions”




Meaningful Clusters by Functional Groups?
 Clustering by set of functional groups
    Cluster Name = MD5(set(functional group names))
 Clusters of up to 100 entities are reasonable
    Evaluated by domain experts
 97.84% of all clusters are already usable, but they contain only around 30% of all entities
   # Contained Entities      # Clusters
            1                  773,092
        1 < x ≤ 10             816,817
       10 < x ≤ 100            226,147
      100 < x ≤ 1,000           36,535
    1,000 < x ≤ 10,000           3,615
   10,000 < x ≤ 100,000            143
      100,000 < x                    0

[Figure: percentage of all clusters and percentage of all entities (0% to 100%) plotted against the number of entities per cluster]
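The naming rule "Cluster Name = MD5(set(functional group names))" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sorting step, the "|" separator, the UTF-8 encoding, and the example entity and group names are all assumptions, chosen so that the same set of functional groups always yields the same name regardless of order.

```python
import hashlib
from collections import defaultdict


def cluster_name(functional_groups):
    # Deduplicate and sort so the name depends only on the set, not on order.
    canonical = "|".join(sorted(set(functional_groups)))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def cluster_entities(entity_groups):
    # Group entities whose functional group sets hash to the same name.
    clusters = defaultdict(list)
    for entity, groups in entity_groups.items():
        clusters[cluster_name(groups)].append(entity)
    return dict(clusters)


# Hypothetical entities and group names, for illustration only.
example = {
    "entity_a": ["hydroxy", "carboxylic acid"],
    "entity_b": ["carboxylic acid", "hydroxy"],
    "entity_c": ["ketone"],
}
clusters = cluster_entities(example)
for name, members in clusters.items():
    print(name, members)
```

Here entity_a and entity_b end up in the same cluster because their group sets are identical, while entity_c forms a cluster of its own.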
Dividing Big Clusters into Sub-Clusters
 Sub-clustering of clusters containing more than 100 entities
 We have to find suitable similarity measures
     Using similarity functions based on fingerprints
     Only uncorrelated combinations chosen (JCDL 2011)
   100 clusters with more than 1,000 entities chosen at random
   10 queries chosen at random
   Similarity calculated between each query and all other cluster entities
   For which measures are the top-X ranked entities in the
    same functional group cluster as the query?
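As a rough illustration of this evaluation, the sketch below ranks entities by fingerprint similarity to a query and counts how many of the top-X share the query's functional group cluster. It assumes fingerprints are given as sets of on-bit positions over a fixed length. The Russell-Rao similarity and the normalized Manhattan similarity follow their textbook definitions for binary vectors; all names and data are hypothetical.

```python
def russell_rao(fp_a, fp_b, n_bits):
    # Shared on-bits divided by the total fingerprint length.
    return len(fp_a & fp_b) / n_bits


def manhattan_similarity(fp_a, fp_b, n_bits):
    # For binary vectors the Manhattan distance is the number of differing
    # bits; it is converted into a similarity in [0, 1] here.
    return 1.0 - len(fp_a ^ fp_b) / n_bits


def top_x_same_cluster(query, candidates, cluster_of, fingerprints,
                       measure, n_bits, x):
    # Rank candidates by similarity to the query, most similar first.
    ranked = sorted(
        candidates,
        key=lambda e: measure(fingerprints[query], fingerprints[e], n_bits),
        reverse=True,
    )
    # Count top-x entities that share the query's functional group cluster.
    return sum(1 for e in ranked[:x] if cluster_of[e] == cluster_of[query])


# Hypothetical toy data: 8-bit fingerprints as sets of on-bit positions.
N_BITS = 8
fingerprints = {"q": {0, 1, 2}, "a": {0, 1, 2, 3}, "b": {0, 1}, "c": {5, 6, 7}}
cluster_of = {"q": "c1", "a": "c1", "b": "c1", "c": "c2"}
hits = top_x_same_cluster("q", ["a", "b", "c"], cluster_of, fingerprints,
                          manhattan_similarity, N_BITS, x=2)
print(hits)
```

Swapping `manhattan_similarity` for `russell_rao` compares the two measures on the same query.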




Results
 Top 100 Candidates
       Estate Fingerprint with Russell-Rao, Yule, Manhattan, or Simpson
       Substructure Fingerprint with Russell-Rao or Manhattan
 Top 1000 Candidates
       Substructure Fingerprint with Manhattan
 Overall: Substructure Fingerprint with Manhattan
[Figure: two bar charts (y-axes 0 to 100 and 0 to 800) comparing the fingerprints Extended FP, Estate FP, FP, Graphonly FP, MACCS FP, and Substructure FP under the measures Forbes, Russell-Rao, Yule, Manhattan, and Simpson]
In Search of the K (1)
 We are using k-means clustering (WEKA implementation)
    Each group must contain at least one object
    Each object must belong to exactly one group
 The aim: each entity in a sub-cluster has the same chemical class
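The two partition constraints above are what plain k-means yields as long as no cluster runs empty. The slides use the WEKA implementation; the following is only a minimal pure-Python sketch of Lloyd's algorithm with squared Euclidean distance on hypothetical 2-D points. Taking the first k points as initial centroids is an assumption made for determinism, not the setup from the talk.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    # Assumption: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each object goes to exactly one nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids


# Hypothetical data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Each point receives exactly one label, and on this toy data both clusters are non-empty.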




In Search of the K (2)
 We took the domain-specific ontology ChEBI as ground truth
 We randomly selected 2,000 clusters (5%)
    Only clusters containing entities also included in ChEBI
 Idea: Taking the ontology class as cluster label
    Only nodes that are at least 3 steps away from the entry node
     (CIKM 2010)
 We manually built the respective sub-clusters
 The evaluation algorithm stops once k-means finds the optimal solution
    Here it is k = 4
 Remark: ChEBI contains 20,000 chemical classes for our
  dataset; we found 150,000 (implicit) classes



Second Question to answer…
 Are these clusters usable for a document retrieval task?


Experimental Set Up
 Collection of 2,588 chemical documents from the
  Archive of Organic Chemistry (ARKIVOC)
 Each document is associated with its functional group clusters
  based on the entities it contains
 Precision/recall analysis by domain experts
    Representative subset of 10% of the entire collection
    Only entities occurring in > 20 but < 100 documents were considered
 From these, around 5% (18) were randomly selected as
  query terms
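Associating each document with the functional group clusters of the entities it contains amounts to building a small inverted index. The sketch below assumes two hypothetical lookup tables (document to entities, entity to cluster name); all identifiers are made up for illustration.

```python
from collections import defaultdict


def build_cluster_index(doc_entities, entity_cluster):
    doc_clusters = {}
    cluster_docs = defaultdict(set)
    for doc, entities in doc_entities.items():
        # Entities without a known cluster are skipped.
        clusters = {entity_cluster[e] for e in entities if e in entity_cluster}
        doc_clusters[doc] = clusters
        for cluster in clusters:
            cluster_docs[cluster].add(doc)
    return doc_clusters, dict(cluster_docs)


def retrieve(query_entity, entity_cluster, cluster_docs):
    # Return all documents filed under the query entity's cluster.
    return cluster_docs.get(entity_cluster.get(query_entity), set())


# Hypothetical toy collection.
doc_entities = {"doc1": ["e1", "e2"], "doc2": ["e3"], "doc3": ["e2", "unknown"]}
entity_cluster = {"e1": "c1", "e2": "c2", "e3": "c1"}
doc_clusters, cluster_docs = build_cluster_index(doc_entities, entity_cluster)
result = retrieve("e1", entity_cluster, cluster_docs)
print(sorted(result))
```

A query for e1 returns doc1 and doc2, because e3 in doc2 belongs to the same cluster as e1.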

Is the sub-cluster decomposition sensible?
 Recall around 93%
    Some entities from other sub-clusters are also relevant
 Precision on average up to 53% for k = 12
 Recall-oriented F2: 68%

[Figure: recall, precision, F1, and F2 (0% to 100%) for x-axis values 1 to 14]
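For reference, precision, recall, and the F-measures follow the standard set-based definitions, with F2 weighting recall more heavily than precision. The sketch uses hypothetical document identifiers. Note that scores averaged per query generally differ from an F-score computed from averaged precision and recall.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    # beta > 1 favours recall, beta < 1 favours precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical query: four documents retrieved, two of them relevant,
# and no relevant document missed.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2"])
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
print(p, r, round(f1, 3), round(f2, 3))
```

With high recall and moderate precision, F2 comes out above F1, which is why a recall-oriented task reports it.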
Is it possible to increase that?
 Do not just deliver all documents within the cluster
    Use a similarity function to rank the documents
    Based on Wikipedia categories (CIKM 2010)
    $$ s_{wc}(q_i, d_j) = \frac{c_{q_i d_j}}{c_{q_i}} \times \frac{c_{d_j}}{e_{d_j}} $$
 Mean Average Precision (MAP) reaches up to 72%
 It is enough to retrieve only documents within the same sub-cluster
[Figure: Mean Average Precision (65% to 73%) plotted against k = 1 to 14]
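Mean Average Precision can be sketched with its standard definition: for each query, average the precision at every rank where a relevant document appears, then take the mean over all queries. The rankings and relevance judgments below are hypothetical.

```python
def average_precision(ranking, relevant):
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    # runs: list of (ranking, relevant documents) pairs, one per query.
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


# Hypothetical rankings for two queries.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 3))
```

Ranking relevant documents higher raises the score even when the retrieved set stays the same, which is the effect a ranking function is meant to achieve.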
Even more findings (1)
 Comparison of number of entities for k = 1 and k = 12
 Averaged over all queries, the number of entities decreases by around 90%
 Recall does not decrease, thus high cluster quality

[Figure: per-query chart (0% to 100%) for queries 1 to 18, based on the entity counts below]

   Query   #Entities k = 1   #Entities k = 12
     1          4638               733
     2           164                23
     3         14657               372
     4          1699               158
     5         10139              1401
     6          1187               143
     7          1296               131
     8          7200               699
     9          5381              1078
    10           506                46
    11          4624               112
    12           539                88
    13           465                37
    14         19885              1234
    15         27423              6043
    16         10139              1012
    17         15347               689
    18           293                21
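The reduction can be recomputed from the per-query entity counts reported on this slide. The sketch below averages the relative decrease per query. Whether the talk averaged per query or over totals is not stated, so the averaging scheme is an assumption.

```python
# Entity counts per query, as listed on the slide.
entities_k1 = [4638, 164, 14657, 1699, 10139, 1187, 1296, 7200, 5381,
               506, 4624, 539, 465, 19885, 27423, 10139, 15347, 293]
entities_k12 = [733, 23, 372, 158, 1401, 143, 131, 699, 1078,
                46, 112, 88, 37, 1234, 6043, 1012, 689, 21]

# Relative decrease for each query when going from k = 1 to k = 12.
reductions = [1 - small / large
              for small, large in zip(entities_k12, entities_k1)]
mean_reduction = sum(reductions) / len(reductions)

print(f"mean reduction: {mean_reduction:.1%}")
print(f"smallest: {min(reductions):.1%}, largest: {max(reductions):.1%}")
```

Under this per-query averaging the result comes out slightly below 90%, in line with the figure quoted on the slide.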
Even more findings (2)
 Number of clusters including a certain percentage of all entities
 3,500 sub-clusters were reduced to 3% of the number of entities at k = 1
 In a faceted search scenario this is quite important


[Figure: number of clusters (0 to 4,000) plotted against the entity reduction factor in percent (1 to 100)]
Take Aways
 Simple clustering based on functional groups is not enough!
    Most clusters are too unspecific
 Sub-clustering with k-means and the Substructure fingerprint with
  Manhattan worked fine
 A group of domain experts found that almost all relevant
  documents (recall of 93%) are located in the respective sub-cluster
 Instead of just delivering all documents from the respective cluster,
  we also introduced a ranking measure based on Wikipedia
  categories to further enhance the precision (MAP 72%).
 The number of entities in the sub-clusters
  decreased by about 90% compared
  to the original functional group clusters.


www.L3S.de/~toennies




Thank You!


Backup
 PRODUCT       (?i).* Formation of \s [CHEMICAL]
 PRODUCT       (?i).* One-pot synthesis of \s [CHEMICAL]
 PRODUCT       (?i).* Preparation of (\s+[-\w|\p{InGreek}]*\s*){0,2}
               [CHEMICAL]




 Phenanthrene is a (3/1)
                                        Dicumarol is a (2/2)





More Related Content

Recently uploaded

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 


Talk catching the_drift

  • 1. Catching the Drift – Indexing Implicit Knowledge in Chemical Digital Libraries
      Benjamin Köhncke, Sascha Tönnies, & Wolf-Tilo Balke, L3S Research Center
      TPDL, Sep. 23 – 27, 2012, Cyprus
  • 2. Outline
      • Introducing the problem
      • Describing our way to handle this problem
      • Showing that this is a valid way
      • Is there still room for improvement?
  • 3. The Problem
      • In chemistry, search is entity-centered
      • Chemical entities can occur in different ways
          • Structures / images / string representations
          • Synonyms available
          • Already a tough task for indexing and retrieval (JCDL 2010)
      • The field of drug design is even more complex
          • Not only searching for a specific entity or similar entities
          • But for entities having the same or similar characteristic (chemical reaction)
      • Chemists have to use their implicit knowledge
          • No database available
  • 4. The Question
      • How can we reflect the chemist’s perception of chemical entities belonging to the same chemical class to support him during his search task?
  • 5. Use Case: Anti-Tuberculosis Drugs
      [Figure: comics taken from GiZGRAPHICS@fotolia.com]
  • 6. The Idea
      • Identify the functional groups of all occurring chemical entities and cluster the entities according to their sets of functional groups.
  • 7. The Workflow…
      [Figure: reaction scheme with the labels Reactant(s), Reagent(s), Reaction Conditions, and Product(s)]
  • 8. First Questions to Answer…
      • How to build meaningful clusters?
      • Experimental setup
          • Dump of the PubChem database containing 31.5 million entities
          • Calculation of functional groups
          • Extending the standard tool checkmol by “dimensions”
  • 9. Meaningful Clusters by Functional Groups?
      • Clustering by set of functional groups
      • Cluster name = MD5(set(functional group names))
      • Clusters of up to 100 entities are reasonable
      • Evaluated by domain experts
      • 97.84% of the clusters are already usable, but they contain only around 30% of all entities
      Number of clusters by cluster size (x = number of contained entities):
          x = 1                     773,092
          1 < x ≤ 10                816,817
          10 < x ≤ 100              226,147
          100 < x ≤ 1,000            36,535
          1,000 < x ≤ 10,000          3,615
          10,000 < x ≤ 100,000          143
          100,000 < x                     0
      [Figure: percentage of all clusters and percentage of all entities (y-axis, 0 to 100%) against the number of entities per cluster (x-axis)]
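The cluster naming rule on this slide, MD5 over the set of functional group names, can be sketched as follows. This is a minimal illustration and not the authors' code: the separator character, the sorting step used to make the set order-independent, and the toy group names are all assumptions.

```python
import hashlib
from collections import defaultdict


def cluster_name(functional_groups):
    """Return an MD5 hex digest that identifies a set of functional group names."""
    # Deduplicate and sort so the same set always yields the same key,
    # regardless of the order in which the groups were detected (assumed canonical form).
    canonical = "|".join(sorted(set(functional_groups)))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def cluster_entities(entity_groups):
    """Group entity ids by the MD5 key of their functional group set."""
    clusters = defaultdict(list)
    for entity_id, groups in entity_groups.items():
        clusters[cluster_name(groups)].append(entity_id)
    return dict(clusters)


# Hypothetical toy data; the names are illustrative, not actual checkmol output.
entities = {
    "entity_a": ["carboxylic acid", "aromatic compound"],
    "entity_b": ["aromatic compound", "carboxylic acid"],
    "entity_c": ["primary alcohol"],
}
clusters = cluster_entities(entities)
```

Because the key depends only on the set, `entity_a` and `entity_b` land in the same cluster even though their groups are listed in a different order.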
  • 10. Dividing Big Clusters into Sub-Clusters
      • Sub-clustering of clusters containing more than 100 entities
      • We have to find suitable similarity measures
      • Using similarity functions based on fingerprints
      • Only uncorrelated combinations chosen (JCDL 2011)
      • 100 clusters with more than 1,000 entities chosen at random
      • 10 queries chosen at random
      • Similarity calculation between the query and all other cluster entities
      • For which of the measures are the top-X ranked entities in the same functional group cluster?
  • 11. Results
      • Top 100 candidates
          • Estate fingerprint with Russell-Rao, Yule, Manhattan, or Simpson
          • Substructure fingerprint with Russell-Rao or Manhattan
      • Top 1000 candidates
          • Substructure fingerprint with Manhattan
      • Overall: Substructure fingerprint with Manhattan
      [Figure: two bar charts comparing the Extended, Estate, Graph-only, MACCS, and Substructure fingerprints under the Forbes, Russell-Rao, Yule, Manhattan, and Simpson measures]
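The measures named on these slides can all be written in terms of the four counts of a binary contingency table. The sketch below uses the common textbook definitions on plain 0/1 lists; the exact variants and normalizations used in the talk are not stated, so treat these formulas as assumptions. The fingerprints here are hypothetical bit lists, not output of a real fingerprinter.

```python
def contingency(x, y):
    """Count bit agreements: a = both set, b = only x, c = only y, d = neither."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = len(x) - a - b - c
    return a, b, c, d


def russell_rao(x, y):
    a, b, c, d = contingency(x, y)
    return a / (a + b + c + d)


def simpson(x, y):
    a, b, c, _ = contingency(x, y)
    smaller = min(a + b, a + c)
    return a / smaller if smaller else 0.0


def forbes(x, y):
    a, b, c, d = contingency(x, y)
    denom = (a + b) * (a + c)
    return (a + b + c + d) * a / denom if denom else 0.0


def yule(x, y):
    a, b, c, d = contingency(x, y)
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0


def manhattan(x, y):
    """Manhattan distance; on bit vectors this is the number of differing bits."""
    _, b, c, _ = contingency(x, y)
    return b + c


def rank_by_manhattan(query, candidates):
    """Return candidate ids ordered from closest to farthest from the query."""
    return sorted(candidates, key=lambda cid: manhattan(query, candidates[cid]))


# Hypothetical 8-bit fingerprints.
fp_query = [1, 1, 0, 1, 0, 0, 1, 0]
fp_other = [1, 0, 0, 1, 0, 1, 1, 0]
```

Note that Manhattan is a distance (smaller means more similar), while the other four are similarities, so a ranking has to sort in the opposite direction depending on the measure.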
  • 12. In Search of the K (1)
      • We are using k-means clustering (WEKA implementation)
      • Each group must contain at least one object
      • Each object must belong to exactly one group
      • The aim: all entities in a sub-cluster have the same chemical class
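For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the standard Lloyd iteration. It is not the WEKA implementation used in the talk: the Euclidean distance, the seeded random initialization, and the handling of empty groups are simplifying assumptions, and the toy points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Assign every point to exactly one of k groups; return (assignment, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its group
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty group simply keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(points, k=2)
```

The returned `labels` list has one entry per point, which reflects the constraint that each object belongs to exactly one group. This sketch does not enforce non-empty groups; a production implementation would need to re-seed empty ones.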
  • 13. In Search of the K (2)
      • We took the domain-specific ontology ChEBI as ground truth
      • We randomly took 2,000 clusters (5%)
      • Only clusters containing entities also included in ChEBI
      • Idea: taking the ontology class as cluster label
      • Only nodes that are at least 3 steps away from the entry node (CIKM 2010)
      • We manually built the respective sub-clusters
      • Evaluation: the algorithm stops once k-means has found the optimal solution
      • Here it is k = 4
      • Remark: ChEBI contains 20,000 chemical classes for our dataset; we found 150,000 (implicit) classes
  • 14. Second Question to Answer…
      • Are these clusters usable for a document retrieval task?
      • Experimental setup
          • Collection of 2,588 chemical documents from the Archive of Organic Chemistry (ARKIVOC)
          • Each document is associated with its functional group clusters based on the entities it contains
          • Precision/recall analysis by domain experts
          • Representative subset of 10% of the entire collection
          • Only entities occurring in more than 20 but fewer than 100 documents were taken
          • From these documents we randomly selected around 5% (18) as query terms
  • 15. Is the Sub-Cluster Decomposition Sensible?
      • Recall around 93%
      • Some entities from other sub-clusters are also relevant
      • Precision on average up to 53% for k = 12
      • Recall-oriented F2: 68%
      [Figure: recall, precision, F1, and F2 (y-axis, 0 to 100%) plotted over x-axis values 1 to 14]
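As a reminder of how the measures on this slide relate, here is a small sketch of precision, recall, and the F-beta score, where beta = 2 weights recall more heavily than precision. The document ids are hypothetical toy data and are not taken from the evaluation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical query result: 8 retrieved documents, 5 relevant ones, 4 in common.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant = ["d1", "d2", "d3", "d4", "d9"]
p, r = precision_recall(retrieved, relevant)
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
```

When scores are averaged over several queries, the mean of the per-query F values is in general not the same as the F value computed from the mean precision and mean recall.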
  • 16. Is It Possible to Increase That?
      • Not just deliver all documents within the cluster
      • Using a similarity function to rank the documents
      • Based on Wikipedia categories (CIKM 2010)
      • $s_{wc}(q_i, d_j) = \frac{c_{q_i d_j}}{c_{q_i}} \times \frac{c_{d_j}}{e_{d_j}}$
      • Evaluation: Mean Average Precision of up to 72%
      • It is enough to retrieve only documents within the same sub-cluster
      [Figure: Mean Average Precision (y-axis, 65 to 73%) against k = 1 to 14 (x-axis)]
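Mean Average Precision, the measure reported on this slide, rewards rankings that place relevant documents early. The sketch below shows the usual definition; the rankings and relevance judgments are hypothetical toy data, and the ranking function itself (the Wikipedia-category score) is not reproduced here.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this cut-off
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical rankings for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = 1/2
]
map_score = mean_average_precision(runs)
```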
  • 17. Even More Findings (1)
      • Comparison of the number of entities for k = 1 and k = 12
      • Averaged over all queries, the number of entities decreases by around 90%
      • Recall does not decrease, which indicates high cluster quality
      Number of entities per query:
          Query   #Entities (k = 12)   #Entities (k = 1)
          1                    733               4,638
          2                     23                 164
          3                    372              14,657
          4                    158               1,699
          5                  1,401              10,139
          6                    143               1,187
          7                    131               1,296
          8                    699               7,200
          9                  1,078               5,381
          10                    46                 506
          11                   112               4,624
          12                    88                 539
          13                    37                 465
          14                 1,234              19,885
          15                 6,043              27,423
          16                 1,012              10,139
          17                   689              15,347
          18                    21                 293
      [Figure: bar chart (y-axis, 0 to 100%) for queries 1 to 18]
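The average reduction quoted on this slide can be checked directly from the per-query entity counts it lists. The short calculation below assumes the intended figure is the unweighted mean of the per-query relative reductions; the variable names are illustrative.

```python
# Entity counts per query as listed on the slide (queries 1 to 18).
entities_k12 = [733, 23, 372, 158, 1401, 143, 131, 699, 1078,
                46, 112, 88, 37, 1234, 6043, 1012, 689, 21]
entities_k1 = [4638, 164, 14657, 1699, 10139, 1187, 1296, 7200, 5381,
               506, 4624, 539, 465, 19885, 27423, 10139, 15347, 293]

# Relative reduction per query when going from k = 1 to k = 12.
reductions = [1 - small / large for small, large in zip(entities_k12, entities_k1)]

# Unweighted mean over all queries (assumed to be the averaging used in the talk).
mean_reduction = sum(reductions) / len(reductions)
print(f"mean reduction: {mean_reduction:.1%}")
```

Under that assumption the result comes out close to the roughly 90% stated above.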
  • 18. Even More Findings (2)
      • Number of clusters including a certain percentage of all entities
      • 3,500 sub-clusters have been reduced to 3% of #entities for k = 1
      • Considering a faceted search scenario, this is quite important
      [Figure: number of clusters (y-axis, 0 to 4,000) against entity reduction factor in percent (x-axis, 1 to 100)]
  • 19. Takeaways
      • Simple clustering based on functional groups is not enough!
      • Most clusters are too unspecific
      • Sub-clustering with k-means and the Substructure fingerprint with Manhattan worked fine
      • A group of domain experts found that almost all relevant documents (recall of 93%) are located in the respective sub-cluster
      • Instead of just delivering all documents from the respective cluster, we also introduced a ranking measure based on Wikipedia categories to further enhance the precision (MAP 72%)
      • The number of entities in the sub-clusters decreases dramatically, by about 90%, compared to the original functional group clusters
  • 20. Thank You!
      www.L3S.de/~toennies
  • 21. Backup
      • PRODUCT (?i).* Formation of \s [CHEMICAL]
      • PRODUCT (?i).* One-pot synthesis of \s [CHEMICAL]
      • PRODUCT (?i).* Preparation of (\s+[-\w|\p{InGreek}]*\s*){0,2} [CHEMICAL]
      • Phenanthrene is a (3/1)
      • Dicumarol is a (2/2)