Domain Identification for Linked Open Data

Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.

Notes
  • The LOD cloud diagram is outdated; it was last updated in 2011.
  • Wikipedia has 4.3 million articles.
  • User agreement on the appropriateness of the terms. The graph shows how many users agreed on how many terms being appropriate descriptors, out of a total of 20 users (= 100%, horizontal axis) and 120 terms (= 100%, vertical axis).
  • CKAN ranked best for 12 terms, while our approach ranked best for 9 terms; 27 users participated in the study. We generate the second-best results with only 30 datasets.
  • Domain Identification for Linked Open Data

    1. Domain Identification for Linked Open Data
       Sarasi Lalithsena, Pascal Hitzler, Amit Sheth (Kno.e.sis Center, Wright State University, Dayton, OH)
       Prateek Jain (IBM T.J. Watson Research Center, Yorktown, NY, USA)
       WI 2013, Atlanta, GA, USA
    2. Motivation
       [Figure: LOD cloud; 262 datasets; 870 alive datasets]
       “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/”
    3. Motivation
       [Figure: Lingvoj, Climbdata]
       Need better ways to discover, describe, and organize datasets
    4. Problem
       • How do we identify the relevant datasets in this structured knowledge space?
         – How do we create a registry of topics that describe the domain of a dataset?
    5. State of the Art: Existing Problems in Dataset Lookup
       • Rely on manual tagging provided by users and on a manual reviewing process
         – CKAN data hub, LOD Diagram
       • Rely on keywords and metadata provided by users
         – CKAN data hub, LODStats
       • Need to know instances to start exploring the datasets
         – Semantic Search Engines (SSE) such as Sigma, Swoogle and Watson
       • Need to know seed URIs to find the relevant datasets
         – Federated Querying Systems for LOD
    6. What do we propose?
       • Introduce a systematic and sophisticated way to identify possible domains, topics, and tags (Topic Domains) to better describe these datasets
       • What can these topic domains be?
         – A predefined list
         – The schema types of each dataset
    7. What do we propose?
       Knowledge bases + category system → Topic Domains
    8. How do we address the previous problems?
       • Use the category system of existing knowledge sources as the vocabulary to describe the domain
         – No need to rely on a predefined set of tags
         – No need to rely on metadata and keywords
       • Identify the topic domains automatically
       • The vocabulary can be used to search and organize the datasets
    9. Our Approach: Freebase
       • Use Freebase as our knowledge source to identify the topic domains
       • Why Freebase?
         – Wide coverage: it has 39 million topics
         – Simple category hierarchy system
       • The Freebase category system categorizes each topic into types, and types are grouped into domains
         – Example: music (domain), artist (type)
       • We use Freebase types and domains as our topic domains
    10. Our Approach: STEPS
        1. Instance Identification
        2. Category Hierarchy Creation
        3. Category Hierarchy Merging
        4. Candidate Category Hierarchy Selection
        5. Frequency Count Generation
    11. Our Approach: STEP 1, Instance Identification
        – Extract the instances of the dataset together with their types
        – Extract the human-readable values of each instance and its type, e.g. Granite and its type Rock
        – Identify the closely related Freebase instance for each instance in our dataset
        Example instances (label, type): Ignimbrite, Rock; Slate, Rock; Granite, Rock
        Freebase matches: http://www.freebase.com/m/01tx7r, http://www.freebase.com/m/01c_9j, http://www.freebase.com/m/03fcm
    12. Our Approach: Instance Identification
        • We also attach the type information to the query string
        [Figure: the label “Apple” disambiguated as “Apple Company” and “Apple Fruit” by appending the type]
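
The idea of attaching the type to the query string can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_query` and the simple space-joined format are assumptions, and the slides do not say how the resulting string is sent to Freebase.

```python
from typing import Optional


def build_query(instance_label: str, type_label: Optional[str] = None) -> str:
    """Join an instance label and its type label into one lookup string.

    Hypothetical helper: the slides only state that the type information
    is attached to the query string, not the exact format.
    """
    parts = [instance_label.strip()]
    if type_label and type_label.strip():
        parts.append(type_label.strip())
    return " ".join(parts)


# Labels taken from the slides; how they pair up here is illustrative.
for label, type_label in [("Apple", "Fruit"), ("Apple", "Company"), ("Granite", "Rock")]:
    print(build_query(label, type_label))
```

With the type appended, the two readings of "Apple" produce different query strings, which is what lets a lookup tell them apart.
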
    13. Our Approach: STEP 2, Category Hierarchy Creation
        Freebase categories have the form {domain/type}, e.g. /geology/rock_type for Ignimbrite
        [Figure: category hierarchy per instance]
          Ignimbrite: geology → rock type; geography → mountain, mountain range
          slate: geology → rock type; geography → mountain; music → release track recording
          granite: geology → rock type; geography → mountain
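
A small sketch of how a per-instance hierarchy could be derived from `/domain/type` identifiers such as `/geology/rock_type`. The function name `hierarchy_from_type_ids` and the dictionary representation are assumptions made for illustration; only `/geology/rock_type` appears literally in the slides, and the other identifiers below are assumed spellings of the categories shown in the figure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def hierarchy_from_type_ids(type_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Group type names under their domain, given IDs like '/geology/rock_type'."""
    hierarchy: Dict[str, Set[str]] = defaultdict(set)
    for type_id in type_ids:
        parts = type_id.strip("/").split("/")
        if len(parts) != 2 or not all(parts):
            continue  # ignore anything that is not exactly /domain/type
        domain, type_name = parts
        hierarchy[domain].add(type_name)
    return dict(hierarchy)


# Assumed identifiers for the Ignimbrite example.
ignimbrite = hierarchy_from_type_ids(
    ["/geology/rock_type", "/geography/mountain", "/geography/mountain_range"]
)
print(ignimbrite)
```

Each instance ends up with a two-level tree (domains, then types), which is the shape the later steps operate on.
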
    14. Our Approach: STEP 3, Category Hierarchy Merging
        [Figure: the category hierarchies of Ignimbrite, slate, and granite, covering the domains geology, geography, and music]
    15. Our Approach: STEP 4, Candidate Category Hierarchy Selection
        Filter out insignificant category hierarchies using simple heuristics
        [Figure: the category hierarchies of Ignimbrite, slate, and granite, covering the domains geology, geography, and music]
    16. Our Approach: STEP 5, Frequency Count Generation
        Count the number of occurrences of each category (the number of instances having the given category)

        Term            Frequency   Parent Node
        geology         3
        rock type       3           geology
        mountain range  1           geography
        …               …           …
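
The frequency count can be sketched as below: for every category, count how many instances carry it. The names `category_frequencies` and `selected` are hypothetical, and the input assumes that the music hierarchy of slate has already been dropped by the selection step, which the slides do not state explicitly.

```python
from collections import Counter
from typing import Dict, Set

Hierarchy = Dict[str, Set[str]]  # domain -> types under that domain


def category_frequencies(hierarchies: Dict[str, Hierarchy]) -> Counter:
    """Count instances per category, keyed by (category, parent).

    Domains have parent None; types have their domain as parent.
    """
    counts: Counter = Counter()
    for hierarchy in hierarchies.values():
        for domain, types in hierarchy.items():
            counts[(domain, None)] += 1
            for type_name in types:
                counts[(type_name, domain)] += 1
    return counts


# Illustrative input based on the three example instances in the slides.
selected = {
    "Ignimbrite": {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}},
    "slate": {"geology": {"rock type"}, "geography": {"mountain"}},
    "granite": {"geology": {"rock type"}, "geography": {"mountain"}},
}
counts = category_frequencies(selected)
for (term, parent), frequency in counts.most_common():
    print(term, frequency, parent or "-")
```

On this input the counts for geology, rock type, and mountain range agree with the table on the slide (3, 3, and 1).
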
    17. Implementation
        • MapReduce deployment
        [Figure: <Inst, type> pairs flow through mappers 1..n (STEPs 2 and 3), reducers 1..m (STEP 4), and post-processing (STEP 5)]
        Instances that belong to the same type go to a single reducer
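
The routing rule on this slide (instances of the same type reach the same reducer) can be imitated in memory as follows. This sketch shows only the grouping by type; it does not run on a MapReduce framework, the function names are hypothetical, and the reducer body is a placeholder rather than the authors' STEP 4 logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Emit (type, instance) so that the type becomes the grouping key."""
    for instance, type_label in pairs:
        yield type_label, instance


def shuffle(emitted: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group emitted values by key, as a framework's shuffle stage would."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(type_label: str, instances: List[str]) -> Tuple[str, int]:
    """Placeholder reducer: report how many instances arrived for one type."""
    return type_label, len(instances)


pairs = [("Ignimbrite", "Rock"), ("Slate", "Rock"), ("Granite", "Rock"), ("Apple", "Fruit")]
grouped = shuffle(map_phase(pairs))
results = [reduce_phase(t, insts) for t, insts in grouped.items()]
print(results)
```

Keying on the type means a reducer sees every instance of that type together, which is what a per-type selection step would need.
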
    18. Evaluation
        • We ran our experiments on 30 LOD datasets
        • Evaluation
          – Appropriateness of the identified domain: user study
          – Effectiveness in finding the datasets
    19. Appropriateness of the identified domain
        • Selected four high-frequency domains and types from our results
        • Mixed them with four other randomly selected domains and types
        • Asked users to select the terms that best represent the higher-level domains of the dataset
          – 20 users took part
          – 50% of the users agreed on 73% of the terms (88 out of 120)
    20. Appropriateness of the identified domain
        [Table: terms with the highest user agreement for each dataset. A star (*) indicates that the term was also the highest ranked by our system (for 22 datasets).]
    21. Evaluation
        • Appropriateness of the identified domain
          – User study
        • Effectiveness in finding the datasets
          1. User study with three other SE
    22. Effectiveness in finding the datasets
        • Developed a search application using the normalized frequency count
        • User study against three existing state-of-the-art systems: CKAN, LODStats, and Sigma
        • Term selection
        • The top ten results are retrieved
        • Asked users to rank the result sets by preference, from 1 (best) to 4 (worst)
        • Calculated a user preference score using a weighted average
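
One way to read "user preference score using a weighted average" is the mean rank a system received, with each rank weighted by the number of users who assigned it. The slides do not give the exact weighting, so the sketch below is an assumption, and the vote counts in `example_votes` are made up for illustration.

```python
from typing import Dict


def preference_score(rank_counts: Dict[int, int]) -> float:
    """Weighted average rank: sum(rank * users) / total users (lower is better)."""
    total_users = sum(rank_counts.values())
    if total_users == 0:
        raise ValueError("no rankings given")
    weighted_sum = sum(rank * users for rank, users in rank_counts.items())
    return weighted_sum / total_users


# Hypothetical votes from 27 users for one system and one term.
example_votes = {1: 18, 2: 9, 3: 0, 4: 0}
print(round(preference_score(example_votes), 3))
```

Under this reading a score of 1 means every user ranked the system best and a score of 4 means every user ranked it worst.
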
    23. Effectiveness in finding the datasets

        Term          Our Approach   CKAN     LODStats   Sigma
        music         2.037          3.74     3.11       1.333
        artist        2.815          3.926    1          2.259
        biology       3.481          3.333    1          2.185
        animal        2.926          1.63     3.481      1.926
        geology       2.852          3.666    1          2.481
        drug          2.926          3.148    2          2.555
        gene          2.148          3.333    3.074      1.222
        university    3.185          3.148    2.37       1.222
        food          3.259          2.296    3          1.259
        language      3.148          3.74     1          2.11
        spacecraft    4              4        1          2
        conference    2.814          3.555    1          2.666
        astronaut     4              4        1          2
        composer      3.815          3.037    1          2.11
        tv program    3.666          2.923    1          2.370
        instrument    3.852          2        2          3.148
        recipe        3.926          2        2          3.074
        student       2              3.889    2          3.111
        phenotypes    2              3.923    2          3.037
        energy        1              3.74     3.26       3.03
    24. Evaluation
        • Appropriateness of the identified domain
          – User study
        • Effectiveness in finding the datasets
          1. User study with three other SE
          2. Evaluate CKAN as the baseline
    25. Evaluate CKAN as the baseline

        Term          P       R1      F1      R2      F2
        music         0.286   1       0.445   0.1     0.148
        artist        0.4     1       0.571   0.2     0.267
        biology       0.125   1       0.222   0.333   0.182
        animal        0       0       n/a     0       n/a
        geology       0       0       n/a     0       n/a
        drug          0.6     0.667   0.632   0.75    0.667
        gene          0.333   1       0.5     0.125   0.182
        university    0.5     1       0.667   0.051   0.093
        food          0       0       n/a     0       n/a
        language      1       1       1       0.045   0.0861
        spacecraft    1       1       1       1       1
        conference    1       1       1       0.125   0.222
        astronaut     1       1       1       1       1
        composer      0.25    1       0.4     0.5     0.333
        tv program    0       0       n/a     0       n/a
        instrument    0       1       0       1       0
        recipe        0       1       0       1       0
        student       1       0       0       0       0
        phenotypes    1       0       0       0       0
        energy        1       0       0       0       0
    26. Evaluation
        • Appropriateness of the identified domain
          – User study
        • Effectiveness in finding the datasets
          1. User study with three other SE
          2. Evaluate CKAN as the baseline
          3. Evaluate both CKAN and our approach using a manually curated gold standard
    27. Evaluation with a manually curated gold standard

                      CKAN                              Our Approach
        Term          Precision   Recall   F-Measure    Precision   Recall   F-Measure
        music         1           0.5      0.667        0.571       1        0.727
        artist        1           0.25     0.4          0.8         1        0.9
        biology       1           0.2      0.333        0.625       1        0.769
        animal        0           0        n/a          0.333       1        0.5
        geology       0           0        n/a          1           0.5      0.667
        drug          1           0.6      0.75         1           1        1
        gene          1           0.333    0.5          1           1        1
        university    0.5         0.667    0.572        0.6         1        0.75
        food          0           0        n/a          0.25        1        0.4
        language      1           1        1            1           1        1
        spacecraft    1           1        1            1           1        1
        conference    1           1        1            1           1        1
        tv program    0           0        n/a          1           1        1
        instrument    1           0        0            0.75        1        0.857
        astronaut     1           1        1            1           1        1
        composer      1           0.25     0.4          1           1        1
        recipe        1           0        0            1           1        1
        phenotypes    1           1        1            1           0        0
        student       1           0.5      0.667        1           0        0
        energy        1           0.333    0.5          1           0        0
        Mean          0.775       0.432    0.489        0.846       0.825    0.728
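
For reference, the F-Measure column can be related to precision and recall with the standard harmonic-mean formula. That the slides use exactly this definition is an assumption, although rows such as CKAN on music (precision 1, recall 0.5, F-Measure 0.667) are consistent with it; the function name `f_measure` and the use of `None` for the "n/a" cells are choices made for this sketch.

```python
from typing import Optional


def f_measure(precision: float, recall: float) -> Optional[float]:
    """Harmonic mean of precision and recall; None when both are zero ("n/a")."""
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs taken from the gold-standard table.
print(round(f_measure(1.0, 0.5), 3))    # CKAN, music
print(round(f_measure(0.571, 1.0), 3))  # our approach, music
print(f_measure(0.0, 0.0))              # CKAN, animal (reported as n/a)
```
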
    28. Conclusion and Future Work
        • Our approach helps to categorize datasets systematically
        • We demonstrate the potential of using the categorization to find relevant datasets
        • We utilize a diverse classification hierarchy such as Freebase
        • This work may also be important for other potential applications, such as browsing and interlinking
        • We plan to improve the domain coverage by using knowledge sources such as Wikipedia and Yago
        • We plan to compare the interpretations given by multiple knowledge sources to see which one gives a better interpretation
    29. Thank You! Questions?
        http://knoesis.wright.edu/researchers/sarasi
        sarasi@knoesis.org
        Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, Ohio, USA