Domain Identification for Linked Open Data

Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.

Slide notes:
  • Outdated cloud diagram: last updated in 2011
  • Wikipedia: 4.3 million articles
  • CKAN ranked best for 12 terms while our approach ranked best for 9 terms
  • Transcript

    • 1. Domain Identification for Linked Open Data. Sarasi Lalithsena, Pascal Hitzler, Amit Sheth (Kno.e.sis Center, Wright State University, Dayton, OH); Prateek Jain (IBM T.J. Watson Research Center, Yorktown, NY, USA). WI 2013, Atlanta, GA, USA.
    • 2. Motivation. The LOD cloud: 262 datasets in the cloud diagram, 870 datasets alive. ("Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/")
    • 3. Motivation. Example datasets: Lingvoj, Climbdata. We need better ways for dataset discovery, description, and organization.
    • 4. Problem. How do we identify relevant datasets in this structured knowledge space? In particular, how do we create a registry of topics that describe the domain of a dataset?
    • 5. State of the Art: CKAN. To organize this large cloud, CKAN encourages users to tag their datasets into the following domains: media, geography, life sciences, publications, government, e-commerce, social web, user-generated content, schemata, cross-domain. CKAN administrators then manually go through these taggings and organize the diagram. CKAN provides a search over the datasets based on this manual tagging and on keywords.
    • 6. State of the Art: CKAN. But: a fixed set of tags cannot cope with the increasing diversity of the datasets (for example, what would the tags for the Lingvoj dataset be?); the manual reviewing process will soon be unsustainable; and the classification is subjective.
    • 7. State of the Art: LODStats. A stream-based approach to collecting statistics about datasets. Allows searching for datasets based on the keywords and metadata provided by data publishers.
    • 8. State of the Art: Other. Semantic search engines (SSEs) such as Sigma, Swoogle, and Watson allow searching for instances and return the related instance URIs, but they are not designed for dataset search. Federated querying systems over LOD datasets need to know seed URIs in order to find the relevant datasets.
    • 9. State of the Art: Existing problems for dataset lookup. Current approaches rely on manual tagging by users and a manual reviewing process; they rely on keywords and metadata provided by users; they need seed URIs to find the relevant datasets; and they need known instances to start exploring the datasets.
    • 10. What do we propose? A systematic and sophisticated way to identify possible domains, topics, or tags (topic domains) that better describe these datasets. What can these topic domains be? A predefined list, or the types in the schema of each dataset.
    • 11. What do we propose? Knowledge bases and their category systems as the source of topic domains.
    • 12. How do we address the previous problems? We use the category system of existing knowledge sources as the vocabulary to describe the domain, so there is no need to rely on a predefined set of tags or on metadata and keywords. The topic domains are identified automatically, and the vocabulary can be used to search and organize the datasets.
    • 13. Our Approach: Freebase. We use Freebase as our knowledge source for identifying topic domains. Why Freebase? Wide coverage (39 million topics) and a simple category hierarchy: Freebase categorizes each topic into types, and types are grouped into domains (e.g., the type Artist belongs to the domain Music). We use Freebase types and domains as our topic domains (see the sketch below).
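      A minimal illustration (not from the paper) of the Freebase category structure the slide relies on: type IDs have the form /<domain>/<type>, so splitting an ID such as /music/artist yields the (domain, type) pair used as a topic domain.

      def split_type_id(type_id):
          """Split a Freebase type ID such as '/music/artist' into (domain, type)."""
          parts = type_id.strip("/").split("/")
          if len(parts) != 2:          # ignore IDs that are not simple domain/type pairs
              return None
          domain, type_name = parts
          return domain, type_name

      print(split_type_id("/music/artist"))       # ('music', 'artist')
      print(split_type_id("/geology/rock_type"))  # ('geology', 'rock_type')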
    • 14. Our Approach: Steps. (1) Instance Identification; (2) Category Hierarchy Creation; (3) Category Hierarchy Merging; (4) Candidate Category Hierarchy Selection; (5) Frequency Count Generation.
    • 15. Our Approach, Step 1: Instance Identification. Extract the instances of the dataset together with their types, and extract the human-readable labels of instances and types (e.g., Granite with its type Rock); then identify the most closely related Freebase instance for each instance in our dataset. Example instances: Ignimbrite (Rock), Slate (Rock), Granite (Rock), matched to the Freebase topics http://www.freebase.com/m/01tx7r, http://www.freebase.com/m/01c_9j, http://www.freebase.com/m/03fcm. (A sketch of the extraction step follows below.)
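      A minimal sketch of how Step 1 could pull instances, types, and their labels from a dataset's SPARQL endpoint. The endpoint URL is a placeholder and the use of SPARQLWrapper is an assumption for illustration; the paper does not prescribe this code.

      from SPARQLWrapper import SPARQLWrapper, JSON

      ENDPOINT = "http://example.org/sparql"  # placeholder endpoint for the LOD dataset

      QUERY = """
      PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT ?instLabel ?typeLabel WHERE {
          ?inst rdf:type ?type .
          ?inst rdfs:label ?instLabel .
          ?type rdfs:label ?typeLabel .
      }
      LIMIT 1000
      """

      sparql = SPARQLWrapper(ENDPOINT)
      sparql.setQuery(QUERY)
      sparql.setReturnFormat(JSON)
      results = sparql.query().convert()

      # List of (instance label, type label) pairs, e.g. [('Granite', 'Rock'), ...]
      instance_type_pairs = [
          (row["instLabel"]["value"], row["typeLabel"]["value"])
          for row in results["results"]["bindings"]
      ]
      print(instance_type_pairs[:5])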
    • 16. Our Approach, Step 1 (continued): Instance Identification. We attach the type information to the query string to disambiguate: querying "Apple" alone may match either Apple (Company) or Apple (Fruit), whereas querying "Apple Fruit" resolves to Apple (Fruit). (A sketch follows below.)
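      A sketch of the disambiguation idea from this slide. search_freebase is a hypothetical stand-in for the (since retired) Freebase search service, assumed to return a relevance-ranked list of candidate topics; it is not an actual API call.

      def best_match(instance_label, type_label, search_freebase):
          """Query with '<label> <type label>' so ambiguous labels resolve correctly."""
          query = f"{instance_label} {type_label}"   # e.g. "Apple Fruit"
          candidates = search_freebase(query)        # assumed: [(mid, name, score), ...]
          return candidates[0] if candidates else None

      # Querying "Apple" alone may rank Apple (the company) first;
      # querying "Apple Fruit" is expected to rank the fruit topic first.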
    • 17. Our Approach, Step 2: Category Hierarchy Creation. For each matched Freebase instance, retrieve its {domain/type} categories (e.g., /geology/rock_type) and build a category hierarchy. Example: Ignimbrite: geology/rock type, geography/mountain, geography/mountain range; Slate: geology/rock type, geography/mountain, music/release track, music/recording; Granite: geology/rock type, geography/mountain.
    • 18. Our Approach, Step 3: Category Hierarchy Merging. Merge the per-instance category hierarchies into a single hierarchy for the dataset.
    • 19. Our Approach, Step 4: Candidate Category Hierarchy Selection. Filter out insignificant category hierarchies using simple heuristics.
    • 20. Our Approach, Step 5: Frequency Count Generation. Count the number of occurrences of each category (the number of instances having that category). Example:
      Term            Frequency  Parent Node
      geology         3
      rock type       3          geology
      mountain range  1          geography
      ...             ...        ...
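      A sketch, under simplifying assumptions, of how the per-instance categories from Steps 2-4 can be merged and counted as in Step 5, using the toy example from the slides; the paper's actual data structures may differ.

      from collections import Counter

      # (domain, type) categories per matched instance, as in the slide example.
      instance_categories = {
          "Ignimbrite": [("geology", "rock type"), ("geography", "mountain"),
                         ("geography", "mountain range")],
          "Slate":      [("geology", "rock type"), ("geography", "mountain"),
                         ("music", "release track"), ("music", "recording")],
          "Granite":    [("geology", "rock type"), ("geography", "mountain")],
      }

      domain_counts = Counter()   # frequency of each domain
      type_counts = Counter()     # frequency of each (domain, type) pair
      for cats in instance_categories.values():
          domain_counts.update({d for d, _ in cats})   # count a domain once per instance
          type_counts.update(set(cats))

      print(domain_counts["geology"])                       # 3
      print(type_counts[("geology", "rock type")])          # 3, parent node: geology
      print(type_counts[("geography", "mountain range")])   # 1, parent node: geography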
    • 21. Implementation: MapReduce deployment. Mappers receive <instance, type> pairs and carry out steps 2 and 3; reducers carry out step 4; step 5 runs as post-processing. Instances belonging to the same type go to a single reducer.
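      A plain-Python sketch of the partitioning idea only (the actual system runs on a MapReduce cluster): mappers emit (type, instance) pairs and the shuffle groups records by type, so all instances of one type reach a single reducer. The example records are made-up placeholders.

      from collections import defaultdict

      def mapper(record):
          instance, rdf_type = record
          yield rdf_type, instance          # key by type

      def reducer(rdf_type, instances):
          # Work that needs all instances of one type runs here,
          # e.g. building and filtering their category hierarchies.
          return rdf_type, sorted(instances)

      records = [("Granite", "Rock"), ("Slate", "Rock"), ("Merlot", "Wine")]

      # "Shuffle": group mapper output by key.
      groups = defaultdict(list)
      for record in records:
          for key, value in mapper(record):
              groups[key].append(value)

      for rdf_type, instances in groups.items():
          print(reducer(rdf_type, instances))
      # ('Rock', ['Granite', 'Slate'])
      # ('Wine', ['Merlot'])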
    • 22. Evaluation. We ran our experiments with 30 LOD datasets. Two aspects were evaluated: the appropriateness of the identified domains (via a user study) and the effectiveness in finding datasets.
    • 23. Evaluation: Appropriateness of the identified domain. We selected the four most frequent domains and types from our results, mixed them with four other randomly selected domains and types, and asked users to select the terms that best represent the higher-level domains of the dataset. With 20 users, 50% of the users agreed on 73% of the terms (88 out of 120).
    • 24. Evaluation: Appropriateness of the identified domain. Table: terms with the highest user agreement for each dataset; a star (*) marks terms that were also ranked highest by our system (the case for 22 datasets).
    • 25. Evaluation overview: appropriateness of the identified domain (user study) and effectiveness in finding the datasets. For effectiveness: (1) a user study against three other search engines.
    • 26. Evaluation: Effectiveness in finding the datasets. We developed a search application using the normalized frequency counts and ran a user study against three existing state-of-the-art systems: CKAN, LODStats, and Sigma. For each selected term, the top ten results were retrieved from each system and users ranked which result set they preferred, from 1 (best) to 4 (worst). A user preference score was then calculated using a weighted average (a sketch of one possible scoring follows below).
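      The slide does not spell out the weighting, so the following is one plausible reading, stated as an assumption: each user's rank (1 = best, 4 = worst) is converted to points and averaged per system and term, so higher scores mean stronger preference.

      def preference_score(ranks):
          """ranks: the ranks (1-4) that users gave one system for one term."""
          points = [5 - r for r in ranks]   # rank 1 -> 4 points, rank 4 -> 1 point
          return sum(points) / len(points)

      # Hypothetical votes from five users for one system on one term:
      print(round(preference_score([1, 2, 1, 3, 1]), 3))   # 3.4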
    • 27. Evaluation results (user preference score per term):
      Term        Our Approach  CKAN   LODStat  Sigma
      music       2.037         3.74   3.11     1.333
      artist      2.815         3.926  1        2.259
      biology     3.481         3.333  1        2.185
      animal      2.926         1.63   3.481    1.926
      geology     2.852         3.666  1        2.481
      drug        2.926         3.148  2        2.555
      gene        2.148         3.333  3.074    1.222
      university  3.185         3.148  2.37     1.222
      food        3.259         2.296  3        1.259
      language    3.148         3.74   1        2.11
      spacecraft  4             4      1        2
      conference  2.814         3.555  1        2.666
      astronaut   4             4      1        2
      composer    3.815         3.037  1        2.11
      tv program  3.666         2.923  1        2.370
      instrument  3.852         2      2        3.148
      recipe      3.926         2      2        3.074
      student     2             3.889  2        3.111
      phenotypes  2             3.923  2        3.037
      energy      1             3.74   3.26     3.03
    • 28. Evaluation overview (continued). For effectiveness in finding the datasets: (1) a user study against three other search engines; (2) an evaluation with CKAN as the baseline.
    • 29. Evaluation with CKAN as the baseline:
      Term        P      R1     F1     R2     F2
      music       0.286  1      0.445  0.1    0.148
      artist      0.4    1      0.571  0.2    0.267
      biology     0.125  1      0.222  0.333  0.182
      animal      0      0      n/a    0      n/a
      geology     0      0      n/a    0      n/a
      drug        0.6    0.667  0.632  0.75   0.667
      gene        0.333  1      0.5    0.125  0.182
      university  0.5    1      0.667  0.051  0.093
      food        0      0      n/a    0      n/a
      language    1      1      1      0.045  0.0861
      spacecraft  1      1      1      1      1
      conference  1      1      1      0.125  0.222
      astronaut   1      1      1      1      1
      composer    0.25   1      0.4    0.5    0.333
      tv program  0      0      n/a    0      n/a
      instrument  0      1      0      1      0
      recipe      0      1      0      1      0
      student     1      0      0      0      0
      phenotypes  1      0      0      0      0
      energy      1      0      0      0      0
    • 30. Evaluation overview (continued). For effectiveness in finding the datasets: (3) evaluation of both CKAN and our approach against a manually curated gold standard.
    • 31. Evaluation using a manually curated gold standard (P = precision, R = recall, F = F-measure):
      Term        CKAN P  CKAN R  CKAN F  Ours P  Ours R  Ours F
      music       1       0.5     0.667   0.571   1       0.727
      artist      1       0.25    0.4     0.8     1       0.9
      biology     1       0.2     0.333   0.625   1       0.769
      animal      0       0       n/a     0.333   1       0.5
      geology     0       0       n/a     1       0.5     0.667
      drug        1       0.6     0.75    1       1       1
      gene        1       0.333   0.5     1       1       1
      university  0.5     0.667   0.572   0.6     1       0.75
      food        0       0       n/a     0.25    1       0.4
      language    1       1       1       1       1       1
      spacecraft  1       1       1       1       1       1
      conference  1       1       1       1       1       1
      tv program  0       0       n/a     1       1       1
      instrument  1       0       0       0.75    1       0.857
      astronaut   1       1       1       1       1       1
      composer    1       0.25    0.4     1       1       1
      recipe      1       0       0       1       1       1
      phenotypes  1       1       1       1       0       0
      student     1       0.5     0.667   1       0       0
      energy      1       0.333   0.5     1       0       0
      Mean        0.775   0.432   0.489   0.846   0.825   0.728
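      For reference, a minimal sketch of the standard precision, recall, and F-measure used in the tables above, computed from the sets of retrieved and relevant datasets for a single term; the example sets are placeholders, not the study's data.

      def precision_recall_f1(retrieved, relevant):
          tp = len(retrieved & relevant)
          precision = tp / len(retrieved) if retrieved else 0.0
          recall = tp / len(relevant) if relevant else 0.0
          f1 = (2 * precision * recall / (precision + recall)
                if precision + recall > 0 else 0.0)
          return precision, recall, f1

      retrieved = {"ds1", "ds2", "ds3", "ds4"}   # datasets returned for a term
      relevant = {"ds2", "ds3"}                  # datasets judged relevant (gold standard)
      print(precision_recall_f1(retrieved, relevant))   # (0.5, 1.0, 0.666...)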
    • 32. Conclusion and Future Work. Our approach helps to systematically categorize datasets, and we demonstrated the potential of using this categorization for finding relevant datasets by utilizing a diverse classification hierarchy such as Freebase. There are other potential applications for which this work may be important, such as browsing, interlinking, and querying. We plan to improve the domain coverage by using further knowledge sources such as Wikipedia, and to compare the interpretations given by multiple knowledge sources to see which yields a better interpretation.
    • 33. Thank You! Questions? http://knoesis.wright.edu/researchers/sarasi sarasi@knoesis.org Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, Ohio, USA