Domain Identification for Linked Open Data
Sarasi Lalithsena
Pascal Hitzler
Amit Sheth
Kno.e.sis Center
Wright State University, Dayton, OH

Prateek Jain
IBM T.J. Watson Research Center
Yorktown, NY, USA

WI 2013 Atlanta, GA, USA
Motivation

lod cloud
262 datasets

870 alive datasets

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

2
Motivation

Lingvoj
Climbdata

Need better ways to discover, describe, and organize datasets

3
Problem
• How do we identify the relevant datasets from this structured
knowledge space?
– How do we create a registry of topics which describe the
domain of a dataset?

4
State of the Art - CKAN
• To organize this large cloud, CKAN encourages users to
tag their datasets into the following domains
- media
- geography
- life sciences
- publications
- government
- e-commerce
- social web
- user generated content
- schemata
- cross-domain
• CKAN administrators then manually review these tags
and organize the diagram
• CKAN provides dataset search based on these manual
tags and keywords

5
State of the Art - CKAN
But,
• A fixed set of tags cannot cope with the increasing diversity of the
datasets
– For example, what would the tags be for the Lingvoj dataset?
• The manual review process will soon be unsustainable
• Classification is subjective

6
State of the Art – LODStats

• Stream-based approach for collecting dataset statistics
• Allows searching for datasets based on keywords and
metadata provided by data publishers

7
State of the Art – Other
• Semantic Search Engines (SSEs)
– SSEs such as Sigma, Swoogle, and Watson let users search
for instances and return the related instance URIs
– But they are not designed for dataset search
• Federated Querying systems on LOD datasets
– Need to know seed URIs to find the relevant datasets

8
State of the Art – Existing Problems with Dataset Lookup
• Rely on manual tagging provided by users and the manual
reviewing process
• Rely on keywords and metadata provided by users
• Need to know seed URIs to find the relevant datasets
• Need to know instances to start exploring the datasets

9
What do we propose?
• Introduce a systematic and sophisticated way to identify
possible domains, topics, and tags (Topic Domains) to better describe
these datasets
• What can these topic domains be?
– A predefined list
– The schema types of each dataset

10
What do we propose?

Knowledge bases + category system

Topic Domains

11
How do we address the previous problems?
• Use the category system of existing knowledge sources as the
vocabulary to describe the domain
– Does not need to rely on a predefined set of tags
– Does not need to rely on metadata and keywords
• Automatic way to identify the topic domains
• Vocabulary can be used to search the datasets and organize the
datasets

12
Our approach - Freebase
• Use Freebase as our knowledge source to identify the topic
domains
• Why Freebase?
– Wide coverage: has 39 million topics
– Simple category hierarchy system
• The Freebase category system categorizes each topic into types, and
types are grouped into domains
– Example: music (domain) > Artist (type)

• We use Freebase types and domains as our topic domains

13
Our Approach - STEPS

1. Instance Identification
2. Category Hierarchy Creation
3. Category Hierarchy Merging
4. Candidate Category Hierarchy Selection
5. Frequency Count Generation

14
Our Approach
STEP 1 Instance Identification
– Extract the instances of the dataset along with their types
– Extract the human-readable values of the instances and types,
e.g. Granite and its type Rock
– Identify the closely related Freebase instance for
each instance in our dataset

Example instances (instance, type):
Ignimbrite, Rock
Slate, Rock
Granite, Rock

Matched Freebase instances:
http://www.freebase.com/m/01tx7r
http://www.freebase.com/m/01c_9j
http://www.freebase.com/m/03fcm

15
Our Approach
• Instance Identification
We also attach the type information to the query string

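[Figure: the query "Apple" alone can match both Apple (company) and Apple (fruit); the query "Apple Fruit", with the type attached, matches Apple (fruit).]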
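
A minimal sketch of how such a query string might be built, assuming labels can be read from the URI itself (many datasets would instead supply a label property such as rdfs:label). The function names and the example.org URIs are hypothetical, and the actual Freebase lookup is deliberately left out.

```python
from urllib.parse import unquote, urlparse


def readable_label(uri: str) -> str:
    """Derive a human-readable label from a URI's fragment or last path segment."""
    parsed = urlparse(uri)
    local = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local).replace("_", " ").strip()


def build_query(instance_uri: str, type_uri: str) -> str:
    """Append the type label to the instance label to disambiguate the lookup."""
    return f"{readable_label(instance_uri)} {readable_label(type_uri)}".strip()


# Hypothetical URIs, used only for illustration.
query = build_query(
    "http://example.org/resource/Apple",
    "http://example.org/ontology#Fruit",
)
print(query)
```

Attaching the type keeps an ambiguous label such as "Apple" from matching the wrong entity during the lookup.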

16
Our Approach
• STEP 2 Category Hierarchy Creation
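[Figure: each Freebase type ID of the form /{domain}/{type}, e.g. /geology/rock_type for Ignimbrite, becomes a small domain > type hierarchy.
Ignimbrite: geology > rock type; geography > mountain; geography > mountain range
slate: geology > rock type; geography > mountain; music > release track; music > recording
granite: geology > rock type; geography > mountain]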
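
A minimal sketch of this step, assuming every type ID has exactly the two-segment form /domain/type. Only /geology/rock_type appears verbatim on the slide; the other two IDs are illustrative guesses modeled on the figure, and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_type_id(type_id: str) -> Tuple[str, str]:
    """Split a '/domain/type' identifier into readable (domain, type) names."""
    parts = [p for p in type_id.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"expected '/domain/type', got {type_id!r}")
    domain, type_name = parts
    return domain.replace("_", " "), type_name.replace("_", " ")


def create_hierarchies(type_ids: List[str]) -> List[Tuple[str, str]]:
    """One small domain > type hierarchy per type ID of a matched instance."""
    return [parse_type_id(t) for t in type_ids]


# Illustrative type IDs for the Ignimbrite example.
ignimbrite_types = [
    "/geology/rock_type",
    "/geography/mountain",
    "/geography/mountain_range",
]
ignimbrite_paths = create_hierarchies(ignimbrite_types)
print(ignimbrite_paths)
```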

17
Our Approach
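• STEP 3 Category Hierarchy Merging
[Figure: for each instance, hierarchies that share a domain are merged into one tree.
Ignimbrite: geology > {rock type}; geography > {mountain, mountain range}
slate: geology > {rock type}; geography > {mountain}; music > {release track, recording}
granite: geology > {rock type}; geography > {mountain}]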
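
A minimal sketch of the merge, assuming each hierarchy is a (domain, type) pair and that merging means grouping types under their shared domain. The data structure and function name are hypothetical; the slate values are read from the example figure.

```python
from typing import Dict, Iterable, Set, Tuple


def merge_hierarchies(paths: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (domain, type) pairs into a single domain -> types tree."""
    merged: Dict[str, Set[str]] = {}
    for domain, type_name in paths:
        merged.setdefault(domain, set()).add(type_name)
    return merged


# Hierarchies for the slate example.
slate_paths = [
    ("geology", "rock type"),
    ("geography", "mountain"),
    ("music", "release track"),
    ("music", "recording"),
]
slate_tree = merge_hierarchies(slate_paths)
print(slate_tree)
```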

18
Our Approach
• STEP 4 Candidate Category Hierarchy Selection
Filter out insignificant category hierarchies using simple
heuristics
[Figure: the merged hierarchies for Ignimbrite, slate, and granite from the previous step, with insignificant hierarchies filtered out.]
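
The slides do not say which heuristic is used. Purely as an illustration, the sketch below assumes one hypothetical rule: keep a domain only if it appears for at least a given fraction of the instances. The `min_support` threshold and the function name are assumptions, not the authors' method.

```python
from collections import Counter
from typing import Dict, Set

Tree = Dict[str, Set[str]]  # domain -> types


def select_candidates(hierarchies: Dict[str, Tree], min_support: float = 0.5) -> Dict[str, Tree]:
    """Hypothetical filter: drop domains seen in fewer than min_support of the instances."""
    total = len(hierarchies)
    if total == 0:
        return {}
    support = Counter(domain for tree in hierarchies.values() for domain in tree)
    keep = {domain for domain, n in support.items() if n / total >= min_support}
    return {
        instance: {d: set(types) for d, types in tree.items() if d in keep}
        for instance, tree in hierarchies.items()
    }


# Merged hierarchies from the running example.
merged = {
    "Ignimbrite": {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}},
    "slate": {
        "geology": {"rock type"},
        "geography": {"mountain"},
        "music": {"release track", "recording"},
    },
    "granite": {"geology": {"rock type"}, "geography": {"mountain"}},
}
candidates = select_candidates(merged)
print(candidates["slate"])
```

Under this assumed rule, the music hierarchy is dropped for slate because only one of the three instances carries it.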

19
Our Approach
• STEP 5 Frequency Count Generation
Count the number of occurrences of each category (the number of
instances having the given category)

Term             Frequency   Parent Node
geology          3
rock type        3           geology
mountain range   1           geography
…                …           …
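
A minimal sketch of the counting, assuming the selected hierarchies are stored per instance as domain -> types and that a domain has no parent node. The input reuses the running rock example; the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

Tree = Dict[str, Set[str]]  # domain -> types


def frequency_counts(hierarchies: Dict[str, Tree]) -> Counter:
    """Count, per (term, parent) pair, how many instances carry that category."""
    counts: Counter = Counter()
    for tree in hierarchies.values():
        for domain, types in tree.items():
            counts[(domain, None)] += 1          # domains sit at the top level
            for type_name in types:
                counts[(type_name, domain)] += 1
    return counts


selected = {
    "Ignimbrite": {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}},
    "slate": {"geology": {"rock type"}, "geography": {"mountain"}},
    "granite": {"geology": {"rock type"}, "geography": {"mountain"}},
}
counts = frequency_counts(selected)
for (term, parent), n in counts.most_common():
    print(term, n, parent or "")
```

Because each instance stores a set of types per domain, a category is counted at most once per instance, which matches the "number of instances having the given category" wording.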

20
Implementation
• MapReduce Deployment
[Figure: <instance, type> pairs are distributed across mappers (Map 1 … Map n), which perform Steps 2 and 3; reducers (Reducer 1 … Reducer m) perform Step 4; Step 5 runs as post-processing.]
Instances that belong to the same type go to a single reducer
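
The data flow can be imitated in plain Python, without any MapReduce framework, to show why keying the mapper output by type sends all instances of one type to the same reducer. This is a sketch under assumptions: the record layout, the function names, and the "Everest" record are hypothetical, and the reducer only lists the instances it receives instead of performing Step 4.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Record = Tuple[str, str, List[str]]  # (instance, type, Freebase-style type IDs)
Tree = Dict[str, Set[str]]           # domain -> types


def mapper(record: Record) -> Iterator[Tuple[str, Tuple[str, Tree]]]:
    """Steps 2 and 3: build the merged hierarchy, then emit it keyed by instance type."""
    instance, inst_type, type_ids = record
    tree: Tree = {}
    for type_id in type_ids:
        domain, type_name = type_id.strip("/").split("/")
        tree.setdefault(domain, set()).add(type_name.replace("_", " "))
    yield inst_type, (instance, tree)


def shuffle(pairs: List[Tuple[str, Tuple[str, Tree]]]) -> Dict[str, List[Tuple[str, Tree]]]:
    """Group mapper output by key, as the framework would before the reduce phase."""
    groups: Dict[str, List[Tuple[str, Tree]]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(inst_type: str, values: List[Tuple[str, Tree]]) -> Tuple[str, List[str]]:
    """Stand-in for Step 4: report which instances arrived at this reducer."""
    return inst_type, sorted(instance for instance, _ in values)


records: List[Record] = [
    ("Ignimbrite", "Rock", ["/geology/rock_type", "/geography/mountain_range"]),
    ("Slate", "Rock", ["/geology/rock_type", "/music/recording"]),
    ("Granite", "Rock", ["/geology/rock_type"]),
    ("Everest", "Mountain", ["/geography/mountain"]),  # hypothetical extra record
]
pairs = [kv for record in records for kv in mapper(record)]
results = dict(reducer(key, values) for key, values in shuffle(pairs).items())
print(results)
```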

21
Evaluation
• We ran our experiments with 30 datasets in LOD for evaluation

Evaluation
Appropriateness of the identified
domain

Effectiveness in finding the datasets

User Study

22
Evaluation: Appropriateness of the identified domain
• Selected four high-frequency domains and types from our results
• Mixed them with four other randomly selected domains and types
• Asked 20 users to select the terms that best represent the
higher-level domains of the dataset

50% of the users agreed on 73% of the terms (88 out of 120)

23
Evaluation: Appropriateness of the identified domain

Terms with the highest user agreement for each dataset. A star (*) indicates that the term was also the highest ranked by our system (for 22 datasets).

24
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

25
Evaluation – Effectiveness in finding the datasets
• Developed a search application using the normalized frequency
count
• User study comparing against three existing state-of-the-art systems
– CKAN, LODStats, and Sigma
• Term selection
• Top ten results are retrieved
• Asked users to rank which set of results they preferred
– 1 (best) to 4 (worst)
• Calculated a user preference score using a weighted average
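
The slides do not spell out the weighting. One plausible reading, sketched below as an assumption, is the average rank weighted by the number of users who assigned each rank. The vote counts are made up for illustration and are not taken from the study.

```python
from typing import Mapping


def preference_score(rank_votes: Mapping[int, int]) -> float:
    """Weighted average rank: sum(rank * users_at_rank) / total users."""
    total = sum(rank_votes.values())
    if total == 0:
        raise ValueError("no votes recorded")
    return sum(rank * n for rank, n in rank_votes.items()) / total


# Hypothetical votes for one system on one term: rank -> number of users.
votes = {1: 5, 2: 10, 3: 3, 4: 2}
score = preference_score(votes)
print(round(score, 3))
```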

26
Evaluation ….
Term          Our Approach   CKAN    LODStats   Sigma
music         2.037          3.74    3.11       1.333
artist        2.815          3.926   1          2.259
biology       3.481          3.333   1          2.185
animal        2.926          1.63    3.481      1.926
geology       2.852          3.666   1          2.481
drug          2.926          3.148   2          2.555
gene          2.148          3.333   3.074      1.222
university    3.185          3.148   2.37       1.222
food          3.259          2.296   3          1.259
language      3.148          3.74    1          2.11
spacecraft    4              4       1          2
conference    2.814          3.555   1          2.666
astronaut     4              4       1          2
composer      3.815          3.037   1          2.11
tv program    3.666          2.923   1          2.370
instrument    3.852          2       2          3.148
recipe        3.926          2       2          3.074
student       2              3.889   2          3.111
phenotypes    2              3.923   2          3.037
energy        1              3.74    3.26       3.03
28
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

2. Evaluate CKAN as the baseline

29
Evaluate CKAN as the baseline
Term          P       R1      F1      R2      F2
music         0.286   1       0.445   0.1     0.148
artist        0.4     1       0.571   0.2     0.267
biology       0.125   1       0.222   0.333   0.182
animal        0       0       n/a     0       n/a
geology       0       0       n/a     0       n/a
drug          0.6     0.667   0.632   0.75    0.667
gene          0.333   1       0.5     0.125   0.182
university    0.5     1       0.667   0.051   0.093
food          0       0       n/a     0       n/a
language      1       1       1       0.045   0.0861
spacecraft    1       1       1       1       1
conference    1       1       1       0.125   0.222
astronaut     1       1       1       1       1
composer      0.25    1       0.4     0.5     0.333
tv program    0       0       n/a     0       n/a
instrument    0       1       0       1       0
recipe        0       1       0       1       0
student       1       0       0       0       0
phenotypes    1       0       0       0       0
energy        1       0       0       0       0
31
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

2. Evaluate CKAN as the baseline
3. Evaluate both CKAN and our
approach using a manually curated
gold standard

34
Evaluation using a manually curated gold standard

              CKAN                              Our Approach
Term          Precision   Recall   F-Measure    Precision   Recall   F-Measure
music         1           0.5      0.667        0.571       1        0.727
artist        1           0.25     0.4          0.8         1        0.9
biology       1           0.2      0.333        0.625       1        0.769
animal        0           0        n/a          0.333       1        0.5
geology       0           0        n/a          1           0.5      0.667
drug          1           0.6      0.75         1           1        1
gene          1           0.333    0.5          1           1        1
university    0.5         0.667    0.572        0.6         1        0.75
food          0           0        n/a          0.25        1        0.4
language      1           1        1            1           1        1
spacecraft    1           1        1            1           1        1
conference    1           1        1            1           1        1
tv program    0           0        n/a          1           1        1
instrument    1           0        0            0.75        1        0.857
astronaut     1           1        1            1           1        1
composer      1           0.25     0.4          1           1        1
recipe        1           0        0            1           1        1
phenotypes    1           1        1            1           0        0
student       1           0.5      0.667        1           0        0
energy        1           0.333    0.5          1           0        0
Mean          0.775       0.432    0.489        0.846       0.825    0.728
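
For reference, precision, recall, and F-measure for a single search term can be computed as in the sketch below. It assumes the standard set-based definitions and the balanced F-measure, and returns no F value when precision and recall are both zero (shown as n/a in the table). The dataset identifiers d1 to d7 are hypothetical.

```python
from typing import Optional, Set, Tuple


def precision_recall_f(
    retrieved: Set[str], relevant: Set[str]
) -> Tuple[float, float, Optional[float]]:
    """Set-based precision and recall, plus the balanced F-measure (None if undefined)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, None
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: 7 datasets returned, 4 relevant in the gold standard, all 4 found.
retrieved = {f"d{i}" for i in range(1, 8)}
relevant = {"d1", "d2", "d3", "d4"}
p, r, f = precision_recall_f(retrieved, relevant)
print(round(p, 3), r, round(f, 3))
```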
36
Conclusion and Future Work
• Our approach is helpful for systematically categorizing the
datasets
• Demonstrated the potential of using the categorization to find
relevant datasets
• Utilized a diverse classification hierarchy such as Freebase
• This work may also be important for other applications
such as browsing, interlinking, and querying
• Plan to improve the domain coverage by using knowledge
sources such as Wikipedia
• Compare the interpretations given by multiple knowledge sources
to see which one gives a better interpretation

37
Thank You!

Questions?
http://knoesis.wright.edu/researchers/sarasi
sarasi@knoesis.org

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA

Video Streaming: Then, Now, and in the Future
 

Domain Identification for Linked Open Data

  • 1. Domain Identification for Linked Open Data. Sarasi Lalithsena, Pascal Hitzler, Amit Sheth (Kno.e.sis Center, Wright State University, Dayton, OH); Prateek Jain (IBM T.J. Watson Research Center, Yorktown, NY, USA). WI 2013, Atlanta, GA, USA
  • 2. Motivation: LOD cloud, 262 datasets; 870 alive datasets. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/”
  • 3. Motivation: Lingvoj, Climbdata. Need better ways to discover, describe and organize datasets
  • 4. Problem • How do we identify the relevant datasets in this structured knowledge space? – How do we create a registry of topics that describe the domain of a dataset?
  • 5. State of the Art - CKAN • To organize this large cloud, CKAN encourages users to tag their datasets with the following domains: media, geography, life sciences, publications, government, e-commerce, social web, user generated content, schemata, cross-domain • CKAN administrators then manually go through these tags and organize the diagram • CKAN provides a search for datasets based on these manual tags and keywords
  • 6. State of the Art - CKAN But, • A fixed set of tags can’t cope with the increasing diversity of the datasets – For example, what would the tags be for the Lingvoj dataset? • The manual reviewing process will soon be unsustainable • Classification is subjective
  • 7. State of the Art - LODStats • Stream-based approach to collecting statistics about datasets • Allows searching for datasets based on keywords and metadata provided by data publishers
  • 8. State of the Art – Other • Semantic Search Engines (SSEs) – SSEs such as Sigma, Swoogle and Watson allow searching for instances and return the related instance URIs – But they are not designed for dataset search • Federated querying systems on LOD datasets – Need to know seed URIs to find the relevant datasets
  • 9. State of the Art – Existing Problems with dataset lookup • Rely on manual tagging provided by users and the manual reviewing process • Rely on keywords and metadata provided by users • Need to know seed URIs to find the relevant datasets • Need to know instances to start exploring the datasets
  • 10. What do we propose? • Introduce a systematic and sophisticated way to identify possible domains, topics and tags (topic domains) to better describe these datasets • What can these topic domains be? – A predefined list – The schema types of each dataset
  • 11. What do we propose? Knowledge bases + category system → topic domains
  • 12. How do we address the previous problems? • Use the category system of existing knowledge sources as the vocabulary to describe the domain – No need to rely on a predefined set of tags – No need to rely on metadata and keywords • An automatic way to identify the topic domains • The vocabulary can be used to search and organize the datasets
  • 13. Our approach - Freebase • Use Freebase as our knowledge source to identify the topic domains • Why Freebase? – Wide coverage: has 39 million topics – Simple category hierarchy system • The Freebase category system categorizes each topic into types, and types are grouped into domains (example: domain music, type Artist) • We use Freebase types and domains as our topic domains
  • 14. Our Approach - STEPS 1. Instance Identification 2. Category Hierarchy Creation 3. Category Hierarchy Merging 4. Candidate Category Hierarchy Selection 5. Frequency Count Generation
  • 15. Our Approach • STEP 1 Instance Identification – Extract the instances of the dataset with their types – Extract the human-readable values of the instances and types, e.g. Granite and its type Rock – Identify the closely related Freebase instance for each instance in our dataset. Example: (Ignimbrite, Rock), (Slate, Rock), (Granite, Rock); http://www.freebase.com/m/01tx7r, http://www.freebase.com/m/01c_9j, http://www.freebase.com/m/03fcm
  • 16. Our Approach • Instance Identification: we also attach the type information to the query string (example: Apple becomes Apple Company or Apple Fruit)
  • 17. Our Approach • STEP 2 Category Hierarchy Creation: Freebase type IDs have the form /{domain}/{type}, e.g. Ignimbrite has /geology/rock_type [Figure: category hierarchies for Ignimbrite (geology: rock type; geography: mountain, mountain range), Slate (geology: rock type; geography: mountain; music: release track, recording) and Granite (geology: rock type; geography: mountain)]
  • 18. Our Approach • Category Hierarchy Merging [Figure: the category hierarchies of Ignimbrite, Slate and Granite over the domains geology, geography and music]
  • 19. Our Approach • Candidate Category Hierarchy Selection: filter out insignificant category hierarchies using a simple heuristic [Figure: the category hierarchies of Ignimbrite, Slate and Granite over the domains geology, geography and music]
  • 20. Our Approach • Frequency Count Generation: count the number of occurrences of each category (the number of instances having the given category)

      Term             Frequency   Parent Node
      geology          3
      rock type        3           geology
      mountain range   1           geography
      …                …           …
  • 21. Implementation • MapReduce deployment [Figure: <Inst, type> pairs are fed to mappers Map 1 … Map n (Steps 2 and 3), then to reducers Reducer 1 … Reducer m (Step 4), followed by Step 5 post-processing] Instances belonging to the same type go to a single reducer
  • 22. Evaluation • We ran our experiments on 30 LOD datasets • Evaluation: appropriateness of the identified domain (user study); effectiveness in finding the datasets
  • 23. Evaluation: Appropriateness of the identified domain • Selected the four most frequent domains and types from our results • Mixed them with four other randomly selected domains and types • Asked users to select the terms that best represent the higher-level domains of the dataset – 20 users participated * 50% of the users agreed on 73% of the terms (88 out of 120)
  • 24. Evaluation: Appropriateness of the identified domain. Terms with the highest user agreement for each dataset; a star (*) indicates that the term was also the highest ranked by our system (for 22 datasets)
  • 25. Evaluation • Appropriateness of the identified domain (user study) • Effectiveness in finding the datasets: 1. User study with three other search engines (SE)
  • 26. Evaluation – Effectiveness in finding the datasets • Developed a search application using the normalized frequency count • User study against three existing state-of-the-art systems – CKAN, LODStats and Sigma • Term selection • The top ten results are retrieved • Asked users to rank which set of results they preferred – 1 (best) to 4 (worst) • Calculated a user preference score using a weighted average
  • 28. Evaluation • Appropriateness of the identified domain (user study) • Effectiveness in finding the datasets: 1. User study with three other search engines (SE) 2. Evaluate CKAN as the baseline
  • 29. Evaluate CKAN as the baseline

      Term         P       R1      F1      R2      F2
      music        0.286   1       0.445   0.1     0.148
      artist       0.4     1       0.571   0.2     0.267
      biology      0.125   1       0.222   0.333   0.182
      animal       0       0       n/a     0       n/a
      geology      0       0       n/a     0       n/a
      drug         0.6     0.667   0.632   0.75    0.667
      gene         0.333   1       0.5     0.125   0.182
      university   0.5     1       0.667   0.051   0.093
      food         0       0       n/a     0       n/a
      language     1       1       1       0.045   0.0861
      spacecraft   1       1       1       1       1
      conference   1       1       1       0.125   0.222
      astronaut    1       1       1       1       1
      composer     0.25    1       0.4     0.5     0.333
      tv program   0       0       n/a     0       n/a
      instrument   0       1       0       1       0
      recipe       0       1       0       1       0
      student      1       0       0       0       0
      phenotypes   1       0       0       0       0
      energy       1       0       0       0       0
  • 30. Evaluation • Appropriateness of the identified domain (user study) • Effectiveness in finding the datasets: 1. User study with three other search engines (SE) 2. Evaluate CKAN as the baseline 3. Evaluate both CKAN and our approach using a manually curated gold standard
  • 31. Evaluation using a manually curated gold standard

                   CKAN                              Our Approach
      Term         Precision   Recall   F-Measure   Precision   Recall   F-Measure
      music        1           0.5      0.667       0.571       1        0.727
      artist       1           0.25     0.4         0.8         1        0.9
      biology      1           0.2      0.333       0.625       1        0.769
      animal       0           0        n/a         0.333       1        0.5
      geology      0           0        n/a         1           0.5      0.667
      drug         1           0.6      0.75        1           1        1
      gene         1           0.333    0.5         1           1        1
      university   0.5         0.667    0.572       0.6         1        0.75
      food         0           0        n/a         0.25        1        0.4
      language     1           1        1           1           1        1
      spacecraft   1           1        1           1           1        1
      conference   1           1        1           1           1        1
      tv program   0           0        n/a         1           1        1
      instrument   1           0        0           0.75        1        0.857
      astronaut    1           1        1           1           1        1
      composer     1           0.25     0.4         1           1        1
      recipe       1           0        0           1           1        1
      phenotypes   1           1        1           1           0        0
      student      1           0.5      0.667       1           0        0
      energy       1           0.333    0.5         1           0        0
      Mean         0.775       0.432    0.489       0.846       0.825    0.728
  • 32. Conclusion and Future Work • Our approach helps categorize the datasets systematically • Demonstrates the potential of using the categorization to find relevant datasets • Utilizes a diverse classification hierarchy such as Freebase • There are other potential applications for which this work might be important, such as browsing, interlinking and querying • Plan to improve the domain coverage by using knowledge sources such as Wikipedia • Compare the interpretations given by multiple knowledge sources to see which one gives a better interpretation
  • 33. Thank You! Questions? http://knoesis.wright.edu/researchers/sarasi sarasi@knoesis.org Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, Ohio, USA
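Steps 2, 3 and 5 of the approach (slides 17 to 20) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the input mapping is read off the slide figures (only /geology/rock_type appears verbatim; the other type IDs are assumed to follow the same /domain/type form), the Step 4 filter is omitted because the slides do not specify the heuristic, and dividing by the number of instances is only one assumed way to obtain a normalized frequency count.

```python
from collections import Counter

# Illustrative input: instance label -> Freebase-style "/domain/type" IDs.
# Only "/geology/rock_type" is quoted on the slides; the other IDs are assumed.
INSTANCE_TYPES = {
    "Ignimbrite": ["/geology/rock_type", "/geography/mountain",
                   "/geography/mountain_range"],
    "Slate": ["/geology/rock_type", "/geography/mountain",
              "/music/release_track", "/music/recording"],
    "Granite": ["/geology/rock_type", "/geography/mountain"],
}


def category_hierarchy(type_ids):
    """Step 2 (sketch): group one instance's types under their domains."""
    hierarchy = {}
    for type_id in type_ids:
        domain, type_name = type_id.strip("/").split("/", 1)
        hierarchy.setdefault(domain, set()).add(type_name)
    return hierarchy


def frequency_counts(instance_types):
    """Steps 3 and 5 (sketch): merge hierarchies, count instances per category.

    Domains are keyed by name, types by (domain, type) so that the parent
    node of every type stays recoverable from the key.
    """
    counts = Counter()
    for type_ids in instance_types.values():
        for domain, type_names in category_hierarchy(type_ids).items():
            counts[domain] += 1  # an instance counts once per domain
            for type_name in type_names:
                counts[(domain, type_name)] += 1
    return counts


def normalized(counts, n_instances):
    """Assumed normalization: share of instances carrying each category."""
    return {category: c / n_instances for category, c in counts.items()}


counts = frequency_counts(INSTANCE_TYPES)
shares = normalized(counts, len(INSTANCE_TYPES))
```

On this toy input the counts agree with the rows shown on slide 20: geology and rock type occur for all three instances, mountain range for one.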

Editor's Notes

  1. Outdated cloud diagram – last updated in 2011
  2. Wikipedia has 4.3 million articles
  3. CKAN ranked best for 12 terms while our approach ranked best for 9 terms
  4. CKAN ranked best for 12 terms while our approach ranked best for 9 terms
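The F columns in the tables on slides 29 and 31 appear consistent with the standard F-measure, the harmonic mean of precision and recall. A small sketch under that assumption follows; the function name is hypothetical, and returning None when both inputs are zero simply mirrors the n/a entries in the tables.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1).

    Returns None when both values are zero, where the score is undefined;
    this mirrors the "n/a" cells in the slide tables (an assumption).
    """
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Check against the "music" row of the CKAN baseline table: P = 0.286, R1 = 1.
music_f1 = round(f_measure(0.286, 1), 3)
```

A row with precision 1 and recall 0 (for example student on slide 29) evaluates to 0 rather than n/a, which also matches the tables.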