Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques (Prateek Jain dissertation defense, Kno.e.sis, Wright State University)
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can “understand and satisfy the requests of people and machines to use the web content” – i.e., the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize many of the benefits promised. If this limitation is left unaddressed, the LOD Cloud will merely be more data that suffers from the same kinds of problems that plague the Web of Documents, and the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to the problem of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between the classes and properties of ontologies. Between classes we identify subsumption, equivalence, and part-of relationships; between instances, part-of relationships; and between properties, subsumption and equivalence relationships. By bootstrapping we mean utilizing the information already contained within the datasets to improve the data itself. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, provide evidence of the feasibility and applicability of the solution.
Structured data on the Web, frequently referred to as knowledge graphs, consists of a large number of datasets representing diverse domains. Widely used commercial applications such as entity recommendation, search, question answering, and knowledge discovery use these knowledge graphs as their knowledge source. The majority of these applications have a particular domain of interest and hence require only the segment of the Web of data representing that domain (e.g., movies, biomedicine, sports). In fact, leveraging the entire Web of data for a domain-specific application is not only computationally intensive; the irrelevant portion also negatively impacts the accuracy of the application. Hence, finding the relevant portion of the Web of data for domain-specific applications has become a paramount issue. Identifying the relevant portion of the Web of data consists of two sub-tasks: 1) finding the relevant datasets that contain knowledge on the domain of interest, and 2) extracting the subgraph representing the domain of interest from knowledge graphs that represent multiple domains (e.g., DBpedia, YAGO, Freebase). In this talk, I will discuss both data-driven and knowledge-driven approaches to solving these two sub-tasks. The domain-specific subgraphs extracted by our approach were 80% smaller, in terms of the number of paths, than the original KG and resulted in a more than tenfold reduction in the computational time required for domain-specific tasks, yet produced better accuracy on domain-specific applications. We believe this work can contribute significantly to utilizing knowledge graphs for domain-specific applications, especially given the explosive growth in the creation of knowledge graphs.
Video of the talk: https://www.youtube.com/watch?v=7k-u_TUew3o
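As a rough illustration of the subgraph-extraction sub-task, the sketch below does a bounded-hop traversal from domain seed entities, keeping only edges whose endpoints pass a domain-relevance test. The triples, seeds, and relevance check are invented for the example; the talk's actual approach is more sophisticated.

```python
from collections import deque

def extract_domain_subgraph(triples, seed_entities, is_relevant, max_hops=2):
    """Bounded-hop traversal from domain seed entities, keeping only
    edges whose endpoints pass a domain-relevance test."""
    # Build an adjacency index over (subject, predicate, object) triples.
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((p, o))
        adj.setdefault(o, []).append((p, s))  # treat the graph as undirected

    visited = set(seed_entities)
    frontier = deque((e, 0) for e in seed_entities)
    subgraph = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for pred, nbr in adj.get(node, []):
            if not is_relevant(nbr):
                continue  # prune nodes outside the target domain
            subgraph.append((node, pred, nbr))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return subgraph

# Toy example: keep only the movie-domain neighborhood of "Gravity".
triples = [("Gravity", "directedBy", "Alfonso_Cuaron"),
           ("Gravity", "starring", "Sandra_Bullock"),
           ("Sandra_Bullock", "bornIn", "Arlington_Virginia")]
movie_nodes = {"Gravity", "Alfonso_Cuaron", "Sandra_Bullock"}
print(extract_domain_subgraph(triples, ["Gravity"], movie_nodes.__contains__))
```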
Abstract: Social media has experienced immense growth in recent times. These platforms are becoming increasingly common for information seeking and consumption, and as part of their growing popularity, information overload poses a significant challenge to users. For instance, Twitter alone generates around 500 million tweets per day, and it is impractical for users to parse through such an enormous stream to find information that interests them. This situation necessitates efficient personalized filtering mechanisms for users to consume relevant, interesting information from social media.
Building a personalized filtering system involves understanding users' interests and utilizing these interests to deliver relevant information to users. These tasks primarily involve analyzing and processing social media text, which is challenging due to its short length and the real-time nature of the medium. The challenges include: (1) Lack of semantic context: social media posts are on average short, which provides limited semantic context for textual analysis. This is particularly detrimental to topic identification, a necessary task for mining users' interests. (2) Dynamically changing vocabulary: most social media websites, such as Twitter and Facebook, generate posts that are of current (timely) interest to users. Due to this real-time nature, information relevant to dynamic topics of interest evolves to reflect changes in the real world. This in turn changes the vocabulary associated with these dynamic topics of interest, making it harder to filter relevant information. (3) Scalability: the number of users on social media platforms is so large that it is difficult for centralized systems to scale to deliver relevant information to every user. This dissertation is devoted to exploring semantic techniques and Semantic Web technologies to address the above challenges in building a personalized information filtering system for social media. In particular, the necessary semantics (knowledge) is derived from crowd-sourced knowledge bases such as Wikipedia to improve context for understanding short text and dynamic topics on social media.
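One common way to derive such context, sketched below with invented data, is to expand a short post with Wikipedia concepts whose anchor texts it contains; the anchor dictionary here is a hypothetical stand-in for one distilled from Wikipedia's link structure.

```python
# Hypothetical anchor-text dictionary; a real one would be distilled from
# Wikipedia's internal link structure (anchor text -> linked article).
ANCHORS = {
    "world cup": "FIFA_World_Cup",
    "fed": "Federal_Reserve_System",
    "rate hike": "Interest_rate",
}

def enrich(post, anchors=ANCHORS, max_ngram=3):
    """Expand a short post with Wikipedia concepts whose anchor text it
    contains, giving downstream topic models extra semantic context."""
    tokens = post.lower().split()
    concepts = set()
    for n in range(max_ngram, 0, -1):          # longest phrases first
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in anchors:
                concepts.add(anchors[phrase])
    return concepts

print(enrich("Fed announces another rate hike before the world cup"))
```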
Deep neural networks for matching online social networking profiles (Traian Rebedea)
• Proposed a large dataset for matching online social networking profiles
• This allowed us to train a deep neural network for profile matching using both domain-specific features and word embeddings generated from the textual descriptions in social profiles
• Experiments showed that the NN surpassed both unsupervised and supervised models, achieving high precision (P = 0.95) with a good recall rate (R = 0.85); the combined F1 is computed in the snippet below
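For reference, the reported precision and recall combine into an F1 score of roughly 0.90 (a quick check using the standard harmonic-mean definition):

```python
# Harmonic mean of the reported precision and recall gives the F1 score.
precision, recall = 0.95, 0.85
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.897
```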
Wholi: The right people find each other (at the right time)
Two key elements in this talk:
• PART 1: Machine learning for entity extraction
Natural language processing (NLP), information extraction
• PART 2: Matching profiles using a deep learning classifier
Deep learning, word embeddings
Kalpa Gunaratna's Ph.D. dissertation defense: April 19, 2017
The processing of structured and semi-structured content on the Web has been gaining attention with the rapid progress in the Linking Open Data project and the development of commercial knowledge graphs. Knowledge graphs capture domain-specific or encyclopedic knowledge in the form of a data layer and add rich and explicit semantics on top of the data layer to infer additional knowledge. The data layer of a knowledge graph represents entities and their descriptions. The semantic layer on top of the data layer is called the schema (ontology), where relationships of the entity descriptions, their classes, and the hierarchy of the relationships and classes are defined. Today, there exist large knowledge graphs in the research community (e.g., encyclopedic datasets like DBpedia and YAGO) and the corporate world (e.g., the Google Knowledge Graph) that encapsulate a large amount of knowledge for human and machine consumption. Typically, they consist of millions of entities and billions of facts describing these entities. While it is good to have this much knowledge available on the Web for consumption, it leads to information overload, and hence proper summarization (and presentation) techniques need to be explored.
In this dissertation, we focus on creating both comprehensive and concise entity summaries at: (i) the single entity level and (ii) the multiple entity level. To summarize a single entity, we propose a novel approach called FACeted Entity Summarization (FACES) that considers both the importance of facts, computed by combining popularity and uniqueness, and the diversity of the facts selected for the summary. We first conceptually group facts using semantic expansion and hierarchical incremental clustering techniques and form facets (i.e., groupings) that go beyond syntactic similarity. Then we rank both the facts and the facets using Information Retrieval (IR) ranking techniques and pick the highest-ranked facts from these facets for the summary. The important and unique contribution of this approach is that, because it generates facets, it adds diversity to entity summaries, making them comprehensive. For creating multiple entity summaries, we simultaneously process facts belonging to the given entities using combinatorial optimization techniques. In this process, we maximize the diversity and importance of facts within each entity summary and the relatedness of facts between the entity summaries. The proposed approach uniquely combines semantic expansion, graph-based relatedness, and combinatorial optimization techniques to generate relatedness-based multi-entity summaries.
Complementing the entity summarization approaches, we introduce a novel approach using light Natural Language Processing (NLP) techniques to enrich knowledge graphs by adding type semantics to literals.
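To convey the facet-then-rank intuition behind FACES, here is a toy sketch (not the actual implementation, which uses semantic expansion, hierarchical clustering, and IR-style ranking): facts are grouped into facets, ranked within each facet by a supplied score, and picked round-robin so the summary stays both important and diverse. All names and scores are illustrative.

```python
from collections import defaultdict

def faceted_summary(facts, facet_of, score, k=3):
    """Group facts into facets, rank within each facet, and pick
    round-robin across facets so no single aspect dominates."""
    facets = defaultdict(list)
    for fact in facts:
        facets[facet_of(fact)].append(fact)
    for group in facets.values():
        group.sort(key=score, reverse=True)  # highest-ranked fact first

    summary, rank = [], 0
    while len(summary) < k:
        added = False
        for group in facets.values():
            if rank < len(group) and len(summary) < k:
                summary.append(group[rank])
                added = True
        if not added:
            break  # every facet exhausted
        rank += 1
    return summary

# Toy facts: (predicate, object, importance score).
facts = [("starring", "Gravity", 0.9), ("starring", "Speed", 0.8),
         ("birthPlace", "Arlington", 0.7), ("award", "Academy_Award", 0.6)]
print(faceted_summary(facts, facet_of=lambda f: f[0],
                      score=lambda f: f[2], k=3))
```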
Wimmics Research Team 2015 Activity Report (Fabien Gandon)
Extract of the activity report of the Wimmics joint research team between Inria Sophia Antipolis - Méditerranée and I3S (CNRS and Université Nice Sophia Antipolis). Wimmics stands for web-instrumented man-machine interactions, communities and semantics. The team focuses on bridging social semantics and formal semantics on the web.
Text REtrieval Conference (TREC) Dynamic Domain Track 2015 (Grace Hui Yang)
This is the introductory talk for the TREC Dynamic Domain Track. The Track ran from 2015 to 2017, aiming to evaluate and advance research in dynamic search and domain-specific search. This talk was prepared to introduce the ideas and setups in the upcoming Track to the research community.
DataCite and Campus Data Services
Paul Bracke, Associate Dean for Digital Programs and Information Services, Purdue University
Research libraries are increasingly interested in developing data services for their campuses. There are many perspectives, however, on how to develop services that are responsive to the many needs of scientists; sensitive to the concerns of scientists who are not always accustomed to sharing their data; and that are attractive to campus administrators. This presentation will discuss the development of campus-based data services programs, the centrality of data citation to these efforts, and the ways in which engagement with DataCite can enhance local programs.
HyperMembrane Structures for Open Source Cognitive Computing (Jack Park)
Open source "cognitive computing" systems, specifically OpenSherlock; describes a HyperMembrane structure, a kind of information fabric, for machine reading, literature-based discovery, deep question answering. Platform is open source, uses ElasticSearch, topic maps, JSON, link-grammar parsing, and qualitative process models.
Description: Ajith defended his thesis on application and data portability in cloud computing. More details on Ajith's research and publications can be found at http://knoesis.wright.edu/researchers/ajith/
Video can be found at: http://www.youtube.com/watch?v=oDBeBIIFmHc&list=UUORqXk1ZV44MOwpCorAROyQ&index=1&feature=plpp_video
Cory Henson defended his thesis on "A Semantics-based Approach to Machine Perception".
Video can be found at: http://www.youtube.com/watch?v=L8M7eoGKtSE
Video: https://www.youtube.com/watch?v=ZCToaDgxnAs
Abstract:
People's emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships, and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. This has valuable implications for studies of suicide prevention, employee productivity, people's well-being, customer relationship management, etc. However, emotion identification is quite challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions, and text describing an event or situation that causes an emotion can be devoid of explicit emotion-bearing words; the distinction between different emotions can therefore be very subtle, which makes it difficult to glean emotions purely from keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small, and so fail to provide comprehensive coverage of emotion-triggering events and situations.
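As a small illustration of the self-labeling idea, the sketch below harvests (text, emotion) training pairs from tweets that end in an emotion-bearing hashtag; the hashtag-to-emotion map is illustrative, not the dissertation's actual lexicon.

```python
import re

# Emotion hashtags used as self-labels (distant supervision): a tweet ending
# in "#excited" is treated as a training example for "joy", and so on.
HASHTAG_TO_EMOTION = {"#excited": "joy", "#happy": "joy",
                      "#furious": "anger", "#scared": "fear"}

def self_label(tweets):
    """Turn raw tweets into (text, emotion) pairs by stripping the
    trailing emotion hashtag and using it as the label."""
    labeled = []
    for tweet in tweets:
        for tag, emotion in HASHTAG_TO_EMOTION.items():
            if tweet.lower().rstrip().endswith(tag):
                text = re.sub(re.escape(tag) + r"\s*$", "", tweet,
                              flags=re.I).strip()
                labeled.append((text, emotion))
                break
    return labeled

print(self_label(["Got the job!!! #excited",
                  "Stuck in traffic again #furious"]))
```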
Vahid Taslimitehrani's Dissertation Defense: Friday, February 19, 2015.
Ph.D. Committee: Drs. Guozhu Dong (Advisor), T.K. Prasad, Amit Sheth, Keke Chen, and Jyotishman Pathak (Division of Health Informatics, Weill Cornell Medical College, Cornell University).
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions of the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
We have conducted a set of comprehensive performance studies to evaluate CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high-dimensional datasets. Besides being new types of models, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to classify imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we apply a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in all three applications.
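The following toy sketch illustrates the core pattern-aided idea on synthetic data: fit a baseline model, notice where its errors concentrate, and pair a pattern with a local model for that region. Here the pattern is a hard-coded threshold, whereas CPXR actually mines contrast patterns; scikit-learn and NumPy are assumed available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
# Heterogeneous data: a different linear regime kicks in for x > 5.
y = np.where(X[:, 0] > 5, 3 * X[:, 0] - 10, 0.5 * X[:, 0]) \
    + rng.normal(0, 0.2, 200)

baseline = LinearRegression().fit(X, y)

# "Pattern": the region where the baseline's errors concentrate. CPXR mines
# such regions as contrast patterns; here we hard-code a single threshold.
pattern = X[:, 0] > 5
local = LinearRegression().fit(X[pattern], y[pattern])

def predict(x):
    """Pattern-aided prediction: route matching instances to the local
    model, everything else to the baseline."""
    x = np.atleast_2d(x)
    out = baseline.predict(x)
    use_local = x[:, 0] > 5
    out[use_local] = local.predict(x[use_local])
    return out

print("baseline RMSE:     ", np.sqrt(np.mean((y - baseline.predict(X)) ** 2)))
print("pattern-aided RMSE:", np.sqrt(np.mean((y - predict(X)) ** 2)))
```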
Dissertation Defense:
" Mining and Analyzing Subjective Experiences in User Generated Content "
By Lu Chen
Tuesday, April 9, 2016
Dissertation Committee: Dr. Amit Sheth, Advisor, Dr. T. K. Prasad, Dr. Keke Chen, Dr. Ingmar Weber, and Dr. Justin Martineau
Pictures: https://www.facebook.com/Kno.e.sis/photos/?tab=album&album_id=1225911137443732
Video: https://youtu.be/tzLEUB-hggQ
Lu's Home page: http://knoesis.wright.edu/researchers/luchen/
ABSTRACT
Web 2.0 and social media enable people to create, share, and discover information instantly, anywhere, anytime. A great amount of this information is subjective information -- information about people's subjective experiences, ranging from feelings about what is happening in our daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies to support decision making in areas such as product purchase, marketing strategy, and policy making. However, much useful subjective information is buried in the ever-growing user-generated data on social media platforms, and it is still difficult to extract high-quality subjective information and make full use of it with current technologies.
Current subjectivity and sentiment analysis research has largely focused on classifying text polarity -- whether the expressed opinion regarding a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information, such as emotion, intent, and preference, which may prevent their exploitation from reaching its full potential. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We have identified four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie, or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation makes contributions in developing novel and general techniques for the tasks of identifying and extracting these components.
We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus. Instead of associating the overall sentiment with a given text, this method assesses the more fine-grained target-dependent polarity of each sentiment expression. Unlike pattern-based approaches which often fail to capture the diversity of sentiment expressions due to the informal nature of language usage and writing style in social media posts, the proposed approach is capable of identifying sentiment phrase
Understanding users’ latent intents behind search queries is essential for satisfying their search needs. Search intent mining can help search engines enhance their ranking of search results, enabling new search features like instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, there has been increasing attention on studying how to effectively mine search intents by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of a query (e.g., sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of its impact on the user, and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach to domain-specific search intent mining, with a focus on health-related search queries.
First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies and on analyses of popular health websites, literature surveys, and an empirical study of search queries. We defined the problem of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification tasks (namely, problem transformation and algorithm adaptation methods) were not feasible due to limitations in creating labeled data and health-domain constraints. Another challenge in solving the search intent identification problem was mapping terms used by laymen to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging the rich background knowledge encoded in the Unified Medical Language System (UMLS) and a crowd-sourced encyclopedia (Wikipedia). The approach can identify search intent in a disease-agnostic manner and has been evaluated on three major diseases.
While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, such as public social platforms like Twitter. Although Twitter is an excellent information source, identifying informative tweets within the deluge of tweets is the major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study comparing health search intents, and the features that contribute to the expression of search intent, across 100+ million search queries from smart devices (smartphones/tablets) and personal computers (desktops/laptops).
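A heavily simplified sketch of the semantics-driven, rule-based flavor of intent mining is shown below; the intent classes, trigger phrases, and layman-to-medical mappings are invented stand-ins for what the actual system derives from UMLS and Wikipedia.

```python
# Illustrative rules mapping layman phrasing to hypothetical intent classes.
# The real system grounds terms in UMLS/Wikipedia; this sketch uses a
# hand-written synonym table instead.
LAY_TO_MEDICAL = {"can't sleep": "insomnia", "sugar": "diabetes"}
INTENT_RULES = {
    "symptoms": ["symptom", "signs of", "feel"],
    "treatment": ["treat", "cure", "remedy", "medication for"],
    "cause": ["cause", "why do i"],
}

def classify_intents(query):
    """Multi-label, rule-based intent assignment: normalize lay terms,
    then fire every intent whose trigger phrase appears in the query."""
    q = query.lower()
    for lay, med in LAY_TO_MEDICAL.items():
        q = q.replace(lay, med)
    return {intent for intent, triggers in INTENT_RULES.items()
            if any(t in q for t in triggers)}

print(classify_intents("best medication for sugar"))  # {'treatment'}
print(classify_intents("why do I feel dizzy"))        # {'cause', 'symptoms'}
```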
Social media provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where citizens share information, express opinions, and engage in discussions. Often such an Online Citizen Sensor Community (CSC) has stated or implied goals related to the workflows of organizational actors with defined roles and responsibilities; for example, a community of crisis response volunteers may inform the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in a CSC there are challenges related to information overload for organizational actors, including finding reliable information providers and finding actionable information from citizens. This threatens the awareness and articulation of workflows needed to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in messages both asking for and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professionals) for interpreting the user-generated data of citizen sensors. Interdisciplinary research involving the social and computer sciences is essential to address these socio-technical issues in CSCs and to give organizational actors better access to user-generated data at a higher level of information abstraction. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes a) identification of action-related seeking and offering intent behaviors in short, unstructured text documents using a classification model based on both declarative and statistical knowledge, b) matching of seeking and offering intentions, and c) engagement models of users and groups in the CSC that prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic connections in user interaction networks. The results show a greater improvement in modeling efficiency from the fusion of top-down knowledge-driven and bottom-up data-driven approaches than from conventional bottom-up approaches alone for modeling intent and engagement. Applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement, spreading critical information about prioritized needs to ensure donation of only the required supplies. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth (Advisor), T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY
ABSTRACT:
Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive usage and flexibility of the language, the creativity of human beings, and the social, cultural, and economic changes in daily life have added new constructs, styles, and features to the language. One such feature is its ability to express ideas, opinions, and facts in an implicit manner. This feature is used extensively in day-to-day communication in situations such as: 1) expressing sarcasm, 2) trying to recall forgotten things, 3) conveying descriptive information, 4) emphasizing the features of an entity, and 5) communicating a common understanding.
Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasteron was prescribed.' The tweet has an implicit mention of the entity Gravity, and the clinical text snippet has an implicit mention of the relationship between the medication Dolasteron and the clinical condition nausea. Such implicit references to entities and relationships are common in daily communication, and they add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from tweets.
This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution capable of extracting implicit factual information from text. The developed solution starts by acquiring the knowledge relevant to the implicit information extraction problem, including domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms, such as a text snippet, structured knowledge represented in a standard knowledge representation language like the Resource Description Framework (RDF), or custom formats. Hence, the acquired knowledge is processed to create models that can be understood by machines. Such models provide the infrastructure to perform the implicit information extraction of interest.
This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in each (a toy sketch of the first use case follows the list). They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.
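To make the entity-linking use case concrete, here is a toy sketch (not the dissertation's actual method, which acquires and models domain, contextual, and linguistic knowledge): each candidate entity is given a hand-built bag of context terms, and an implicit mention is resolved to the candidate whose context best overlaps the post. All names and context terms are illustrative.

```python
# Hypothetical context models: each candidate entity is described by a bag
# of terms harvested from background knowledge (e.g., cast, plot keywords).
ENTITY_CONTEXTS = {
    "Gravity_(film)": {"sandra", "bullock", "astronaut", "space", "movie"},
    "The_Blind_Side": {"sandra", "bullock", "football", "movie"},
}

def link_implicit_entity(text, contexts=ENTITY_CONTEXTS):
    """Rank candidates by Jaccard overlap between the post's tokens and each
    entity's context terms; an implicit mention never names the entity, so
    surface matching alone cannot resolve it."""
    tokens = set(text.lower().split())
    jaccard = lambda ctx: len(tokens & ctx) / len(tokens | ctx)
    return max(contexts, key=lambda e: jaccard(contexts[e]))

tweet = "New Sandra Bullock astronaut lost in space movie looks absolutely terrifying"
print(link_implicit_entity(tweet))  # -> Gravity_(film)
```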
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature in this way, influencing innovations in diagnosis, treatment, prevention, and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge, and heuristics to complement distributional and graph-theoretic approaches has serious limitations. ...
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
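For intuition, the classic Swanson-style "open discovery" at the heart of LBD can be sketched in a few lines: find intermediate B-concepts that co-occur with both A and C when A and C never co-occur directly. The toy co-occurrence graph below is hand-made; the dissertation's approach goes much further, using implicit and explicit context to select and interpret such associations.

```python
# Toy co-occurrence graph over MEDLINE-like concepts (an edge means the two
# concepts were mentioned together in some article). Swanson's famous
# fish-oil/Raynaud's discovery followed exactly this A -> B -> C shape.
EDGES = {
    ("fish_oil", "blood_viscosity"), ("fish_oil", "platelet_aggregation"),
    ("blood_viscosity", "raynauds_disease"),
    ("platelet_aggregation", "raynauds_disease"),
    ("fish_oil", "omega_3"),
}

def open_discovery(a, c, edges=EDGES):
    """Return intermediate B-concepts linking A and C that never co-occur
    directly -- the hidden connections LBD tries to surface."""
    neighbors = lambda x: {u if v == x else v for u, v in edges if x in (u, v)}
    if c in neighbors(a):
        return set()  # the connection is already explicit, nothing hidden
    return neighbors(a) & neighbors(c)

print(open_discovery("fish_oil", "raynauds_disease"))
# -> {'blood_viscosity', 'platelet_aggregation'}
```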
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer, Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM), and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate = 19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
There is a rapid intertwining of sensors and mobile devices into the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. A system of sensing and computational components embedded in the physical world is termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations into CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework for moving from observations to decision-making and actions in PCS systems, consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with the uncertainty, complexity, and dynamism involved in translating observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that reflect real-world dependencies accurately. To overcome this limitation, we propose to empower PGMs using declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction; (b) Bayesian Network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding; (c) knowledge-driven piecewise linear approximation of nonlinear time series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding; and (d) transforming knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
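As an illustration of technique (a), here is a minimal sketch of generating CRF training data by projecting a gazetteer of known entities onto raw text (distant supervision with BIO tags). The gazetteer and sentence are invented; a real pipeline would feed the resulting pairs to a CRF toolkit.

```python
# Distant supervision for sequence labeling: project a gazetteer of known
# entities onto raw text to produce BIO-tagged training data automatically,
# instead of hand-annotating every token.
GAZETTEER = {"interstate 75": "ROAD", "needmore road": "ROAD"}

def auto_tag(sentence, gazetteer=GAZETTEER):
    """Return (token, BIO-tag) pairs for every gazetteer match."""
    tokens = sentence.lower().split()
    tags = ["O"] * len(tokens)
    for name, label in gazetteer.items():
        parts = name.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                tags[i] = "B-" + label
                for j in range(i + 1, i + len(parts)):
                    tags[j] = "I-" + label
    return list(zip(tokens, tags))

print(auto_tag("Accident on Interstate 75 near Needmore Road"))
```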
Krishnaprasad Thirunarayan, Trust Management: Multimodal Data Perspective, Invited Tutorial, The 2015 International Conference on Collaboration Technologies and Systems (CTS 2015), June 2015
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers (Amit Sheth)
Abstract
Kno.e.sis (http://knoesis.org) is a world-class research center that uses semantic, cognitive, and perceptual computing for gathering insights from physical/IoT, cyber/Web, and social and enterprise (e.g., clinical) big data. We innovate and employ semantic web, machine learning, NLP/IR, data mining, network science and highly scalable computing techniques. Our highly interdisciplinary research impacts health and clinical applications, biomedical and translational research, epidemiology, cognitive science, social good, policy, development, etc. A majority of our $12+ million in active funds come from the NSF and NIH. In this talk, I will provide an overview of some of our major research projects.
Kno.e.sis is highly successful in its primary mission of exceptional student outcomes: our students have exceptional publication and real-world impact and our PhDs compete with their counterparts from top 10 schools for initial jobs in research universities, top industry research labs, and highly competitive companies. A key reason for Kno.e.sis' success is its unique work culture involving teamwork to solve complex problems. Practically all our work involves real-world challenges, real-world data, interdisciplinary collaborators, path-breaking research to solve challenges, real-world deployments, real-world use, and measurable real-world impact.
In this talk, I will also seek to discuss our choice of research topics and our unique ecosystem that prepares our students for exceptional careers.
This tutorial presents tools and techniques for effectively utilizing the Internet of Things (IoT) for building advanced applications, including Physical-Cyber-Social (PCS) systems. The issues and challenges related to IoT, semantic data modelling, annotation, knowledge representation (e.g., modelling for constrained environments, complexity issues, and time/location dependency of data), integration, analysis, and reasoning will be discussed. The tutorial will describe recent developments on creating annotation models and semantic description frameworks for IoT data (such as the W3C Semantic Sensor Network ontology). A review of enabling technologies and common scenarios for IoT applications from the data and knowledge engineering point of view will be discussed. Information processing, reasoning, and knowledge extraction, along with existing solutions related to these topics, will be presented. The tutorial summarizes state-of-the-art research and developments on PCS systems, IoT-related ontology development, linked data, domain knowledge integration and management, querying large-scale IoT data, and AI applications for automated knowledge extraction from real-world data.
Related: Semantic Sensor Web: http://knoesis.org/projects/ssw
Physical-Cyber-Social Computing: http://wiki.knoesis.org/index.php/PCS
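As a concrete taste of the annotation models the tutorial covers, here is a minimal sketch of describing a single sensor observation with the SOSA core of the W3C SSN ontology using rdflib (assumed installed); the sensor, property, and observation IRIs are made up for the example.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# SOSA is the lightweight core of the W3C Semantic Sensor Network ontology.
SOSA = Namespace("http://www.w3.org/ns/sosa/")
EX = Namespace("http://example.org/")  # hypothetical IRIs for this example

g = Graph()
g.bind("sosa", SOSA)

obs = EX["observation1"]
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.madeBySensor, EX["thermometer1"]))
g.add((obs, SOSA.observedProperty, EX["airTemperature"]))
g.add((obs, SOSA.hasSimpleResult, Literal(21.5, datatype=XSD.double)))
g.add((obs, SOSA.resultTime,
       Literal("2015-06-01T10:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))  # annotated observation, ready to publish
```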
Smart Data - How you and I will exploit Big Data for personalized digital hea... (Amit Sheth)
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications in driving value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for, and avoiding, an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or the Internet of Things around, on, and inside humans), public health signals (e.g., information coming from the healthcare system, such as hospital admissions), and population health signals (such as tweets by people related to asthma occurrences and allergens, and Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will describe Smart Data, which is realized by extracting value from Big Data to benefit not just large companies but each individual. If my child is an asthma patient, then for all the data relevant to my child with the four V-challenges, what I care about is simply, “How is her current health, and what is the risk of an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques, similar to the close interworking of the top brain and the bottom brain in cognitive models.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
Student Achievement Review (initially presented during the Inauguration Function of the Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at Wright State); updated since
Center overview: http://bit.ly/coe-k
Invitation: http://bit.ly/COE-invite
The current status of Linked Open Data (LOD) shows evidence of many datasets available on the Web in RDF. In the meantime, there are still many challenges for organizations to overcome in their journey of publishing five-star datasets on the Web. Those challenges are not only technical but also organizational. At this moment, when connectionist AI is gaining a wave of popularity with many applications, LOD needs to go beyond the guarantee of FAIR principles. One direction is to build a sustainable LOD ecosystem with FAIR-S principles. In parallel, LOD should serve as a catalyst for solving societal issues (LOD for Social Good) and for personal empowerment through data (Social Linked Data).
Talk at the 3rd Keystone Training School - Keyword Search in Big Linked Data - Institute for Software Technology and Interactive Systems, TU Wien, Austria, 2017
DDI-RDF Discovery Vocabulary: A Metadata Vocabulary for Documenting Research and Survey Data
Overview:
- What is DDI?
- Motivation
- Relationships to Vocabularies
- DDI-RDF Discovery Vocabulary
- Conceptual Model
6. Tim Berners-Lee 2006
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs, so that they can discover more things.
8. Linked Open Data
• Massive collection of instance data
• Primarily connected via the owl:sameAs relationship
• Excellent source of information for background knowledge
• Labeled as mainstream Semantic Web
9. Is it really mainstream Semantic Web?
• What is the relationship between the models whose instances are being linked?
• How do we query LOD without knowing the individual datasets?
• How do we perform schema-level reasoning over the LOD cloud?
10. What can be done?
• Relationships are at the heart of semantics
• LOD primarily consists of owl:sameAs links
• LOD captures instance-level relationships, but lacks class-level relationships:
o Superclass
o Subclass
o Equivalence
• How to find these relationships?
o Perform a matching of the LOD ontologies using state-of-the-art ontology matching tools.
11. Linked Open Data Alignment and Querying
Dissertation Defense, July 27th, 2012
Prateek Jain
Kno.e.sis Center
Wright State University, Dayton, OH
12. Agenda
• Motivation and significance of this research
• Research questions and proposed solutions
• State of the current research and planned work
• Questions and comments
13. Linked Open Data
• A set of best practices for publishing and connecting structured data on the Web
• Practices have been adopted by an increasing number of data providers in the past 5 years
• Latest count is at 295 datasets with over 50 billion triples (approx.)
14. Linked Open Data 2007 (May)
Linking Open Data cloud diagram (this and subsequent pages) by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
18. Linked Open Data growth (from http://www4.wiwiss.fu-berlin.de/lodcloud/state/)
Date         Datasets   Triples
2011-09-19   295        31,634,213,770 (with 503,998,829 out-links)
2010-09-22   203
2009-07-14   95
2008-09-18   45
2007-10-08   25
2007-05-01   12
19. Six years of existence: how many applications come to your mind?
23. Reality…
• “We DID NOT use the entire DBpedia or LOD. The only component of LOD which helped us with Watson was the YAGO class hierarchy present in DBpedia. We had strict information gain requirements and other components honestly did not help much.”
– Researcher with the Watson team
25. A simple query…
“Identify congress members who have voted ‘No’ on pro-environmental legislation in the past four years, with high-pollution industry in their congressional districts.”
But even with LOD we cannot answer this query.
26. Example: GovTrack
[RDF graph diagram: vote 2009-887 has the option “Aye” (vote:hasOption, rdfs:label “Aye”), which was voted by people/P000197, Nancy Pelosi (vote:votedBy, dc:title); the vote is linked via vote:hasAction to Bills:h3962, dc:title “H.R. 3962: Affordable Health Care for America Act” / “On Passage: H R 3962 Affordable Health Care for America Act”. The fragment is reconstructed as triples below.]
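To make the diagram concrete, here is a minimal rdflib sketch of the triples recoverable from it. The example.org namespaces are placeholders, not GovTrack's actual URIs, and only the properties visible in the figure are modeled.

```python
# A minimal rdflib sketch of the GovTrack fragment above.
# The example.org namespaces are placeholders, not GovTrack's real URIs.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, RDFS

VOTE = Namespace("http://example.org/govtrack/vote/")
PEOPLE = Namespace("http://example.org/govtrack/people/")
BILLS = Namespace("http://example.org/govtrack/bills/")

g = Graph()
vote = VOTE["2009-887"]          # the roll-call vote
option = VOTE["2009-887/+"]      # the "Aye" option on that vote

g.add((vote, VOTE.hasOption, option))
g.add((option, RDFS.label, Literal("Aye")))
g.add((option, VOTE.votedBy, PEOPLE["P000197"]))
g.add((PEOPLE["P000197"], DC.title, Literal("Nancy Pelosi")))
g.add((vote, VOTE.hasAction, BILLS["h3962"]))
g.add((BILLS["h3962"], DC.title,
       Literal("H.R. 3962: Affordable Health Care for America Act")))

print(g.serialize(format="turtle"))
```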
28. Our Approach
Use knowledge contributed by users to enhance existing approaches, to solve these issues:
• Ontology integration
• Detection of relationships within and across LOD Cloud datasets
• Querying multiple datasets
29. Circling Back
• LOD captures instance-level relationships, but lacks class-level relationships:
o Superclass
o Subclass
o Equivalence
31. • BLOOMS - Bootstrapping-based Linked Open Data Ontology Matching System
• Developed specifically for LOD ontologies
• Identifies schema-level links between different LOD datasets
• Aligns ontologies belonging to diverse domains using diverse data sources
32. Existing Approaches
Erhard Rahm, Philip A. Bernstein: “A survey of approaches to automatic schema matching”, The VLDB Journal 10: 334–350 (2001)
33. LOD Ontology Alignment
• Actual results from these techniques: Nation = Menstruation, Confidence = 0.9
• They perform extremely well on established benchmarks, but typically not in the wild.
• LOD ontologies are of a very different nature:
o Created by the community, for the community.
o Emphasis on number of instances, not number of meaningful relationships.
o Require solutions beyond syntactic and structural matching.
34. Rabbit out of a hat?
• Traditional auxiliary data sources (WordNet, upper-level ontologies) have limited coverage.
• Community-generated content is noisy, but:
o is rich in content and structure
o has a “self-healing” property
• Problems like ontology matching have a dimension of context associated with them.
35. Wikipedia
• The English version alone has more than 2.9 million articles
• Continually expanded by approx. 100,000 active volunteer editors
• Multiple points of view are mentioned with proper contexts
• Article creation/correction is an ongoing activity
36. Ontology Matching using Wikipedia
• On Wikipedia, categories are used to organize the entire project.
• Wikipedia's category system consists of overlapping trees.
• Simple rules for categorization (quoted at the end of this transcript)
37. BLOOMS Approach – Step 1
• Pre-process the input ontology
o Remove property restrictions
o Remove individuals, properties
• Tokenize the class names (see the sketch below)
o Remove underscores, hyphens and other delimiters
o Break down complex class names, e.g. SemanticWeb => Semantic Web
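A minimal sketch of this tokenization step in Python; the function name and the exact delimiter set are illustrative assumptions, not BLOOMS' actual code.

```python
import re

def tokenize_class_name(name: str) -> str:
    """Pre-process a class name as in step 1: strip delimiters and
    break down complex (CamelCase) names into words."""
    # Replace underscores, hyphens and similar delimiters with spaces.
    name = re.sub(r"[_\-.]+", " ", name)
    # Insert a space before an upper-case letter that follows a
    # lower-case letter or digit, e.g. SemanticWeb -> Semantic Web.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return " ".join(name.split())

assert tokenize_class_name("SemanticWeb") == "Semantic Web"
assert tokenize_class_name("semantic_web-page") == "semantic web page"
```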
38. BLOOMS Approach – Step 2
• Identify articles in Wikipedia corresponding to the concept.
o Each article related to the concept indicates one sense of the usage of the word.
• For each article found in the previous step:
o Identify the Wikipedia categories to which it belongs.
o For each category found, find its parent categories up to level 4.
• Once the “BLOOMS tree” for each sense of the source concept is created (Ts), utilize it for comparison with the “BLOOMS trees” of the target concepts (Tt); see the sketch below.
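The following sketch shows how such a tree could be built from live Wikipedia data. It uses the public MediaWiki API (action=query, prop=categories); the helper names, the breadth-first strategy, and the omission of hidden/administrative-category filtering are assumptions for illustration, not the BLOOMS implementation.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_parent_categories(title: str) -> list:
    """Parent categories of an article or category via the MediaWiki API.
    (No filtering of hidden/administrative categories, for brevity.)"""
    params = {"action": "query", "prop": "categories", "titles": title,
              "format": "json", "cllimit": "max"}
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return [c["title"] for c in page.get("categories", [])]

def build_blooms_tree(article: str, max_level: int = 4) -> dict:
    """Expand an article's categories breadth-first up to max_level,
    recording the level at which each category was first reached."""
    tree, frontier = {article: 0}, [article]
    for level in range(1, max_level + 1):
        nxt = []
        for node in frontier:
            for parent in get_parent_categories(node):
                if parent not in tree:      # category trees overlap, so
                    tree[parent] = level    # skip already-visited nodes
                    nxt.append(parent)
        frontier = nxt
    return tree
```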
39. BLOOMS Approach – Step 3
• In the tree Ts, remove all nodes whose parent node occurs in Tt, to create Ts′.
o All leaves of Ts are at level 4 or occur in Tt.
o The pruned nodes do not contribute any additional new knowledge.
• Compute the overlap Os between the source and target tree:
o O(Ts, Tt) = n / (k − 1), where n = |Ts′ ∩ Tt| (the number of nodes of Ts′ that also occur in Tt) and k = |Ts′|.
• The decision of alignment is made as follows (sketched below):
o If Ts = Tt for some Ts ∈ TC and Tt ∈ TD, then C ≡ D.
o If min{O(Ts, Tt), O(Tt, Ts)} ≥ x, then set C rdfs:subClassOf D if O(Ts, Tt) ≤ O(Tt, Ts), and set D rdfs:subClassOf C if O(Ts, Tt) ≥ O(Tt, Ts).
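A minimal sketch of the overlap computation and decision rule, representing trees as sets of category names. The threshold value x = 0.85 is an assumed placeholder, not the one used in the BLOOMS evaluation.

```python
def overlap(ts_pruned: set, tt: set) -> float:
    """O(Ts, Tt) = n / (k - 1): n nodes of the pruned source tree Ts'
    that also occur in Tt, out of k = |Ts'| nodes."""
    n = len(ts_pruned & tt)
    k = len(ts_pruned)
    return n / (k - 1) if k > 1 else 0.0

def align(ts: set, tt: set, c: str, d: str, x: float = 0.85) -> list:
    """Decision rule of step 3; x = 0.85 is an assumed threshold."""
    if ts == tt:
        return [f"{c} owl:equivalentClass {d}"]
    o_st, o_ts = overlap(ts, tt), overlap(tt, ts)
    rels = []
    if min(o_st, o_ts) >= x:
        if o_st <= o_ts:
            rels.append(f"{c} rdfs:subClassOf {d}")
        if o_st >= o_ts:
            rels.append(f"{d} rdfs:subClassOf {c}")
    return rels  # both directions (i.e. equivalence) when overlaps tie
```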
41. Evaluation Objectives
• To examine BLOOMS as a tool for the purpose of LOD ontology matching.
• To examine the ability of BLOOMS to serve as a general-purpose ontology matching system.
46. Partonomy Identification
• Currently, entities across datasets are linked primarily using the owl:sameAs relationship
• Relationships such as partonomy (part-of) and causality can allow creating even more intelligent applications, such as Watson
• Approach: PLATO (Part-Of relation finder on Linked Open Data Tool)
47. PLATO Approach
• PLATO generates all possible partonomically linked pairs between the entities in the dataset.
o Utilizes “strongly” associated entities
• Identifies the type of each entity in the pair using WordNet (see the sketch below).
o Uses class names
o Gives the lexicographer files for the synsets corresponding to these entities
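The typing step can be approximated with NLTK's WordNet interface, whose Synset.lexname() returns the lexicographer file of a synset (e.g. 'noun.substance'). This is a sketch under that assumption; PLATO's actual WordNet access is not shown in the slides.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def entity_types(class_name: str) -> set:
    """Lexicographer files of the noun synsets for a class name."""
    return {s.lexname() for s in wn.synsets(class_name, pos=wn.NOUN)}

print(entity_types("cellulose"))  # e.g. {'noun.substance'}
print(entity_types("cell_wall"))  # multi-word names use underscores
```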
49. PLATO Approach – Step 2
• PLATO generates linguistic patterns for each applicable property, based on linguistic cues suggested by Winston.
o Example: Cell Wall is made of Cellulose
• Tests the lexical patterns for each entity pair in a corpus-driven manner.
o Using the Web as a corpus
• PLATO counts the total number of web pages that contain the pattern.
o Parses each page and identifies the occurrence of the pattern.
50. PLATO Approach – Step 3
• Asserts the partonomy property with the strongest supporting evidence (sketched below):
o Cell Wall is made of Cellulose: 48
o Cellulose is made of Cell Wall: 10
• PLATO also enriches the schema by generalizing from the instance-level assertions.
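Steps 2 and 3 can be sketched together: instantiate Winston-style patterns in both directions, score each by web-page counts, and assert the best-supported candidate. The count_web_hits helper and the exact pattern set are assumptions for illustration, not PLATO's actual code.

```python
def count_web_hits(phrase: str) -> int:
    """Stand-in for the web-as-corpus lookup: the number of pages in
    which the exact pattern occurs (search backend assumed)."""
    raise NotImplementedError

# A few linguistic cues in the spirit of Winston's part-whole relations.
PATTERNS = ["{part} is part of {whole}",
            "{whole} is made of {part}",
            "{part} is a component of {whole}"]

def strongest_partonomy(a: str, b: str):
    """Instantiate each pattern in both directions (steps 2-3) and
    return the candidate with the strongest supporting evidence."""
    candidates = [p.format(part=x, whole=y)
                  for p in PATTERNS for x, y in ((a, b), (b, a))]
    scored = {c: count_web_hits(c) for c in candidates}
    best = max(scored, key=scored.get)
    return best, scored[best]  # e.g. ("Cell Wall is made of Cellulose", 48)
```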
51. Evaluation Objectives
• To examine PLATO as a tool for finding different kinds of part-of relations.
• To examine PLATO as a tool for finding part-of relations within a dataset.
• To examine PLATO as a tool for finding part-of relations across datasets.
54. Some other work
• Requirement document analysis
o Internship at Accenture
• Querying of partonomical relationships
• Operators for querying spatio-temporal-thematic data
• Plug-n-play system for BLOOMS
55. Publication timeline (BLOOMS, BLOOMS+, PLATO, and others)
2010: 1 paper at ISWC; 1 paper at the OM workshop; 1 paper at the AAAI Spring Symposium; 1 paper at GEOS
2011: 1 paper at ESWC; 1 workshop paper at ICBO; 1 patent
2012: 1 paper at ACM Hypertext
Total of 7 publications covering this research
56. Potential Applications
• Automatic domain identification of datasets
o Work currently being pursued by Sarasi
• Property alignment on the LOD cloud
o Work currently being pursued by Kalpa and Sanjaya
• Personalization of property and concept matches
o Machine learning and data mining based techniques
57. Publications
• Prateek Jain, Pascal Hitzler, Kunal Verma, Peter Z. Yeh and Amit P. Sheth, “Moving beyond sameAs with PLATO: Partonomy detection for Linked Data”. In Proceedings of the 23rd ACM Hypertext and Social Media conference (HT 2012), Milwaukee, WI, USA, June 25th–28th, 2012. (Acceptance rate 27.5%)
• Amit Krishna Joshi, Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma, Amit Sheth, Mariana Damova, “Alignment-based Querying of Linked Open Data”. In Proceedings of the 11th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2012). (To appear)
• Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod Vasquez, Mariana Damova, Pascal Hitzler and Amit P. Sheth, “Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton”. In Grigoris Antoniou, Marko Grobelnik, Elena Simperl, Bijan Parsia, Dimitris Plexousakis, Jeff Pan and Pieter De Leenheer, editors, Proceedings of the 8th Extended Semantic Web Conference 2011, volume 6643 of Lecture Notes in Computer Science, Heidelberg, 2011. Springer Berlin. (Acceptance rate 23.5%)
• Prateek Jain, Pascal Hitzler, Amit P. Sheth, Kunal Verma and Peter Z. Yeh, “Ontology Alignment for Linked Open Data”. In P. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Pan, I. Horrocks, and B. Glimm, editors, Proceedings of the 9th International Semantic Web Conference 2010, Shanghai, China, November 7th–11th, 2010, volume 6496 of Lecture Notes in Computer Science, pages 402–417, Heidelberg, 2010. Springer Berlin. (Acceptance rate 20%)
58. Publications
• Prateek Jain, Pascal Hitzler and Amit P. Sheth, “Flexible Bootstrapping-Based Ontology Alignment”. In Proceedings of the Fifth International Workshop on Ontology Matching (Shanghai, China, November 7th–11th, 2010).
• Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma, and Amit P. Sheth, “Linked Data Is Merely More Data”. In: Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, Linked Data Meets Artificial Intelligence. Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82–86. ISBN 978-1-57735-461-1
• Prateek Jain, Peter Z. Yeh, Kunal Verma, Cory Henson, and Amit Sheth, “SPARQL Query Re-writing Using Partonomy Based Transformation Rules”. In K. Janowicz, M. Raubal, and S. Levashkin, editors, Proceedings of the Third International Conference on GeoSpatial Semantics, December 3–4, 2009, Mexico City, Mexico, volume 5892/2009 of Lecture Notes in Computer Science, pages 140–158, Heidelberg, 2009. Springer Berlin.
• Prateek Jain, Kunal Verma, Alex Kass, Reymonrod G. Vasquez, “Automated Review of Natural Language Requirements Documents: Generating Useful Warnings with User-extensible Glossaries Driving a Simple State Machine”. In Kiran Deshpande, Pankaj Jalote and Sriram K. Rajamani, editors, Proceedings of the Second India Software Engineering Conference, February 23–26, 2009, Pune, India, ACM, New York, NY, 37–46. DOI= http://doi.acm.org/10.1145/1506216.1506224 (Acceptance rate 10%)
59. Publications
• Prateek Jain, Peter Z. Yeh, Kunal Verma, Alex Kass, and Amit P. Sheth, “Enhancing process-adaptation capabilities with web-based corporate radar technologies”. In Proceedings of the First International Workshop on Ontology-Supported Business Intelligence (Karlsruhe, Germany, October 27, 2008). OBI '08, vol. 308. ACM, New York, NY, 1–6. DOI= http://doi.acm.org/10.1145/1452567.1452569
• Matthew Perry, Amit P. Sheth, Farshad Hakimpour, Prateek Jain, “Supporting Complex Thematic, Spatial and Temporal Queries over Semantic Web Data”. In F. T. Fonseca, M. Andrea Rodriguez and S. Levashkin, editors, Proceedings of the Second International Conference on GeoSpatial Semantics, Mexico City, Mexico, volume 4853/2007 of Lecture Notes in Computer Science, pages 228–246, Heidelberg, 2007. Springer Berlin.
• Colin Puri, Karthik Gomadam, Prateek Jain, Peter Z. Yeh, Kunal Verma, “Multiple Ontologies in Healthcare Information Technology: Motivations and Recommendation for Ontology Mapping and Alignment”. In Proceedings of the Workshop on Working with Multiple Biomedical Ontologies (at ICBO), 26 July 2011, Buffalo, NY, USA.
• Cory Henson, Amit P. Sheth, Prateek Jain, Josh Pschorr and Terry Rapoch, “Video on the Semantic Sensor Web”. W3C Video on the Web Workshop, 12–13 December 2007, San Jose, California and Brussels, Belgium.
60. Patent
• Peter Z. Yeh, Prateek Jain, Kunal Verma, Reymonrod G. Vasquez. Title: Information Source Alignment. Filed 4th March 2011. Status: Pending.
62. Acknowledgement
• Cory Henson
o Coffee breaks, research, football, baseball, politics, life…
o First person I met while finding my way to the LSDIS lab
• Kno.e.sis lab members & support staff
• Folks at Accenture Technology Labs
o Amazing group of people to work with/for
63. Acknowledgement
• NSF Award IIS-0842129, titled “III-SGER: Spatio-Temporal-Thematic Queries of Semantic Web Data: a Study of Expressivity and Efficiency”
• NSF Award 1143717, titled “III: EAGER -- Expressive Scalable Querying over Linked Open Data”
Thanks to the members of the LOD mailing list, especially Dr. Hugh Glaser, both as a knowledge source and as a test bed.
Wikipedia’s simple rules for categorization (cf. slide 36):
• “If logical membership of one category implies logical membership of a second, then the first category should be made a subcategory.”
• “Pages are not placed directly into every possible category, only into the most specific one in any branch.”
• “Every Wikipedia article should belong to at least one category.”