The document summarizes research from the AKSW group on question answering over interlinked data. It presents their approach called SINA, which takes a natural language question as input, segments it, disambiguates terms to resources, and outputs a SPARQL query to retrieve the answer. SINA uses a hidden Markov model for segmentation and disambiguation, constructs a query graph to link relevant resources, and was evaluated on life science and DBpedia datasets with over 80% accuracy on average. The goal is to enable question answering directly over interconnected data without requiring SPARQL proficiency.
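SINA's use of a hidden Markov model for segmentation and disambiguation can be illustrated with a minimal Viterbi decoder. This is a toy sketch: the states, tokens, and probabilities below are invented for illustration, while SINA's actual model maps question segments to dataset resources learned from the interlinked data.

```python
# Toy Viterbi decoding over a hand-built HMM: given an observation
# sequence (question tokens), recover the most probable sequence of
# hidden states (here, whether a token denotes a resource or a relation).

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prob, path = max(
                (best[p][0] * trans_p[p][s] * emit_p[s][obs], best[p][1])
                for p in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values())[1]

# Invented example: label question tokens as resource vs. relation mentions.
states = ("resource", "relation")
start_p = {"resource": 0.6, "relation": 0.4}
trans_p = {"resource": {"resource": 0.3, "relation": 0.7},
           "relation": {"resource": 0.7, "relation": 0.3}}
emit_p = {"resource": {"drug": 0.7, "interacts": 0.1},
          "relation": {"drug": 0.2, "interacts": 0.8}}

print(viterbi(["drug", "interacts"], states, start_p, trans_p, emit_p))
# → ['resource', 'relation']
```

The decoded state path plays the role of SINA's joint segmentation/disambiguation step; the resources on the path would then be linked into a query graph and serialized as SPARQL.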
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technology stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. Today we have an enormous amount of RDF data available in the Web of Data, but only a few applications really exploit its potential. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Actively Learning to Rank Semantic Associations for Personalized Contextual E... (Federico Bianchi)
The Semantic Web. ESWC 2017. Lecture Notes in Computer Science, vol 10249. Springer, Cham
Knowledge Graphs (KGs) represent a large number of Semantic Associations (SAs), i.e., chains of relations that may reveal interesting and unknown connections between different types of entities. Applications for the contextual exploration of KGs help users explore information extracted from a KG, including SAs, while they are reading an input text. Because of the large number of SAs that can be extracted from a text, a first challenge in these applications is to effectively determine which SAs are most interesting to the users, defining a suitable ranking function over SAs. However, since different users may have different interests, an additional challenge is to personalize this ranking function to match individual users' preferences. In this paper we introduce a novel active-learning-to-rank model that lets a user rate small samples of SAs, which are used to iteratively learn a personalized ranking function. Experiments conducted with two data sets show that the approach is able to improve the quality of the ranking function with a limited number of user interactions.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering, how does it differ from density-based (e.g., DBSCAN) clustering, and how can it be used for outlier detection?
- What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection?
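The connection between density-based clustering and outlier detection can be made concrete with a compact DBSCAN sketch: points that cannot be density-reached from any core point come out labeled as noise. This is a toy pure-Python illustration; the parameter values and data are chosen for the example, not taken from the talk.

```python
# A compact DBSCAN sketch illustrating how density-based clustering
# labels low-density points as noise (-1), i.e., as outliers.

def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points within Euclidean distance eps of points[i] (incl. itself).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not a core point: tentatively noise
            continue
        cluster += 1                # start a new cluster from this core point
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # expand only through core points
    return labels

# Two dense groups plus one far-away point that ends up as noise (-1).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))
# → [0, 0, 0, 1, 1, 1, -1]
```

OPTICS generalizes this idea by ordering points by reachability distance instead of fixing a single eps; soft clustering instead assigns each point a degree of membership in every cluster.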
Basic introduction to recommender systems + Implementing a content-based recommender system by leveraging knowledge encoded into Linked Open Data datasets
Our daily life is strongly influenced by decision-making processes based on large amounts of data; both the data values and the meaningful (semantic) relationships between them can be captured in knowledge graphs.
Since knowledge graphs are processed automatically, they must be of high quality on both fronts.
This thesis focuses both on improving the data quality of knowledge graphs and on assessing their semantic quality.
On the one hand, it describes a framework to generate knowledge graphs with extensible data transformations that can clean data ("RML + FnO"), later extended so that data transformations can be performed automatically and independently of any implementation ("FnO.io").
On the other hand, it describes a validation approach built on a rule-based reasoning solution ("Validatrr"). This approach takes the semantics of the data into account and enables targeted improvements to a knowledge graph through detailed root-cause explanations of quality problems.
Thanks to these contributions, data values are cleaned while a knowledge graph is generated, and existing knowledge graphs can be completed using automatic data transformations. Our validation approach makes it possible to accurately assess the quality of the semantic relationships in knowledge graphs.
Together, these contributions make it easier to improve data quality and assess semantic quality for knowledge graphs, ensuring that knowledge graphs can be used correctly in decision-making processes.
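To give a flavor of rule-based validation with root-cause explanations, here is a deliberately tiny sketch. The rule, property name, and triples are invented for illustration; Validatrr itself operates on RDF with a full reasoner rather than ad-hoc Python checks.

```python
# A minimal flavor of rule-based validation: each rule inspects the
# data and reports a quality problem together with an explanation of
# its root cause, so the graph can be improved in a targeted way.
import re

def check_date_literals(triples):
    """Flag objects of :birthDate that are not ISO 8601 dates, explaining why."""
    problems = []
    for s, p, o in triples:
        if p == ":birthDate" and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", o):
            problems.append((s, f"value {o!r} is not an ISO 8601 date"))
    return problems

data = [(":alice", ":birthDate", "1984-03-01"),
        (":bob", ":birthDate", "March 1984")]
print(check_date_literals(data))
# → [(':bob', "value 'March 1984' is not an ISO 8601 date")]
```

The explanation attached to each violation is the key point: it names the offending subject and value, which is what makes targeted repair possible.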
Natural Language Processing on Non-Textual Data (gpano)
Talk by Casey Stella, presented at the SF Data Mining Hadoop Summit Meetup, on June 8, 2015. Notebook available at https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb
Dynamic Search Using Semantics & Statistics (Paul Hofmann)
This presentation shows 3 applications of successfully combining semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence and the retrieved documents, together with the positive or negative feedback provided by the user. A demo shows how the system recognizes a user's intent in health-care search.
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ... (Marko Rodriguez)
The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exists, supports usage research, and whose instantiation is scalable to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real world instantiation of the presented abstract ontology is a semantic network model of the scholarly community which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.
With the continuously increasing number of datasets that are published in the Web of Data and form part of the Linked Open Data Cloud, it becomes more and more essential to identify resources that correspond to the same real-world object, in order to interlink web resources and set the basis for large-scale data integration. This requirement becomes apparent in a multitude of domains ranging from science (marine research, biology, astronomy, pharmacology) to semantic publishing and cultural domains. In this context, instance matching is of crucial importance.
It is therefore essential to develop, along with instance and entity matching systems, benchmarks that determine the weak and strong points of those systems, as well as their overall quality, in order to support users in deciding which system to use for their needs. Hence, well-defined, good-quality benchmarks are important for comparing the performance of the developed instance matching systems.
In this tutorial we aim at:
- Discussing the state-of-the-art instance matching benchmarks
- Presenting the benchmark design principles
- Providing an analysis of the performance results of instance matching systems for the presented benchmarks
- Presenting the research directions that should be pursued to create novel benchmarks that answer the needs of the Linked Data paradigm.
Tutorial web page: http://www.ics.forth.gr/isl/BenchmarksTutorial/
Structural syntactic metrics for RDF Datasets that correlate with high level quality deficiencies.
The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been focused on quality of data prior to their publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets prior to publication on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided and/or resolved prior to the publication of a dataset. We then propose a set of metrics to measure and identify these quality deficiencies in a dataset. This way, we enable the assessment and identification of undesirable quality characteristics of a dataset through our proposed metrics.
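One illustrative metric of the structural-syntactic kind the paper proposes is the fraction of literal values that carry no datatype or language tag, a deficiency worth resolving before publication. The triple representation and the metric definition below are simplified sketches for illustration, not the paper's exact formulations.

```python
# A toy structural metric over a dataset's triples: the share of
# literal objects that are plain (no datatype or language tag).
# Triples here are (subject, predicate, object, object_kind) tuples,
# where object_kind is 'iri', 'typed_literal', or 'plain_literal'.

def untyped_literal_ratio(triples):
    """Return the fraction of literal objects lacking a datatype/language tag."""
    literals = [t for t in triples if t[3] != "iri"]
    if not literals:
        return 0.0
    untyped = [t for t in literals if t[3] == "plain_literal"]
    return len(untyped) / len(literals)

data = [
    ("ex:s1", "ex:name", "Alice", "plain_literal"),
    ("ex:s1", "ex:age", '"42"^^xsd:integer', "typed_literal"),
    ("ex:s1", "ex:knows", "ex:s2", "iri"),
]
print(untyped_literal_ratio(data))  # → 0.5 (1 of 2 literals is untyped)
```

A publisher would compute a battery of such metrics over the dataset and resolve the flagged deficiencies before pushing the data to the LOD cloud.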
Slides for paper presentation at DEXA 2015:
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri:
Quality Metrics for Linked Open Data. DEXA (1) 2015: 144-152
Directed versus undirected network analysis of student essays (Roy Clariana)
IWALS 2018
6th International Workshop on Advanced Learning Sciences
Perspectives on the Learner: Cognition, Brain, and Education
University of Pittsburgh, USA JUNE 6-8, 2018
Hoaxy is a tool to visualize the spread of URLs pointing to low-credibility web documents. We use features related to propagation dynamics to classify duplicates of low-credibility claims.
Crowdsourced query augmentation through the semantic discovery of domain spec... (Trey Grainger)
Talk Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. We believe that the links between similar users' queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
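The core log-mining idea can be sketched very simply: treat queries issued within the same user session as related, and keep phrase pairs that co-occur across enough sessions. The session data and threshold below are invented for illustration; the actual system described in the talk is considerably more sophisticated.

```python
# A hedged sketch of mining semantic relationships from search logs:
# count how often pairs of query phrases co-occur in user sessions and
# keep pairs that recur, which filters out one-off noise.
from collections import Counter
from itertools import combinations

def related_phrases(sessions, min_count=2):
    """Return phrase pairs that co-occur in at least min_count sessions."""
    pair_counts = Counter()
    for queries in sessions:
        # Deduplicate and sort so each unordered pair is counted once.
        for a, b in combinations(sorted(set(queries)), 2):
            pair_counts[(a, b)] += 1
    return {pair for pair, count in pair_counts.items() if count >= min_count}

sessions = [
    ["java developer", "software engineer"],
    ["java developer", "software engineer", "jvm"],
    ["registered nurse", "rn"],
]
print(related_phrases(sessions))
# → {('java developer', 'software engineer')}
```

Because the signal comes from users' own reformulations rather than from the document text, the discovered relationships are language agnostic and directly interpretable.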
Using Graph and Transformer Embeddings for Vector Based Retrieval (Sujit Pal)
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
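For contrast with the embedding schemes discussed next, here is a minimal sketch of the term-based baseline: TF-IDF weights derived from whole-document statistics, compared with cosine similarity. The toy documents are invented for illustration.

```python
# A minimal TF-IDF + cosine similarity sketch: weights come from
# whole-document statistics (term and document frequencies), not from
# the context windows that distributional embeddings exploit.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

docs = ["graph embeddings encode citations",
        "graph embeddings encode papers",
        "transformers are language models"]
vecs = tfidf_vectors(docs)
# Documents sharing terms score higher; no shared terms means zero score,
# which is exactly the limitation embeddings address.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

The zero-overlap failure mode in the last line is the motivation for dense embeddings: node2vec and Transformer vectors can still place such documents close together when their contexts are similar.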
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier's broad-coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, which encodes papers using the citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, which are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user- and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
Inside the Atoms: Mining a Network of Networks and Beyond by HangHang Tong at... (BigMine)
Networks (i.e., graphs) appear in many high-impact applications. Often these networks are collected from different sources, at different times, and at different granularities. In this talk, I will present our recent work on mining such multiple networks. First, we will present two models: one for modeling a set of inter-connected networks (NoN), and the other for modeling a set of inter-connected co-evolving time series (NoT). For both models, we will show that by treating networks as context, we are able to model more complicated real-world applications. Second, we will present some algorithmic examples of how to do mining with such new models, including ranking, imputation, and prediction. Finally, we will demonstrate the effectiveness of our new models and algorithms in applications including bioinformatics and sensor networks.
Filtering Inaccurate Entity Co-references on the Linked Open Data (ebrahim_bagheri)
A method for identifying incorrect sameAs links on the Linked Open Data cloud
Details published in:
John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic:
Filtering Inaccurate Entity Co-references on the Linked Open Data. DEXA (1) 2015: 128-143
Schema-Agnostic Queries (SAQ-2015): Semantic Web Challenge (Andre Freitas)
The Challenge in a Nutshell
To create a query mechanism that semantically matches schema-agnostic user queries to knowledge base elements
The Goal
To support easy querying over complex databases with large schemata, relieving users from the need to understand the formal representation of the data
Relevance
The increase in the size and in the semantic heterogeneity of database schemas is bringing new requirements for users querying and searching structured data. At this scale it can become unfeasible for data consumers to be familiar with the representation of the data in order to query it. At the center of this discussion is the semantic gap between users and databases, which becomes more central as the scale and complexity of the data grow. Addressing this gap is a fundamental part of the Semantic Web vision.
Schema-agnostic query mechanisms aim at allowing users to be abstracted from the representation of the data, supporting the automatic matching between queries and databases. This challenge aims at emphasizing the role of schema-agnosticism as a key requirement for contemporary database management, by providing a test collection for evaluating flexible query and search systems over structured data in terms of their level of schema-agnosticism (i.e. their ability to map a query issued with the user terminology and structure, mapping it to the dataset vocabulary). The challenge is instantiated in the context of Semantic Web datasets.
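The matching problem at the heart of the challenge can be illustrated with a toy vocabulary matcher: map a user's term to the closest element of a dataset vocabulary, here via character-trigram overlap. Real schema-agnostic systems use far richer semantic models; the vocabulary and the similarity measure below are illustrative assumptions.

```python
# A toy schema-agnostic matcher: pick the dataset vocabulary element
# whose character trigrams overlap most (Jaccard) with the user's term.

def trigrams(s):
    """Character trigrams of s, lowercased and padded at the boundaries."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def best_match(term, vocabulary):
    """Return the vocabulary element most similar to the user's term."""
    def overlap(v):
        a, b = trigrams(term), trigrams(v)
        return len(a & b) / len(a | b)   # Jaccard similarity of trigram sets
    return max(vocabulary, key=overlap)

# Invented mini-vocabulary in the style of DBpedia ontology properties.
vocab = ["dbo:spouse", "dbo:birthPlace", "dbo:populationTotal"]
print(best_match("population", vocab))  # → dbo:populationTotal
```

A full system would also map the query's structure, not just its terms, onto the dataset, which is exactly the ability the challenge's test collection is designed to measure.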
Data Tactics Data Science Brown Bag, April 2014 (Rich Heimann)
This is a presentation we give internally every quarter as part of our Data Science Brown Bag Series. This presentation covered different types of soft clustering techniques, all of which the team currently applies depending on the complexity of the data and of customer problems. If you are interested in learning more about working with L-3 Data Tactics or in joining the L-3 Data Tactics Data Science team, please contact us soon! Thank you.
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study (Maribel Acosta Deibe)
Summary of crowdsourcing studies to assess the quality of knowledge graphs and complete missing values. Results focus on findings over the DBpedia knowledge graph ( https://wiki.dbpedia.org/).
Related publications:
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
Deep Learning for Information Retrieval: Models, Progress, & Opportunities (Matthew Lease)
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Improving Semantic Search Using Query Log Analysis (Stuart Wrigley)
Despite the attention Semantic Search is continuously gaining, several challenges affecting tool performance and user experience remain unsolved. Among these are: matching user terms with the search space, adopting view-based interfaces in the Open Web, and supporting users while building their queries. This paper proposes an approach that moves a step towards tackling these challenges by creating models of usage of Linked Data concepts and properties extracted from semantic query logs as a source of collaborative knowledge. We use two sets of query logs from the USEWOD workshops to create our models and show the potential of using them in the mentioned areas.
GraphRAG Is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t... (Databricks)
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and greater than $5 billion in R&D investment only to see less than 5% of the drugs make it to market.
AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient and run safer clinical trials.
However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is sparse across our company as well as external public databases, every new technology requires a different data processing pipeline and new data comes at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up-to-date with the pace of scientific discovery.
To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AZ scientist to generate novel target hypotheses, for any disease, leveraging all of our data.
In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble and create projections of the graph from 100s of sources. I will also describe the NLP pipelines we have built – leveraging spacy, bioBERT or snorkel – to reliably extract meaningful relations between entities and add them to our knowledge graph.
VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Applications to Spatiotemporally Dependent Data
Explains how deep learning creates howlers using commonly used annotation tools for images. We have identified several such howlers. Essentially, this presentation outlines the deficiencies of deep learning networks. We also explain the theoretical reasoning for these, building on Bengio's recent paper. The presentation also contains solutions which address these gaps, such as capsule networks, transfer learning, meta-learning and federated learning.
Metrics for Evaluating Quality of Embeddings for Ontological Concepts (Saeedeh Shekarpour)
Although there is an emerging trend towards generating embeddings for primarily unstructured data and, recently, for structured data, no systematic suite for measuring the quality of embeddings has been proposed yet.
This deficiency is further sensed with respect to embeddings generated for structured data because there are no concrete evaluation metrics measuring the quality of the encoded structure as well as semantic patterns in the embedding space.
In this paper, we introduce a framework containing three distinct tasks concerned with the individual aspects of ontological concepts: (i) the categorization aspect, (ii) the hierarchical aspect, and (iii) the relational aspect.
Then, in the scope of each task, a number of intrinsic metrics are proposed for evaluating the quality of the embeddings.
Furthermore, w.r.t. this framework, multiple experimental studies were run to compare the quality of the available embedding models.
Employing this framework in future research can reduce misjudgment and provide greater insight about quality comparisons of embeddings for ontological concepts.
The sampled data and code are available at https://github.com/alshargi/Concept2vec under the GNU General Public License v3.0.
CEVO: Comprehensive EVent Ontology Enhancing Cognitive Annotation on Relations (Saeedeh Shekarpour)
While the general analysis of named entities has received substantial research attention on unstructured as well as structured data, the analysis of relations among named entities has received limited focus. In fact, a review of the literature revealed a deficiency in research on the abstract conceptualization required to organize relations. We believe that such an abstract conceptualization can benefit various communities and applications such as natural language processing, information extraction, machine learning, and ontology engineering. In this paper, we present the Comprehensive EVent Ontology (CEVO), built on Levin's conceptual hierarchy of English verbs, which categorizes verbs by shared meaning and syntactic behavior. We present the fundamental concepts and requirements for this ontology. Furthermore, we present three use cases employing the CEVO ontology for annotation tasks: (i) annotating relations in plain text, (ii) annotating ontological properties, and (iii) linking textual relations to ontological properties. These use cases demonstrate the benefits of using CEVO for annotation: (i) annotating English verbs from an abstract conceptualization, (ii) playing the role of an upper ontology for organizing ontological properties, and (iii) facilitating the annotation of text relations using any underlying vocabulary. This resource is available at https://shekarpour.github.io/cevo.io/ under the https://w3id.org/cevo namespace.
1. Question Answering on Interlinked Data
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5, 2013, IBM Research Center
3. Motivation
Text queries (either keyword or natural language) are:
• a simple retrieval approach
• popular
• implicit and ambiguous in their semantics.
SPARQL queries require:
• knowledge about the ontology
• proficiency in formulating formal queries
• explicit and unambiguous semantics.
AKSW group - Question Answering on Interlinked Data (published in www2013)
4. Comparison of Search Approaches
[Figure: search approaches positioned along two axes, data-semantic aware vs. data-semantic unaware and keyword-based query vs. natural language query. Information retrieval is data-semantic unaware and keyword-based; question answering systems are data-semantic aware and accept natural language queries. Our approach, SINA, is data-semantic aware.]
5. Example
Which television shows were created by Walt Disney?

select * where {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}
6. Aim and Challenges
Aim: question answering over a set of interlinked data sources.
• Query segmentation.
• Resource disambiguation.
• Construction of a formal query (expressed in SPARQL).
7. Further Challenges over Interlinked Data
1. Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.
2. Constructing a federated formal query across different datasets requires exploiting links between the datasets on both the schema and instance levels.
9. Test Bed Datasets
* One single dataset: DBpedia.
* Three interlinked datasets from the life sciences:
  - Drugbank: a comprehensive knowledge base containing information about drugs, drug targets (i.e., proteins), interactions, and enzymes.
  - Diseasome: contains information about diseases and the genes associated with these diseases.
  - Sider: contains information about drugs and their side effects.
10. Main Characteristics of Federated Queries
1. Queries requiring fused information, e.g., side effects of drugs used for Tuberculosis.
2. Queries targeting combined information, e.g., side effects and enzymes of drugs used for asthma.
3. Queries requiring keyword expansion, e.g., side effects of Valdecoxib.
[Figure: an example query graph spanning DrugBank, Sider, and Diseasome - a DrugBank drug ?v0 with enzyme ?v1 is linked via sameAs to a Sider drug ?v2 with side effect ?v3, and is a possible drug for the disease Asthma in Diseasome.]
11. Challenge 1: Query Segmentation and Resource Disambiguation
• Sample question: What are the side effects of drugs used for Tuberculosis?
• Transformed to the 4-tuple (side # effect # drug # Tuberculosis).
• Different segmentations are possible:
  1. (side effect # drug # Tuberculosis)
  2. (side effect drug # Tuberculosis)
• Each valid segment is mapped to resources in the underlying knowledge bases.
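The naive enumeration of candidate segmentations described above can be sketched as follows; this is an illustrative reconstruction rather than the authors' implementation, and the function name is my own:

```python
from itertools import combinations

def segmentations(keywords):
    """Enumerate every way to split an ordered keyword tuple into
    contiguous segments (2^(n-1) possibilities for n keywords)."""
    n = len(keywords)
    results = []
    for k in range(n):
        # choose which of the n-1 gaps between keywords become breaks
        for breaks in combinations(range(1, n), k):
            cuts = [0, *breaks, n]
            results.append(tuple(" ".join(keywords[a:b])
                                 for a, b in zip(cuts, cuts[1:])))
    return results

segs = segmentations(["side", "effect", "drug", "Tuberculosis"])
# 8 segmentations in total, among them ('side effect', 'drug', 'Tuberculosis')
# and ('side effect drug', 'Tuberculosis')
```

Each candidate segment would then be checked against resource labels in the knowledge bases, and only segments with at least one matching resource survive.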
12. Segment Validation
• Original tuple: (side # effect # drug # Tuberculosis).
• A naive approach is used for finding all valid segments.

Valid segments | Samples of candidate resources
side effect    | 1. sider:class:sideeffect  2. sider:property:side_effects
drug           | 1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
tuberculosis   | 1. diseases:1154  2. side_effects:C0041296
14. Hidden Markov Model
• A statistical model containing a set of states.
• Moving from one state to another generates a sequence of observations.
• The probability of entering a state depends only on the previous state.
• The output is the most likely sequence of states generating the observed sequence.
15. State Space
• A state represents a knowledge base resource.
• The state space contains all resources in the knowledge base.
• In practice, we prune the state space by excluding irrelevant states.
• An unknown entity state is added, comprising all resources that are no longer available in the pruned state space.
• Extension of the state space with reasoning: the state space is extended by including resources inferred through lightweight owl:sameAs reasoning.
16. Bootstrapping the Model Parameters: Emission Probability
• The set-similarity level measures the difference between the label and the segment in terms of the number of words, using the Jaccard similarity.
• The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label, using the Levenshtein distance.
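A minimal sketch of how these two similarity levels could be combined into an emission score; the multiplicative combination is an assumption for illustration (the paper defines the exact formula), and the function names are hypothetical:

```python
def jaccard(a, b):
    """Set-similarity level: Jaccard similarity over word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s, t):
    """Classic edit distance via dynamic programming (row by row)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def emission_probability(segment, label):
    """Combine set- and string-similarity into one emission score.
    The product used here is an illustrative assumption."""
    seg_words, lab_words = segment.lower().split(), label.lower().split()
    set_sim = jaccard(seg_words, lab_words)
    # for each segment word, normalized similarity to the closest label word
    str_sim = sum(
        max(1 - levenshtein(w, l) / max(len(w), len(l)) for l in lab_words)
        for w in seg_words) / len(seg_words)
    return set_sim * str_sim

p = emission_probability("side effect", "side effects")
```

An exact label match scores 1.0, while near-matches such as "side effect" against the label "side effects" score strictly between 0 and 1.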
17. Bootstrapping the Model Parameters: Transition and Initial Probability
• The transition probability and the initial probability are computed based on the semantic relatedness of two resources.
• Semantic relatedness is based on two values: distance and connectivity degree.
• These two values are transformed into hub and authority values using the HITS algorithm.
• The initial probability and the transition probability are defined as a uniform distribution over the hub and authority values.
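The HITS step can be illustrated with a plain iterative implementation; the toy edge list stands in for the distance/connectivity-derived graph and is an invented example:

```python
def hits(edges, iterations=50):
    """Plain iterative HITS: a node's authority is the sum of the hub
    scores of nodes pointing at it, and its hub score is the sum of the
    authority scores it points at, with L2 normalization each round."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Invented toy graph: two resources both point at sider:sideEffect,
# which therefore becomes the dominant authority.
hub, auth = hits([("drugbank:drug", "sider:sideEffect"),
                  ("diseasome:1154", "sider:sideEffect")])
```

The resulting hub and authority values would then feed the uniform distributions used for the initial and transition probabilities.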
18. Evaluation of Bootstrapping
• The accuracy of different distribution functions for the transition probability, i.e., normal, Zipfian, and uniform distributions.
• We ran the distribution functions with two different inputs: distance and connectivity degree values, as well as hub and authority values.
19. Viterbi Algorithm
Aim: find the most likely path generating the sequence of input keywords.
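A textbook Viterbi decoder over dictionary-based parameters illustrates this step; the states and probabilities in the example are invented toy values, not the model's bootstrapped parameters:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (standard
    Viterbi dynamic programming over dict-based parameters)."""
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        # for each state, keep the highest-probability path ending there
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

# Toy parameters: two candidate resources for the keyword
# observations "side effect" and "drug" (all numbers invented).
states = ["sider:sideEffect", "drugbank:drug"]
start = {"sider:sideEffect": 0.5, "drugbank:drug": 0.5}
trans = {"sider:sideEffect": {"sider:sideEffect": 0.2, "drugbank:drug": 0.8},
         "drugbank:drug": {"sider:sideEffect": 0.8, "drugbank:drug": 0.2}}
emit = {"sider:sideEffect": {"side effect": 0.9, "drug": 0.1},
        "drugbank:drug": {"side effect": 0.1, "drug": 0.9}}
prob, path = viterbi(["side effect", "drug"], states, start, trans, emit)
# path == ['sider:sideEffect', 'drugbank:drug']
```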
20. Output of the HMM for the following query:
Which television shows were created by Walt Disney?

Probability | Path of states
0.0023      | dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014      | dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4     | dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4     | dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5     | dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney
21. Query Construction
22. Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), a directed, connected multi-graph.
Forward chaining:
1. CT: comprehensive type.
2. CD: comprehensive domain.
3. CR: comprehensive range.
23. Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}.
Output: a query graph QG = (V, E), a directed, connected multi-graph.
Generating the Incomplete Query Graph (IQG) by initializing vertices and primary edges:
• A vertex is added to the IQG (1) if r is an instance, (2) if r is a class.
• Properties are added along with zero, one, or two vertices.
24. Query Construction Method
Example: What are the side effects of drugs used for Tuberculosis?
• diseasome:1154 (type: instance)
• diseasome:possibleDrug (type: property)
• sider:sideEffect (type: property)
[Figure: two disjoint sub-graphs - Graph 1 connects 1154 to ?v0 via possibleDrug; Graph 2 connects ?v1 to ?v2 via sideEffect.]
25. Query Construction Method
Connecting the sub-graphs of an IQG:
1. Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
2. Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.
Candidate connecting edges:
• Direct properties: ?v0 ?p ?v1 .
• Properties via an owl:sameAs link:
  (1) ?v0 owl:sameAs ?x . ?x ?p ?v1 .
  (2) ?v0 ?p ?x . ?x owl:sameAs ?v1 .
  (3) ?v0 owl:sameAs ?x . ?x ?p ?y . ?y owl:sameAs ?v1 .
[Figure: two candidate templates for joining the sub-graphs of the running example, combining the possibleDrug edge from 1154 and the sideEffect edge over the variables ?v0, ?v1, and ?v2.]
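The sub-graph connection step can be sketched greedily; note this sketch uses a Kruskal-style sort-and-union approach rather than Prim's incremental variant, with invented costs reflecting the number of owl:sameAs hops a pattern needs:

```python
def connect_subgraphs(subgraphs, candidate_edges):
    """Pick a minimum-cost set of connecting edges that joins all
    disjoint sub-graphs (Kruskal-style: sort by cost, union-find to
    skip edges inside an already-connected component).
    candidate_edges: (cost, subgraph_a, subgraph_b, triple_pattern)."""
    parent = {g: g for g in subgraphs}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    chosen = []
    for cost, a, b, pattern in sorted(candidate_edges):
        ra, rb = find(a), find(b)
        if ra != rb:                # edge joins two disjoint components
            parent[ra] = rb
            chosen.append(pattern)
    return chosen

# Invented costs: a direct property is cheaper than one via owl:sameAs.
edges = [
    (1, "G1", "G2", "?v0 ?p ?v1 ."),
    (2, "G1", "G2", "?v0 owl:sameAs ?x . ?x ?p ?v1 ."),
]
patterns = connect_subgraphs(["G1", "G2"], edges)
# picks only the cheapest pattern: ['?v0 ?p ?v1 .']
```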
26. Evaluation
Goal of the experiment: how well the (1) resource disambiguation and (2) query construction approaches perform.
Measurement of performance:
1. Disambiguation: Mean Reciprocal Rank (MRR).
2. Query construction: precision and recall.
Benchmark:
1. A natural-language query and the equivalent conjunctive SPARQL query.
2. 25 queries on the 3 interlinked datasets Drugbank, Sider, and Diseasome.
3. The QALD1 and QALD3 benchmarks for DBpedia.
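MRR itself is straightforward to compute; a minimal sketch with invented ranks:

```python
def mean_reciprocal_rank(ranks):
    """MRR over queries: each entry is the 1-based rank at which the
    correct resource appears in that query's candidate list, or None
    if it never appears (contributing 0)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Hypothetical example: the correct resource is ranked 1st, 2nd, and
# 1st in three queries' candidate lists.
mrr = mean_reciprocal_rank([1, 2, 1])  # (1 + 0.5 + 1) / 3
```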
27. Evaluation Using Life-Science Datasets
Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90
28. Evaluation Using DBpedia
• QALD3 benchmark:
  - contains 100 questions.
  - 32 original questions can be answered correctly.
• QALD1 benchmark:
  - contains 50 questions.
  - 7 complex questions.
  - 13 questions require information beyond DBpedia, i.e., from YAGO and FOAF.
  - 14 questions were slightly modified to remove expansion and cleaning problems.
  - MRR of disambiguation = 96%
  - Query construction accuracy = 83%
29. Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
30. Related Work