+ Question Answering on Interlinked Data
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, ...
+ Motivation
Retrieving information from LOD

AKSW group - Question Answering on Interlinked Data (published in www2013)

...
+ Motivation
Text queries (either keyword or natural language) are:

•  Simple retrieval ...
+ Comparison of Search Approaches

Data-Semantic aware
Data-Semantic unaware
Our approach: SINA

Question
Answering
S...
+ Example

Which televisions shows were created by Walt Disney?
select * where
{ ?v0 a
?v0 dbo:creator

A...
+ Aim and Challenges

Aim: Question answering over a set of interlinked data sources.
•  Query segmentation.
•  Resou...
+ Further Challenges over Interlinked Data
1.  Information for answering a certain question can be spread
among different...
+ SINA Architecture

+ Test bed datasets
*  One single dataset: DBpedia.
*  Three interlinked datasets
from life-science:

✓  Drugbank: is a
c...
+ Main characteristics of federated queries
1.  Queries requiring fused information, e.g. side
effects of drugs used for ...
+ Challenge 1: Query Segmentation and Resource
Disambiguation

•  Sample question: What is the side effe...
Segment validation

  Original tuple: (side # effect # drug # Tuberculosis).
Using a naive approach for...
+ Concurrent Segmentation and Disambiguation
Hidden Markov Model

A statistical model containing a set of states.
Moving from one state to another stat...
State Space

A state represents a knowledge base resource.
Contains all resources in the knowledge base.
...
Bootstrapping the Model Parameters
Emission Probability

The set-similarity level measures the difference between t...

Bootstrapping the Model Parameters
Transition Probability & Initial Probability
•  Computing the transition probabilit...
Evaluation of Bootstrapping

•  The accuracy of different distribution functions, i.e., Normal, Zipfian and
uniform di...
+ Viterbi Algorithm
Aim: The most likely path generating the sequence of input keywords.

+ Output of the HMM for the following query:
Which televisions shows were created by Walt Disney?
Probability
0.0023
...
+ Query Construction
Query Construction Method

Input: set of resources R = {r_1, r_2, ..., r_n}
Output: A query graph QG = (V, E)
is a directed, c...
Query Construction Method

Input: set of resources R = {r_1, r_2, ..., r_n}
Output: A query graph QG = (V, E)
is a directed, c...
Query Construction Method

Example: What is the side effects of drugs used for Tuberculosis?
•  diseasome:1154
•  ...
Query Construction Method

Connecting Sub-graphs of an IQG:
1.  Minimum spanning tree: a minimum set of edges (i.e., p...
Evaluation

Goal of experiment:
How well:
1.  resource disambiguation
2.  query construction approaches perform.
Measureme...
Evaluation using life-science datasets

Without reasoning: precision = 0.91 recall = 0.88
With reasoning:
precision = 0.95...
+ Evaluation using DBpedia
•  QALD3 Benchmark:
   ✓  contains 100 questions.
   ✓  32 original questions can be answered...
Runtime

Parallelization over three components:
1.  Segment validation
2.  Resource retrieval
3.  Query construction

+ Related work


Thank you

Saeedeh Shekarpour
shekarpour@informatik-leipzig.de
sa.shekarpour@gmail.com
Sina presentation in IBM

  1. + Question Answering on Interlinked Data. Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer. AKSW Research Group, Leipzig University. December 5 2013, IBM Research Center.
  2. + Motivation: Retrieving information from LOD. AKSW group - Question Answering on Interlinked Data (published in www2013)
  3. + Motivation
     Text queries (either keyword or natural language) are:
       •  Simple retrieval approach
       •  Popular
       •  Implicit and ambiguous semantics.
     SPARQL queries require:
       •  Knowledge about the ontology
       •  Proficiency in formulating formal queries
       •  Explicit and unambiguous semantics.
  4. + Comparison of Search Approaches
     [Figure. Labels: Data-Semantic aware, Data-Semantic unaware; Keyword-based query, Natural language query; Information Retrieval, Question Answering Systems, Our approach: SINA.]
  5. + Example
     Which televisions shows were created by Walt Disney?
       select * where
       { ?v0 a dbo:TelevisionShow.
         ?v0 dbo:creator dbr:Walt_Disney. }
  6. + Aim and Challenges
     Aim: Question answering over a set of interlinked data sources.
       •  Query segmentation.
       •  Resource disambiguation.
       •  To construct a formal query (expressed in SPARQL).
  7. + Further Challenges over Interlinked Data
       1.  Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.
       2.  Constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels.
  8. + SINA Architecture
  9. + Test bed datasets
       *  One single dataset: DBpedia.
       *  Three interlinked datasets from life-science:
            ✓  Drugbank: a comprehensive knowledge base containing information about drugs, drug target (i.e. protein) information, interactions and enzymes.
            ✓  Diseasome: contains information about diseases and genes associated with these diseases.
            ✓  Sider: contains information about drugs and their side effects.
  10. + Main characteristics of federated queries
       1.  Queries requiring fused information, e.g. side effects of drugs used for Tuberculosis.
       2.  Queries targeting combined information, e.g. side effects and enzymes of drugs used for Asthma.
       3.  Queries requiring keyword expansion, e.g. side effects of Valdecoxib.
     [Figure: query graph spanning DrugBank, Sider and Diseasome, with variables ?v0 to ?v3, the classes Drug, Disease, Side Effect and Enzymes, the instance Asthma, and edges labelled a, enzyme, side effect and sameAs.]
  11. + Challenge 1: Query Segmentation and Resource Disambiguation
       •  Sample question: What is the side effects of drugs used for Tuberculosis?
       •  Transformed to 4-tuple (side # effect # drug # Tuberculosis)
       •  Different segmentations are possible:
            1.  (side effect # drug # Tuberculosis)
            2.  (side effect drug # Tuberculosis)
     Mapping of the segments to the resources in the underlying knowledge bases.
     Each valid segment
  12. Segment validation
       ✓  Original tuple: (side # effect # drug # Tuberculosis).
       ✓  Using a naive approach for finding all valid segments.

     Valid Segments | Samples of Candidate Resources
     side effect    | 1. sider:class:sideeffect  2. sider:property:side_effects
     drug           | 1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
     tuberculosis   | 1. diseases:1154  2. side_effects:C0041296
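
The naive segment validation above can be sketched as follows. This is only an illustration under stated assumptions: the function name `valid_segments` and the tiny `LABELS` set are hypothetical stand-ins for a lookup against the real knowledge-base labels, and exact string matching replaces whatever matching SINA actually uses.

```python
def valid_segments(keywords, labels):
    """Return every contiguous run of keywords that matches a known label."""
    found = []
    n = len(keywords)
    for start in range(n):
        for end in range(start + 1, n + 1):
            # Join the keywords start..end-1 into one candidate segment.
            segment = " ".join(keywords[start:end])
            if segment in labels:
                found.append(segment)
    return found


# Hypothetical toy label index, not the real DrugBank/Sider/Diseasome labels.
LABELS = {"side effect", "drug", "tuberculosis"}

segments = valid_segments(["side", "effect", "drug", "tuberculosis"], LABELS)
print(segments)
```

With n keywords this checks n(n+1)/2 candidate segments, which is why the approach is called naive.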
  13. + Concurrent Segmentation and Disambiguation
  14. Hidden Markov Model
       •  A statistical model containing a set of states.
       •  Moving from one state to another state generates a sequence of observations.
       •  The probability of entering a state depends only on the previous state.
       •  The output is the most likely sequence of states generating the sequence of observations.
  15. State Space
       •  A state represents a knowledge base resource.
       •  The state space contains all resources in the knowledge base.
       •  In practice, we prune the state space by excluding irrelevant states.
       •  An unknown-entity state is added, comprising all resources that are no longer available in the pruned state space.
       •  Extension of the state space with reasoning: the state space is extended with resources inferred from lightweight owl:sameAs reasoning.
  16. Bootstrapping the Model Parameters: Emission Probability
       •  The set-similarity level measures the difference between the label and the segment in terms of the number of words using the Jaccard similarity.
       •  The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label using the Levenshtein distance.
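
The two similarity levels can be illustrated with a minimal sketch. The slide does not say how the levels are normalized or combined into an emission probability, so the normalization in `string_similarity` (edit distance divided by the longer word's length, averaged over segment words) is an assumption made only for this example.

```python
def jaccard(segment_words, label_words):
    """Set-similarity level: shared words divided by all distinct words."""
    a, b = set(segment_words), set(label_words)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Classic edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(segment_words, label_words):
    """String-similarity level: each segment word is compared with its most
    similar label word. Normalization is an assumption of this sketch.
    Expects non-empty word lists with non-empty words."""
    scores = []
    for w in segment_words:
        best = max(1 - levenshtein(w, l) / max(len(w), len(l)) for l in label_words)
        scores.append(best)
    return sum(scores) / len(scores)


segment = ["side", "effect"]
label = ["side", "effects"]
print(jaccard(segment, label), string_similarity(segment, label))
```

In this toy case the word sets share only "side", so the Jaccard level is low, while the string level stays high because "effect" and "effects" differ by one character.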
  17. Bootstrapping the Model Parameters: Transition Probability & Initial Probability
       •  The transition probability and initial probability are computed based on the semantic relatedness of two resources.
       •  Semantic relatedness is based on two values: distance and connectivity degree.
       •  We transform these two values to hub and authority values using the HITS algorithm.
       •  Initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
  18. Evaluation of Bootstrapping
       •  The accuracy of different distribution functions, i.e., Normal, Zipfian and uniform distributions for transition probability.
       •  We ran the distribution functions with two different inputs, i.e. distance and connectivity degree values as well as hub and authority values.
  19. + Viterbi Algorithm
     Aim: The most likely path generating the sequence of input keywords.
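
A minimal sketch of the Viterbi algorithm on a toy HMM may help here. The states reuse resource names from the Walt Disney example, but every probability below is invented for illustration and none of it comes from SINA's bootstrapped parameters.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                ((best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                 for p in states),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Toy model: all numbers are assumptions made for this example only.
states = ["dbo:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"]
start_p = {"dbo:TelevisionShow": 0.6, "dbo:creator": 0.2, "dbr:Walt_Disney": 0.2}
trans_p = {
    "dbo:TelevisionShow": {"dbo:TelevisionShow": 0.1, "dbo:creator": 0.7, "dbr:Walt_Disney": 0.2},
    "dbo:creator": {"dbo:TelevisionShow": 0.1, "dbo:creator": 0.1, "dbr:Walt_Disney": 0.8},
    "dbr:Walt_Disney": {"dbo:TelevisionShow": 0.4, "dbo:creator": 0.4, "dbr:Walt_Disney": 0.2},
}
emit_p = {
    "dbo:TelevisionShow": {"television show": 0.9},
    "dbo:creator": {"created": 0.8},
    "dbr:Walt_Disney": {"walt disney": 0.9},
}

path, prob = viterbi(["television show", "created", "walt disney"],
                     states, start_p, trans_p, emit_p)
print(path, prob)
```

A production version would usually work with log probabilities to avoid underflow on long keyword sequences, and would return the top-k paths rather than only the best one.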
  20. + Output of the HMM for the following query: Which televisions shows were created by Walt Disney?

     Probability | Path of states
     0.0023      | dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
     0.0014      | dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
     5.89E-4     | dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
     3.53E-4     | dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
     3.76E-5     | dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney
  21. + Query Construction
  22. Query Construction Method
     Input: set of resources R = {r_1, r_2, ..., r_n}
     Output: A query graph QG = (V, E) is a directed, connected multi-graph.
     Forward Chaining:
       1.  CT: Comprehensive type.
       2.  CD: Comprehensive domain.
       3.  CR: Comprehensive range.
  23. Query Construction Method
     Input: set of resources R = {r_1, r_2, ..., r_n}
     Output: A query graph QG = (V, E) is a directed, connected multi-graph.
     Generating the Incomplete Query Graph (IQG): initializing vertices and primary edges.
       •  A vertex is added to IQG (1) if r is an instance, (2) if r is a class.
       •  Properties are added along with zero, one or two vertices.
  24. Query Construction Method
     Example: What is the side effects of drugs used for Tuberculosis?
       •  diseasome:1154 (type instance)
       •  diseasome:possibleDrug (type property)
       •  sider:sideEffect (type property)
     [Figure: Graph 1 links 1154 and ?v0 via possibleDrug; Graph 2 links ?v1 and ?v2 via sideEffect.]
  25. Query Construction Method
     Connecting Sub-graphs of an IQG:
       1.  Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
       2.  Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.
            •  Direct properties: ?v0 ?p ?v1.
            •  Properties via owl:sameAs link.
                 (1) ?v0 owl:sameAs ?x. ?x ?p ?v1.
                 (2) ?v0 ?p ?x. ?x owl:sameAs ?v1.
                 (3) ?v0 owl:sameAs ?x. ?x ?p ?y. ?y owl:sameAs ?v1.
     [Figure: Template 1 and Template 2, two ways of connecting the possibleDrug and sideEffect sub-graphs over 1154, ?v0, ?v1 and ?v2.]
  26. Evaluation
     Goal of experiment: how well (1) resource disambiguation and (2) query construction approaches perform.
     Measurement of the performance:
       1.  For disambiguation using the Mean Reciprocal Rank (MRR).
       2.  Query construction in terms of precision and recall.
     Benchmark:
       1.  A natural-language query and the equivalent conjunctive SPARQL query.
       2.  25 queries on the 3 interlinked datasets Drugbank, Sider and Diseasome.
       3.  QALD1 and QALD3 benchmark for DBpedia.
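
For readers unfamiliar with the disambiguation metric, Mean Reciprocal Rank averages 1/rank of the correct resource over all queries, counting 0 when the correct resource is not returned. The sketch below uses hypothetical rankings built from resource names in the Walt Disney example; they are not results from the evaluation.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the correct item; 0 when it is missing."""
    total = 0.0
    for candidates, correct in zip(ranked_lists, gold):
        if correct in candidates:
            total += 1.0 / (candidates.index(correct) + 1)
    return total / len(gold)


# Hypothetical rankings for two keyword segments.
rankings = [
    ["dbo:TelevisionShow", "dbr:TelevisionShow"],    # correct resource ranked 1st
    ["dbr:Category:Walt_Disney", "dbr:Walt_Disney"],  # correct resource ranked 2nd
]
gold = ["dbo:TelevisionShow", "dbr:Walt_Disney"]

mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```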
  27. Evaluation using life-science datasets
     Without reasoning: precision = 0.91, recall = 0.88
     With reasoning: precision = 0.95, recall = 0.90
  28. + Evaluation using DBpedia
       •  QALD3 Benchmark:
            ✓  contains 100 questions.
            ✓  32 original questions can be answered correctly.
       •  QALD1 Benchmark:
            ✓  contains 50 questions.
            ✓  7 complex questions.
            ✓  13 questions requiring information beyond DBpedia, i.e., from YAGO and FOAF.
            ✓  14 were slightly modified to remove expansion and cleaning problems.
            ✓  MRR of disambiguation = 96%
            ✓  Query construction accuracy = 83%
  29. Runtime
     Parallelization over three components:
       1.  Segment validation
       2.  Resource retrieval
       3.  Query construction
  30. + Related work
  31. Thank you. Saeedeh Shekarpour, shekarpour@informatik-leipzig.de, sa.shekarpour@gmail.com
