The document describes an intelligent natural language question answering system called ENLIGHT. It discusses what question answering is and how it relates to information retrieval and information extraction. It then covers the general approach taken by question answering systems, including question analysis, document retrieval and processing, answer extraction, and answer construction. It also discusses techniques used by ENLIGHT, such as handling semantic symmetry and ambiguous modification, and incorporating learning. ENLIGHT is shown to have better precision and faster response time compared to other systems.
Intelligent Natural Language QA System Overview
1. Intelligent Natural Language QA System
MANISH JOSHI
RAJENDRA AKERKAR
2. Open Domain Question Answering
What is Question Answering?
How is QA related to IR, IE?
Some issues related to QA
Question taxonomies
General approach to QA
3. Question Answering Systems
These systems try to provide exact information as an answer in response to a natural language query raised by the user.
Motivation: given a question, the system should provide an answer instead of requiring the user to search for it in a set of documents.
Example:
Q: What year was Mozart born?
A: Mozart was born in 1756.
4. Information Retrieval
Document is the unit of information
Answers questions indirectly
One has to search within the document
Results: (ranked) list based on estimated relevance
Effective approaches are predominantly statistical (“bag of words”)
QA = (very short) passage retrieval with natural language questions (not queries)
5. Information Extraction
Task
Identify messages that fall under a number of specific topics
Extract information according to pre-defined templates
Place the information into frame-like database records
Limitations
Templates are hand-crafted by human experts
Templates are domain dependent and not easily portable
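To make the template idea concrete, here is a small hedged sketch: a hand-crafted pattern fills a frame-like record for one narrow topic (person birth years). The pattern and field names are invented for illustration; porting it to another domain would require writing a new pattern by hand, which is exactly the limitation noted above.

import re

# Hand-crafted template for one specific topic: person birth years.
BORN_TEMPLATE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<year>\d{4})")

def extract_birth_records(text):
    # Fill a frame-like database record for every match of the template
    return [m.groupdict() for m in BORN_TEMPLATE.finditer(text)]

print(extract_birth_records("Mozart was born in 1756."))
# [{'person': 'Mozart', 'year': '1756'}]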
6. Issues
Applications
Source of the answers
Structured data — natural language queries on databases
A fixed collection or book — encyclopedia
Web data
Domain-independent vs. domain-specific
Users
Casual users vs. Regular users — Profile, History, etc.
May be maintained for regular users
7. Question Taxonomy
Factual questions: answer is often found in a text snippet
from one or more documents
Questions that may have yes/no answers
wh questions (who, where, when, etc.)
what, which questions are hard
Questions may be phrased as requests or commands
Questions requiring simple reasoning: Some world knowledge and elementary reasoning may be required to relate the question with the answer. why, how questions
e.g. How did Socrates die? (By) drinking poisoned wine.
8. Question Taxonomy
Context questions: Questions have to be answered in the
context of previous interactions with the user
Who assassinated Indira Gandhi?
When did this happen?
List questions: Fusion of partial answers scattered over
several documents is necessary
Ex. - List 3 major rice producing nations.
How do I assemble a bicycle?
10. General Approach
Question analysis: Find type of object that answers question:
"when" -time, date "who" -person, organization, etc.
Document collection preprocessing: Prepare documents
for real-time query processing
Document retrieval (IR): Using the (augmented) question, retrieve a set of possibly relevant documents/passages using IR
11. General Approach
Document processing (IE): Search documents for entities
of the desired type and in appropriate relations using NLP
Answer extraction and ranking: Extract and rank
candidate answers from the documents
Answer construction: Provide (links to) context, evidence, etc.
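Taken together, slides 10 and 11 describe a pipeline. The toy sketch below only shows how the stages could be chained; each stand-in function is deliberately trivial and the names are illustrative, not part of ENLIGHT.

# Toy stand-ins for each stage; a real system replaces these with IR and NLP components.
def analyse_question(q):
    words = q.lower().rstrip("?").split()
    return [w for w in words if w not in {"what", "who", "when", "where", "was", "is"}]

def retrieve_passages(keywords, corpus):
    return [s for s in corpus if set(s.lower().split()) & set(keywords)]

def extract_and_rank(keywords, passages):
    return sorted(passages,
                  key=lambda s: len(set(s.lower().split()) & set(keywords)),
                  reverse=True)

def answer_question(question, corpus):
    keywords = analyse_question(question)           # question analysis
    passages = retrieve_passages(keywords, corpus)   # document retrieval (IR)
    ranked = extract_and_rank(keywords, passages)    # answer extraction and ranking
    return ranked[0] if ranked else None             # answer construction omitted

print(answer_question("What year was Mozart born?", ["Mozart was born in 1756."]))
# Mozart was born in 1756.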
12. Question Analysis
Identify semantic type of the entity sought by the question
when, where, who — easy to handle
which, what — ambiguous
e.g. What was the Beatles’ first hit single?
Determine additional constraints on the answer entity
key words that will be used to locate candidate
answer-bearing sentences
relations (syntactic/semantic) that should hold between
a candidate answer entity and other entities mentioned
in the question
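A minimal sketch of the first step, mapping the question word to an expected answer type; it also illustrates why what/which questions are harder, since the type must come from the head noun rather than the question word. The rules, lexicon entries, and type names below are illustrative assumptions only.

WH_TYPES = {"when": "DATE", "who": "PERSON", "where": "LOCATION", "how many": "NUMBER"}

# Tiny lexicon for the head noun of what/which questions (illustrative only)
NOUN_TYPES = {"year": "DATE", "single": "MUSIC_WORK", "volcano": "LOCATION", "city": "LOCATION"}

def expected_answer_type(question):
    q = question.lower().rstrip("?")
    for wh, answer_type in WH_TYPES.items():
        if q.startswith(wh):
            return answer_type          # when/who/where map directly to a type
    if q.startswith(("what", "which")):
        # Ambiguous question words: fall back to the first known head noun
        for word in q.split():
            if word in NOUN_TYPES:
                return NOUN_TYPES[word]
    return "UNKNOWN"

print(expected_answer_type("What year was Mozart born?"))       # DATE
print(expected_answer_type("Who assassinated Indira Gandhi?"))  # PERSON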
13. Document Processing
Preprocessing: Detailed analysis of all texts in the corpus
may be done a priori
one group annotates terms with one of 50 semantic
tags which are indexed along with terms
Retrieval: An initial set of candidate answer-bearing documents is selected from a large collection
Boolean retrieval methods may be used profitably
Passage retrieval may be more appropriate
14. Document Processing
Analysis:
Part of speech tagging
Named entity identification: recognizes multi-word
strings as names of companies/persons, locations/addresses,
quantities, etc.
Shallow/deep syntactic analysis: Obtains information
about syntactic relations, semantic roles
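As a rough illustration of this analysis step, the snippet below runs part-of-speech tagging and named entity identification with NLTK. NLTK is used here only as a stand-in; the slides do not state which analysers ENLIGHT itself relies on for this stage (QTAG is mentioned later for tagging).

import nltk  # requires the 'punkt', 'averaged_perceptron_tagger',
             # 'maxent_ne_chunker' and 'words' data packages

sentence = "Olympus Mons is the largest volcano in the Solar System."

tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity identification

# Print every string recognised as a named entity together with its label
for node in tree:
    if isinstance(node, nltk.Tree):
        print(node.label(), " ".join(word for word, tag in node.leaves()))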
15. History
MURAX (Kupiec, 1993)
was designed to answer questions from the Trivial Pursuit
general-knowledge board game – drawing answers from
Grolier’s on-line encyclopaedia (1990).
Text Retrieval Conference (TREC). TREC was started in 1992
with the aim of supporting information retrieval research by
providing the infrastructure necessary for large-scale
evaluation of text retrieval methodologies.
The QA track was first included as part of TREC in 1999 with
seventeen research groups entering one or more systems.
16. Techniques for performing open-domain question
answering
Manual and automatically constructed question analysers,
Document retrieval specifically for question answering,
Semantic type answer extraction,
Answer extraction via automatically acquired surface matching text patterns,
principled target processing combined with document retrieval for
definition questions,
and various approaches to sentence simplification which aid in the
generation of concise definitions.
17. Answer Extraction
Look for strings whose semantic type matches that of the
expected answer - matching may include subsumption
(incorporating something under a more general category)
Check additional constraints
Select a window around matching candidate and
calculate word overlap between window and query;
OR
Check how many distinct question keywords are found
in a matching sentence, their order of occurrence, etc.
Check syntactic/semantic role of matching candidate
Semantic Symmetry
Ambiguous Modification
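A minimal sketch of the window-overlap idea described above: take a window of words around the matching candidate and count how many distinct question keywords fall inside it. The window size and scoring are illustrative choices, not ENLIGHT's actual parameters.

def window_overlap_score(tokens, candidate_index, question_keywords, window=5):
    """Count distinct question keywords inside a window around the candidate answer.

    tokens            : the answer-bearing sentence as a list of words
    candidate_index   : position of the matching candidate string in tokens
    question_keywords : set of (stemmed) keywords from the question
    """
    lo = max(0, candidate_index - window)
    hi = min(len(tokens), candidate_index + window + 1)
    window_words = set(tokens[lo:hi])
    return len(window_words & set(question_keywords))

sent = "wolfgang amadeus mozart was born in salzburg in 1756".split()
print(window_overlap_score(sent, sent.index("1756"), {"mozart", "born", "year"}))  # 1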
18. Semantic Symmetry
Question – Who killed militants?
Militants killed five innocents in Doda District.
After 6 hour long encounter army soldiers killed 3
Militants.
For this question we need sentences in which the word ‘militants’ is the object of ‘killed’, but keyword matching also returns the first sentence, where ‘militants’ acts as the subject.
Semantic symmetry is a linguistic phenomenon that occurs when an entity acts as the subject in some sentences and as the object in others.
19. Example
The following example illustrates the phenomenon of semantic symmetry and the problems it causes.
Question : Who visited President of India?
Candidate Answer 1: George Bush visited President of India
Candidate Answer 2: President of India visited flood affected area of
Mumbai.
The two candidate sentences are similar at the word level, but they have very different meanings.
20. Some more examples showing semantic symmetry
(1) The birds ate the snake. / The snake ate the bird.
    (What does the snake eat?)
(2) Communists in India are supporting the UPA government. / Small parties are supporting Communists in Kerala.
    (Whom are the Communists supporting?)
21. Ambiguous Modification
Ambiguous modification is a linguistic phenomenon that occurs when an adjective in a sentence may modify more than one noun.
Question : What is the largest volcano in the Solar System?
Candidate Answer 1: In the Solar System, the largest planet
Jupiter has several volcanoes. ---- Wrong
Candidate Answer 2: Olympus Mons, the largest volcano in
the solar system. --- Correct
In the first sentence ‘largest’ modifies the word ‘planet’, whereas in the second sentence ‘largest’ modifies the word ‘volcano’.
22. Approaches to tackle the problem
Boris Katz and James Lin of MIT developed a system
SAPERE that handles problems occurring due to semantic
symmetry and ambiguous modification.
These problems occur at the semantic level.
To deal with problems occurring at the semantic level, detailed information at the syntactic level is gathered in all these approaches.
The system developed by Katz and Lin gives results after utilizing syntactic relations. These typical S-V-O ternary relations are obtained by processing the information gathered by the Minipar functional dependency parser.
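As a rough sketch of why ternary relations resolve semantic symmetry, the example below represents each sentence as a (subject, verb, object) triple and compares it with the relation sought by the question. The triples are written by hand here purely for illustration, whereas SAPERE derives them from Minipar output.

from collections import namedtuple

Relation = namedtuple("Relation", ["subject", "verb", "object"])

# Hand-written triples for the two candidate answers from slide 19
question_relation = Relation(subject=None, verb="visit", object="President of India")
candidates = {
    "George Bush visited President of India":
        Relation("George Bush", "visit", "President of India"),
    "President of India visited flood affected area of Mumbai":
        Relation("President of India", "visit", "flood affected area of Mumbai"),
}

for sentence, rel in candidates.items():
    # Accept only candidates whose verb and object match the question's relation
    ok = rel.verb == question_relation.verb and rel.object == question_relation.object
    print("ACCEPT" if ok else "REJECT", "-", sentence)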
23. Our Approach
To deal with problems at the semantic level, most of the available approaches need to obtain and work on information gathered at the syntactic level.
We have proposed a new approach to deal with the
problems caused by Linguistic phenomena of Semantic
Symmetry and Ambiguous Modification.
The algorithms based on our approach remove wrong sentences from the answer with the help of information obtained at the lexical level (lexical analysis).
24. Algorithm for Handling Semantic Symmetry
Rule 1 -
If (the sequence of keywords in the question and the candidate answer matches) then
    If (the POS of the verb keyword is the same) then
        Candidate answer is correct
Rule 2 -
If (the sequence of keywords in the question and the candidate answer does not match) then
    If (the POS of the verb keyword is not the same) then
        Candidate answer is correct
Otherwise -
    Candidate answer is wrong
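A minimal sketch of how Rules 1 and 2 might be applied, assuming the question keywords and the POS tag of the verb keyword are already available from earlier processing. The function and argument names are illustrative and are not taken from ENLIGHT's code.

def keyword_order(tokens, keywords):
    # The question keywords in the order they occur in this token sequence
    return [t for t in tokens if t in keywords]

def accept_by_semantic_symmetry(question_tokens, answer_tokens, keywords,
                                question_verb_pos, answer_verb_pos):
    same_order = keyword_order(question_tokens, keywords) == \
                 keyword_order(answer_tokens, keywords)
    same_verb_pos = question_verb_pos == answer_verb_pos

    if same_order and same_verb_pos:          # Rule 1
        return True
    if not same_order and not same_verb_pos:  # Rule 2
        return True
    return False                              # otherwise: wrong answer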
25. Algorithm for Handling Ambiguous Modification
We identify the adjective as Adj, the scope-defining noun as SN, and the identifier noun as IN.
Rules –
If the sentence contains the keywords in the following order –
    Adj α SN        (where α indicates a string of zero or more keywords)
Then
    Rule 1-a: If α is IN    == Correct Answer, or
    Rule 1-b: If α is blank == Correct Answer
Else
    Rule 2: If α is anything else == Wrong Answer
26. Algorithm for Handling Ambiguous Modification
(Cont.)
If the sentence contains the keywords in the following order –
    SN α Adj β IN        (where α and β indicate strings of zero or more keywords)
Then
    Rule 3: If β is blank == Correct Answer (the value of α does not matter)
Else
    Rule 4: If β is anything else == Wrong Answer
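A minimal sketch of Rules 1-4, assuming the adjective (Adj), scope-defining noun (SN) and identifier noun (IN) have already been identified and the candidate sentence is reduced to its keywords. Names and the hand-reduced keyword lists are illustrative assumptions.

def accept_by_ambiguous_modification(keywords, adj, sn, identifier_noun):
    """Return True if the candidate keyword sequence is accepted, False otherwise."""
    if adj not in keywords or sn not in keywords:
        return False
    a, s = keywords.index(adj), keywords.index(sn)

    if a < s:
        # Pattern "Adj alpha SN": accept if alpha is IN (Rule 1-a) or empty (Rule 1-b),
        # reject anything else (Rule 2)
        alpha = keywords[a + 1:s]
        return alpha == [] or alpha == [identifier_noun]

    # Pattern "SN alpha Adj beta IN": accept only if beta is empty (Rule 3),
    # reject otherwise (Rule 4)
    if identifier_noun in keywords[a + 1:]:
        beta = keywords[a + 1:keywords.index(identifier_noun, a + 1)]
        return beta == []
    return False

# The volcano example from slide 21, keyword lists reduced by hand for illustration
print(accept_by_ambiguous_modification(
    ["Olympus Mons", "largest", "volcano", "Solar System"],
    adj="largest", sn="volcano", identifier_noun="Olympus Mons"))   # True
print(accept_by_ambiguous_modification(
    ["Solar System", "largest", "planet", "Jupiter", "volcano"],    # 'volcanoes' stemmed
    adj="largest", sn="volcano", identifier_noun="Olympus Mons"))   # False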
27. Working System - ENLIGHT
We have developed a system that answers questions using a ‘keyword-based matching’ paradigm.
We have incorporated the newly formulated algorithms into the system and obtained good results.
29. Preprocessing
This module prepares the platform for the intelligent and effective interface.
It transforms raw-format data into a well-organized corpus with the help of the following activities:
Keyword Extraction
Sentence Segmentation
Handling of Abbreviations and Punctuation Marks
Tokenization
Stemming
Identifying Group of Words with Specific Meaning
Shallow Parsing
Reference Resolution
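As a rough sketch of a few of the activities listed above (sentence segmentation, tokenization, stop-word removal, stemming), the snippet below uses NLTK purely for illustration; the slides do not name the tools ENLIGHT uses for this stage.

from nltk.tokenize import sent_tokenize, word_tokenize  # needs the 'punkt' data
from nltk.corpus import stopwords                        # needs the 'stopwords' data
from nltk.stem import PorterStemmer

def build_corpus(raw_text):
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    corpus = []
    for sentence in sent_tokenize(raw_text):              # sentence segmentation
        tokens = word_tokenize(sentence.lower())           # tokenization
        keywords = [stemmer.stem(t) for t in tokens
                    if t.isalnum() and t not in stop]      # stop-word removal + stemming
        corpus.append({"sentence": sentence, "keywords": keywords})
    return corpus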
30. Question Analysis
Question Tokenization
Question Classification
Corpus Management
Various database tables are created to manage the vast data
InfoKeywords
QuestionKeyword
QuestionAnswer
CorpusSentences
Abbreviations
Apostrophes
StopWords
Answer Retrieval
Answer Searching
Answer Generation
31. Answer Rescoring
Handling problems caused due to linguistic phenomena
using shallow parsing based algorithms
Semantic Symmetry
Ambiguous Modification
Intelligence Incorporation
Learning
Rote Learning
Feedback
Can Improve
Satisfactory
Wrong Answer
Loose criterion
Automated Classification
32. Results
Preciseness
Response Time
Adaptability
33. Preciseness
                                              ENLIGHT    Basic Keyword Matching
Average number of sentences returned
as answer                                        3              34.6
Average number of correct sentences              2.63            6
Average precision                               84 %            32 %
34. Response Time (ENLIGHT Vs Sapere)
Type of Data and No. of Words          Time Required by QTAG    Time Required by Minipar
                                       (Used in ENLIGHT)        (Used in Sapere)
News extract, Times of India,
202 words                              1.71 s                   2.88 s
Reply, START QA System, 251 words      1.89 s                   3.11 s
Google Search Engine Result            1.55 s                   2.86 s
Yahoo Search Engine Results            1.67 s                   3.13 s
AVERAGE                                1.705 s                  2.995 s
35. Adaptability
Handling Additional Keywords
A question like ‘Who killed the Prime Minister?’ can also be handled by the ENLIGHT system.
Use of synonyms
If the question and answer contain synonyms, the ENLIGHT system can associate the two words using the learning phase.
36. References
L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Natural Language Engineering, 7(4), December 2001.
Manish Joshi, Rajendra Akerkar, The ENLIGHT System, Intelligent Natural Language System, Journal of Digital Information Management, June 2007.