Knowledge acquisition using automated techniques
1. Methods of Knowledge Extraction
Deepti Aggarwal
SIEL|SERL, IIIT-Hyderabad, India
2. Agenda
Introduction to the Web as a knowledge repository
Automated extraction techniques (input sources, extracted structures, input pre-processing, extraction methods, output generation)
Issues with automated extraction
3. What is knowledge?
Familiarity with someone or something, gained through experience
Includes facts, information, descriptions, skills
4. Types of Knowledge
Explicit Knowledge:
Always present explicitly in records
Objective facts having a definite answer
E.g., Hyderabad is the capital of A.P.
Implicit Knowledge:
Not present explicitly for analysis
Cultural beliefs with subjective judgments
E.g., Hyderabad is the best city to live in India.
6. How is knowledge represented over the web?
Millions of documents, blogs, forums and social networks scattered across the web
Diverse topics, different formats, from diverse people in diverse languages, with different points of view
7. Benefits of knowledge extraction over the Web
Explicit knowledge:
Question Answering systems
Search engines
Validating knowledge
Tracking a particular piece of information
Implicit knowledge:
Predicting market, polls etc.
Community advertisements
11. Working of automated extraction systems
[Diagram: Input sources -> Extraction system (defining output structures, input pre-processing, extraction methods, output processing) -> Database of all facts, relations]
16. 1. Named Entity: Definition
It is an atomic element in a body of text.
Types: person, organization, location etc.
Different named entities, when linked together, form a relation.
17. 1. Named Entity: An example
Sachin Tendulkar was born in Bombay.
Sachin Tendulkar: NE of type 'Person'
Bombay: NE of type 'Location'
18. 2. Named Entity Relationship: Structure
Subject - Relation - Object
Subject: NE of any type
Relation: Verb, Adjective, Adverb
Object: NE of any type
22. NLP libraries:
Splitting each sentence into tokens (words, digits) using a Sentence Tokenizer
Recognizing language constructs (nouns, verbs, pronouns) using a Part-of-speech Tagger
Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
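The two steps above can be sketched in a few lines of Python. This is only a toy illustration, not the libraries used in the talk: `TOY_LEXICON`, `tokenize` and `pos_tag` are hypothetical names, and the tagger is a simple lookup with a capitalisation fallback, whereas real taggers are trained on annotated text.

```python
import re

# Toy lexicon for illustration only; a real tagger is trained on annotated text.
TOY_LEXICON = {"was": "VBD", "born": "VBN", "in": "IN"}

def tokenize(sentence):
    """Split a sentence into word, number, and punctuation tokens."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", sentence)

def pos_tag(tokens):
    """Tag tokens by lexicon lookup; capitalised unknown words become NNP."""
    tagged = []
    for token in tokens:
        if token.lower() in TOY_LEXICON:
            tag = TOY_LEXICON[token.lower()]
        elif token[0].isupper():
            tag = "NNP"   # proper noun
        elif token.isdigit():
            tag = "CD"    # cardinal number
        elif not token.isalnum():
            tag = "."     # punctuation
        else:
            tag = "NN"    # default: common noun
        tagged.append((token, tag))
    return tagged

tagged = pos_tag(tokenize("Sachin Tendulkar was born in Bombay."))
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```

On the example sentence this reproduces the tags shown on the slide, plus a tag for the final full stop.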
23. NLP libraries (contd.):
Linking individual constituents of a sentence with a Parser to form a parse tree
Identifying types of named entity using a Named Entity Recognizer
Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
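As a rough sketch of what a Named Entity Recognizer produces, the snippet below matches token sequences against a small list of known names. The `GAZETTEER` table and `recognize_entities` function are hypothetical; real recognizers rely on much larger vocabularies or trained models.

```python
# Hypothetical gazetteer: real recognizers use far larger lists or a trained model.
GAZETTEER = {
    ("Sachin", "Tendulkar"): "PERSON",
    ("Bombay",): "LOCATION",
}

def recognize_entities(tokens):
    """Return (entity_text, entity_type, start_index) for the longest gazetteer matches."""
    entities = []
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        for length in range(max_len, 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in GAZETTEER:
                entities.append((" ".join(candidate), GAZETTEER[candidate], i))
                i += len(candidate)  # skip past the matched entity
                break
        else:
            i += 1  # no entity starts at this token

    return entities

tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Bombay"]
print(recognize_entities(tokens))
```

Trying the longest candidate first keeps "Sachin Tendulkar" together as one PERSON entity instead of matching its parts separately.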
24. NLP libraries (contd.):
Identifying all co-references and replacing them with the actual entity using a Co-reference Resolution tool
Identifying the specific meaning of a word using Word Sense Disambiguation
External vocabularies: MindNet, DBpedia, WordNet
E.g., contextual meaning of 'crane': noun-bird, verb-lift/move
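One simple way to picture word sense disambiguation for 'crane' is a gloss-overlap heuristic in the spirit of the simplified Lesk algorithm: pick the sense whose definition shares the most words with the sentence. This is a swapped-in illustration, not necessarily the method used in the talk. The `SENSES` glosses below are invented for the example; an external vocabulary such as WordNet would supply the real ones.

```python
# Hypothetical mini sense inventory; WordNet would supply real glosses.
SENSES = {
    "crane": {
        "noun-bird": "large bird with long legs and a long neck that wades in water",
        "verb-lift/move": "lift or move a heavy load with a machine, or stretch the neck to see",
    }
}

STOPWORDS = {"a", "an", "the", "to", "of", "in", "with", "and", "or", "that", "its"}

def content_words(text):
    """Lower-case the words of a text and drop stopwords and trailing punctuation."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get)

print(disambiguate("crane", "The crane stood in the water on its long legs."))
print(disambiguate("crane", "Workers crane the heavy load onto the truck."))
```

With so little context the overlap counts are small, which is one way contextual errors enter the pipeline.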
26. Extracting relationships among NEs: Standard process
1. Identify named entities within a sentence.
2. Find the verb or adjective that connects the identified named entities.
3. Connect them together to form a relation.
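The three steps above can be sketched as follows, assuming the sentence has already been tagged and its named entities located. The function name `extract_relations` and the span format are assumptions made for this illustration.

```python
def extract_relations(tagged_tokens, entities):
    """Return (subject, relation, object) triples for neighbouring entity pairs.

    tagged_tokens: list of (token, pos_tag) pairs
    entities: list of (text, start, end) token spans, end exclusive, in sentence order
    """
    triples = []
    # Step 1 is assumed done: `entities` holds the named entities of the sentence.
    for (subj, _, subj_end), (obj, obj_start, _) in zip(entities, entities[1:]):
        between = tagged_tokens[subj_end:obj_start]
        # Step 2: keep verbs (VB*) and adjectives (JJ*) found between the two entities.
        connectors = [tok for tok, tag in between if tag.startswith(("VB", "JJ"))]
        if connectors:
            # Step 3: connect subject, relation words and object into one triple.
            triples.append((subj, " ".join(connectors), obj))
    return triples

tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
entities = [("Sachin Tendulkar", 0, 2), ("Bombay", 5, 6)]
print(extract_relations(tagged, entities))
```

For the example sentence this yields the triple (Sachin Tendulkar, was born, Bombay), matching the Subject - Relation - Object structure.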
27. Extracting relationships among NEs: Required process
1. Identify part-of-speech constructs: noun, verb, adjective etc.
2. Determine co-references, acronyms and abbreviations.
3. Connect them together to form a relationship.
28. Extraction Methods
Natural Language Processing: rule based.
Based on sentence structure
E.g., for the English language, a rule can be “noun-verb-noun”
Machine Learning: supervised and unsupervised learning.
Features are detected from the training data
E.g., to extract instances of medical diseases, the system is trained on the symptoms of each given disease.
29. Extraction Methods (contd.)
Other methods: vocabulary-based systems, context-based clustering.
E.g., maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when their birth place is known.
Hybrid:
NLP-based libraries pre-process the input data, then a machine learning approach extracts the relations using an external vocabulary such as WordNet.
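The mapping-file idea from the vocabulary-based example can be sketched as two lookup tables. The table contents and the `infer_nationality` name are hypothetical, and treating birth place as nationality is only the heuristic described on the slide, not a reliable fact about any person.

```python
# Hypothetical mapping tables; a real system would load these from maintained files.
CITY_TO_COUNTRY = {"Bombay": "India", "Hyderabad": "India", "Paris": "France"}
COUNTRY_TO_NATIONALITY = {"India": "Indian", "France": "French"}

def infer_nationality(birth_place):
    """Return the nationality implied by a birth place, or None if it is not in the vocabulary."""
    # A birth place may be a city (mapped to its country) or already a country name.
    country = CITY_TO_COUNTRY.get(birth_place, birth_place)
    return COUNTRY_TO_NATIONALITY.get(country)

print(infer_nationality("Bombay"))
print(infer_nationality("Atlantis"))
```

Any place missing from the tables returns `None`, which shows how coverage of the vocabulary limits what such a system can extract.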
31. Types of output systems
1. Identify all mentions of named entities and their relations.
E.g., from a given corpus, extract all named entity relations.
2. Identify missing relations of a database.
E.g., given a database, extract the missing attributes of given entities from the corpus.
3. Link various entities within a database.
E.g., given a database, link two entities together with some relation extracted from the corpus.
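The second output type, filling in missing attributes of a database, might look like the sketch below. The dictionary layout and the `fill_missing` function are assumptions for illustration; the talk does not describe a concrete schema.

```python
def fill_missing(database, extracted_triples):
    """Return a copy of the database with missing (None or absent) attributes filled in.

    database: dict mapping entity name -> dict of attribute -> value
    extracted_triples: iterable of (entity, attribute, value) taken from the corpus
    """
    filled = {name: dict(attrs) for name, attrs in database.items()}
    for entity, attribute, value in extracted_triples:
        record = filled.get(entity)
        # Only known entities are updated, and existing values are never overwritten.
        if record is not None and record.get(attribute) is None:
            record[attribute] = value
    return filled

database = {"Sachin Tendulkar": {"born_in": None, "sport": "cricket"}}
triples = [("Sachin Tendulkar", "born_in", "Bombay"),
           ("Sachin Tendulkar", "sport", "tennis")]
print(fill_missing(database, triples))
```

Keeping existing values untouched is a deliberate choice here: extracted relations are noisy, so they only fill gaps rather than replace stored facts.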
32. Working of automated extraction systems
[Diagram: Input sources -> Extraction system (defining output structures, input pre-processing, extraction methods, output processing) -> Database of all facts, relations]
33. Issues with automated extraction
Accuracy, running time, dependency
34. Issue 1: Challenges of language structure
Co-reference resolution
Ambiguous, complex sentences
Abbreviations
Acronyms
35. See an example…
“Tom called his father last night. They talked for an hour. He said he would be home the next day.”
What is 'He' referring to?
Tom or his father?
36. “You see sir, I can talk English, I can walk English, I can laugh English, I can run English, because English is such a funny language.”
Amitabh in Namak Halal
37. Issue 2: Accuracy
Named entity detection: 90%, relationship extraction: 50-70%.
Introduction of noise at each step.
E.g., disambiguation of the word 'crane' with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
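A back-of-the-envelope way to see how noise accumulates: if each step's errors were independent, the accuracy of the whole pipeline would be roughly the product of the per-step accuracies. Both the independence assumption and the stage figures below are hypothetical and chosen only to illustrate the effect; they are not measurements from the talk.

```python
import math

def end_to_end_accuracy(stage_accuracies):
    """Multiply per-stage accuracies, assuming stage errors are independent."""
    return math.prod(stage_accuracies)

# Hypothetical per-stage accuracies, chosen only to illustrate the effect.
stages = {
    "tokenizing and tagging": 0.95,
    "named entity detection": 0.90,
    "sense disambiguation": 0.85,
}
overall = end_to_end_accuracy(stages.values())
print(f"{overall:.2f}")
```

Even with every individual step at 85% or better, the combined figure in this example drops to about 0.73.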
38. Issue 3: Efficiency
Feature detection steps are expensive.
They can require days of computation.
39. Issue 4: Dependency
Dependency on external vocabulary sources, like Wikipedia, WordNet, MindNet etc.
Maintenance and updating of vocabulary sources is manual: costly and requires expertise.
Limited size produces context-based noise
Domain-dependent: medical domain
Corpus-dependent: Wikipedia, news corpus
Relation-specific: Date and Place-of-event
40. Issue 5: Problem with implicit knowledge extraction
Community knowledge is learned and shared
No one can be an expert.
Cultural competence and perception of workers are fed into a system as variables.
Cultural Consensus Theory provides models to include such variables in the system.
41. Can we do better?
Can we seek human intelligence to improve
the accuracy of automated techniques?
42. References
[1] I. Tuomi. Data is more than knowledge: implications of the reversed knowledge hierarchy for knowledge management and organizational memory. J. Manage. Inf. Syst., 16(3):103–117, Dec. 1999.
[2] S. Sekine. Named Entity: History and Future. 2004.
[3] S. Sarawagi. Information extraction. Found. Trends databases, 1(3):261–377, Mar. 2008.
[4] S. C. Weller. Cultural consensus theory: Applications and frequently asked questions. Field Methods, 19(4):339–368, 2007.
43. References (contd.)
[5] Z. Syed, E. Viegas, and S. Parastatidis. Automatic discovery of semantic relations using MindNet. LREC, 2010.
[6] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.
[7] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., pages 40–48, 2006.
[8] E. Greengrass. Information retrieval: A survey, 2000.
The definition of knowledge is a matter of ongoing debate among philosophers, but for this talk I have taken this definition from Wikipedia.
Predicting market: e.g., to predict whether people like Lux soap or not. Community advertisements: e.g., advertising a Bengali concert to the Bengali community in Hyderabad.
Scarcity is not the issue, but abundance is! It is easy for humans to understand the meaning lying in different documents, but it becomes difficult for a user to find a document of interest.
Manual extraction: too much labour, time-consuming, prone to bias.
Automated extraction: for huge data, an intelligent way is to formulate an algorithm that performs the repetitive computation, using systems instead of manual labour. It is less time-consuming. This is what I talk about in this presentation.
Combined approach: I consider it to be more appropriate. It combines the advantages of both systems and humans: scalability and accuracy from systems, and intelligence from humans. In my thesis, I have opted for this approach. I am not talking about it today; I will cover it in a later presentation.
Automated systems are systems built over some algorithms. Automation: the use of methods for controlling industrial processes automatically, especially by electronically controlled systems, often reducing manpower.
Broad overview of how the system works. According to me, these are the five main components.
The type of extraction method depends on the application. A highly sophisticated system can achieve a maximum of 70% accuracy. The accuracy of automated techniques cannot surpass human intelligence.