How Wikipedia serves as a fantastic source for extracting semantic world knowledge, and how this is another example of the power of big data overcoming the knowledge acquisition bottleneck...
2. Doug Lenat
“Intelligence is 10 million rules…”
Cyc, 1984
(#$genls #$Tree-ThePlant #$Plant)
(#$implies
  (#$and
    (#$isa ?OBJ ?SUBSET)
    (#$genls ?SUBSET ?SUPERSET))
  (#$isa ?OBJ ?SUPERSET))
…an oak is a plant
Predicted to complete in 10 years.
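To make the rule concrete, here is a minimal Python sketch of the isa/genls inference it encodes; the collection names and the individual "OakTree001" are made up for the example, and this is of course not Cyc's actual inference engine.

```python
# Minimal sketch (not Cyc's engine) of the isa/genls rule above.
# genls links a collection to a more general collection; isa links an
# individual to a collection. Names below are hypothetical.
genls = {
    "Oak": "Tree-ThePlant",       # every oak is a tree
    "Tree-ThePlant": "Plant",     # every tree is a plant
}
isa = {"OakTree001": "Oak"}       # one specific oak

def holds_isa(obj, superset):
    """Apply (isa ?OBJ ?SUBSET) + (genls ?SUBSET ?SUPERSET) => (isa ?OBJ ?SUPERSET) repeatedly."""
    collection = isa.get(obj)
    while collection is not None:
        if collection == superset:
            return True
        collection = genls.get(collection)
    return False

print(holds_isa("OakTree001", "Plant"))   # True: "an oak is a plant"
```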
3. Cyc Today
Can make impressive inferences, such as:
• You have to be awake to eat
• You cannot remember events that have not happened yet
• If you cut a lump of peanut butter in half, each half is also a
lump of peanut butter; if you cut a table in half, neither half
is a table
• When people die, they stay dead
But after 30 years and 700 man-years, only 2M+
rules…
What went wrong?
6. Machine Translation
Rule-Based Machine Translation (1970s):
• Dictionary for both languages
• Rules representing language structure
• Parsing sentences to find structure
• Mapping between structures
Built by human experts, accumulating rules over
time.
Rules end up conflicting and ambiguous
תפוח אוכל ילד (word for word: "apple eats boy")
Object-verb-subject
Boy eats apple
Subject-verb-object
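A toy illustration of the rule-based recipe applied to the example above: a bilingual dictionary plus a hand-written reordering rule. The three-word dictionary and the single OVS-to-SVO rule are invented for this sketch.

```python
# Toy rule-based transfer (illustrative only): dictionary lookup plus a
# hand-written reordering rule for the Hebrew OVS example above.
dictionary = {"תפוח": "apple", "אוכל": "eats", "ילד": "boy"}

def translate_ovs_sentence(sentence):
    """Assume the source parse is object-verb-subject; emit English subject-verb-object."""
    obj, verb, subj = sentence.split()            # "parsing": whitespace split + assumed order
    return " ".join(dictionary[w] for w in (subj, verb, obj))

print(translate_ovs_sentence("תפוח אוכל ילד"))    # boy eats apple
```

Of course, the same surface string could also be parsed as subject-verb-object, which is exactly the kind of ambiguity and rule conflict the slide points out.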
7. Machine Translation
Statistical Translation (1990s):
• Massive bilingual corpora
• Corpus alignment
• Calculate the probability that a word in the 1st language translates to a word in the 2nd language (a toy sketch follows below)
• Use n-gram models to take context into account
Franz Och
Built by data scientists, no linguists needed
Improves as more data gets added
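A toy sketch of the statistical idea, in the spirit of IBM Model 1 word alignment: given sentence-aligned pairs, word translation probabilities are estimated with EM, with no linguistic rules at all. The three-sentence corpus and the numbers are made up.

```python
from collections import defaultdict

# Tiny aligned corpus (made up): each pair is (source sentence, target sentence).
corpus = [
    ("la maison".split(),  "the house".split()),
    ("la fleur".split(),   "the flower".split()),
    ("une maison".split(), "a house".split()),
]

# Initialize t(target | source) uniformly.
src_vocab = {s for src, _ in corpus for s in src}
tgt_vocab = {t for _, tgt in corpus for t in tgt}
t = {(tg, s): 1.0 / len(tgt_vocab) for s in src_vocab for tg in tgt_vocab}

for _ in range(20):                      # EM iterations
    count = defaultdict(float)           # expected co-occurrence counts c(target, source)
    total = defaultdict(float)           # expected counts per source word
    for src, tgt in corpus:
        for tg in tgt:
            norm = sum(t[(tg, s)] for s in src)
            for s in src:
                delta = t[(tg, s)] / norm
                count[(tg, s)] += delta
                total[s] += delta
    for (tg, s), c in count.items():     # M-step: re-normalize
        t[(tg, s)] = c / total[s]

print(round(t[("house", "maison")], 2))  # high (close to 1) after training
```

An n-gram language model over the target side would then be combined with these translation probabilities to pick fluent output; that part is omitted here.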
8. Encyclopedia?
Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia...
…There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.
Doug Lenat, 1985
11. YAGO
“Yet Another Great Ontology”, 2007, MPI
10M entities, 120M facts
http://en.wikipedia.org/wiki/Albert_Einstein
(AlbertEinstein, bornInYear, 1879)
(AlbertEinstein, hasWonPrize, NobelPrize)
(AlbertEinstein, isA, Physicist)
Uses the curated WordNet ontology and extends it with Wikipedia entities
E.g. Albert Einstein is a Person
13. YAGO
Knowledge acquisition:
Work started in 2006
2007: 1M entities, 5M facts
2012: 10M entities, 120M facts
Now adding places
Data export
Query over SPARQL
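A small sketch of storing and querying such triples, using the example facts from the YAGO slide above. The http://example.org/ namespace and the Python rdflib library are used for illustration only; they are not YAGO's actual schema or tooling.

```python
from rdflib import Graph, Namespace, Literal

# Illustrative namespace, not YAGO's real URI scheme.
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.AlbertEinstein, EX.bornInYear, Literal(1879)))
g.add((EX.AlbertEinstein, EX.hasWonPrize, EX.NobelPrize))
g.add((EX.AlbertEinstein, EX.isA, EX.Physicist))

# SPARQL: who has won the Nobel Prize, and when were they born?
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?year WHERE {
    ?person ex:hasWonPrize ex:NobelPrize .
    ?person ex:bornInYear ?year .
}
"""
for person, year in g.query(query):
    print(person, year)   # http://example.org/AlbertEinstein 1879
```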
14. DBpedia
Created an ontology from scratch
Crowdsourced the rule definition and mining
More coverage, but a less coherent model and structure
2.3M entities, 400M facts
Uses the YAGO ontology as part of its resources
Data export and SPARQL queries
15. ESA
Explicit Semantic Analysis
Prof. Shaul Markovitch, Dr. Evgeniy
Gabrilovich and yours truly
The name is a pun on Latent Semantic
Analysis (LSA) – a quick context recap
follows…
16. Latent Semantic Analysis
Technique to find “hidden” semantic relations
between groups of terms in documents
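A minimal sketch of the LSA idea on a made-up term-document matrix: a truncated SVD projects terms into a low-dimensional "latent" space, where terms that never co-occur in the same document can still end up close because they occur in similar contexts.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents); data is made up.
terms = ["car", "automobile", "engine", "flower"]
X = np.array([
    [2, 0, 0],   # car        (only in doc 0)
    [0, 2, 0],   # automobile (only in doc 1)
    [1, 1, 0],   # engine     (in docs 0 and 1)
    [0, 0, 3],   # flower     (only in doc 2)
], dtype=float)

# Truncated SVD: keep the k strongest latent dimensions.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :k] * S[:k]      # terms embedded in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" never share a document (raw cosine is 0), yet their
# latent vectors are almost identical because both co-occur with "engine".
print(cosine(term_vectors[0], term_vectors[1]))
```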
17. ESA
Wikipedia articles are clear, coherent
and universal semantic concepts
Panther
Article words are associated with the concept (TF.IDF):
• Cat [0.92]
• Leopard [0.84]
• Roar [0.77]
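A toy sketch of the ESA representation on a made-up three-article "Wikipedia": a word (or longer text) is mapped to a vector of TF.IDF weights over the concepts, and relatedness is the cosine between such vectors. The article texts and resulting numbers are invented; real ESA uses the full Wikipedia and an inverted index.

```python
import math
from collections import Counter

# A made-up miniature "Wikipedia": concept name -> article text.
articles = {
    "Panther": "panther big cat leopard jaguar roar predator",
    "Cat":     "cat feline pet purr whiskers predator",
    "Guitar":  "guitar strings music chord strum rock",
}

concepts = list(articles)
tokenized = {c: articles[c].split() for c in concepts}
df = Counter(w for toks in tokenized.values() for w in set(toks))  # document frequency

def concept_vector(text):
    """ESA-style interpretation vector: TF.IDF weight of the text's words in each article."""
    words = text.lower().split()
    vec = []
    for c in concepts:
        toks = tokenized[c]
        score = 0.0
        for w in words:
            tf = toks.count(w) / len(toks)
            idf = math.log(len(concepts) / df[w]) if df[w] else 0.0
            score += tf * idf
        vec.append(score)
    return vec

def relatedness(a, b):
    va, vb = concept_vector(a), concept_vector(b)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

print(relatedness("leopard", "cat"))     # > 0: both activate the Panther concept
print(relatedness("leopard", "chord"))   # 0: no shared concepts
```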
23. References
Cyc:
Lenat et al., CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985
Cycorp: http://www.cyc.com/
YAGO:
Suchanek et al., YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
YAGO at the Max Planck Institute: http://www.mpi-inf.mpg.de/yago-naga/yago/
ESA:
E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006
E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007
Egozi et al., Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011
Others:
Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007
Erdmann et al., An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008
Potthast et al., A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008
Editor's Notes
Lenat actually explained that Cyc would solve the bottleneck by shifting it from the entry process itself to deciding what data to enter. Compared to entering specific rules, entering facts and generalized rules is certainly better, but it is still manual.
Fast forward 20 years…
There were quite a few efforts to use this wealth of information; I'll speak about one that was quite impressive in its breadth and comparable to Cyc.