How Wikipedia serves as a fantastic source for extracting semantic world knowledge, and how this is another example of the power of big data overcoming the knowledge acquisition bottleneck...
2. Doug Lenat
“Intelligence is 10 million rules…”
Cyc, 1984
(#$genls #$Tree-ThePlant #$Plant)
(#$implies
  (#$and
    (#$isa ?OBJ ?SUBSET)
    (#$genls ?SUBSET ?SUPERSET))
  (#$isa ?OBJ ?SUPERSET))
…an oak is a plant
Predicted to complete in 10 years.
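To make the rule concrete, here is a minimal Python sketch of the isa/genls inference it encodes; the collection names and the individual "OakTree001" are made up for the example, and this is of course not Cyc's actual inference engine.

```python
# Minimal sketch (not Cyc's engine) of the isa/genls rule above.
# genls links a collection to a more general collection; isa links an
# individual to a collection. Names below are hypothetical.
genls = {
    "Oak": "Tree-ThePlant",       # every oak is a tree
    "Tree-ThePlant": "Plant",     # every tree is a plant
}
isa = {"OakTree001": "Oak"}       # one specific oak

def holds_isa(obj, superset):
    """Apply (isa ?OBJ ?SUBSET) + (genls ?SUBSET ?SUPERSET) => (isa ?OBJ ?SUPERSET) repeatedly."""
    collection = isa.get(obj)
    while collection is not None:
        if collection == superset:
            return True
        collection = genls.get(collection)
    return False

print(holds_isa("OakTree001", "Plant"))   # True: "an oak is a plant"
```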
3. Cyc Today
Can make impressive inferences, such as:
• You have to be awake to eat
• You cannot remember events that have not happened yet
• If you cut a lump of peanut butter in half, each half is also a
lump of peanut butter; if you cut a table in half, neither half
is a table
• When people die, they stay dead
But after 30 years and 700 man-years, only 2M+
rules…
What went wrong?
6. Machine Translation
Rule-Based Machine Translation (1970s):
• Dictionary for both languages
• Rules representing language structure
• Parsing sentences to find structure
• Mapping between structures
Built by human experts, accumulating rules over
time.
Rules end up conflicting and ambiguous
תפוח אוכל ילד (word for word: "apple eats boy")
Object-verb-subject
Boy eats apple
Subject-verb-object
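A toy illustration of the rule-based recipe applied to the example above: a bilingual dictionary plus a hand-written reordering rule. The three-word dictionary and the single OVS-to-SVO rule are invented for this sketch.

```python
# Toy rule-based transfer (illustrative only): dictionary lookup plus a
# hand-written reordering rule for the Hebrew OVS example above.
dictionary = {"תפוח": "apple", "אוכל": "eats", "ילד": "boy"}

def translate_ovs_sentence(sentence):
    """Assume the source parse is object-verb-subject; emit English subject-verb-object."""
    obj, verb, subj = sentence.split()            # "parsing": whitespace split + assumed order
    return " ".join(dictionary[w] for w in (subj, verb, obj))

print(translate_ovs_sentence("תפוח אוכל ילד"))    # boy eats apple
```

Of course, the same surface string could also be parsed as subject-verb-object, which is exactly the kind of ambiguity and rule conflict the slide points out.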
7. Machine Translation
Statistical Translation (1990s):
• Massive bilingual corpora
• Corpus alignment
• Calculate the probability that a word in the 1st language translates to a word in the 2nd language (a toy sketch follows below)
• Use n-gram models to take context into account
Franz Och
Built by data scientists, no linguists needed
Improves as more data gets added
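A toy sketch of the statistical idea, in the spirit of IBM Model 1 word alignment: given sentence-aligned pairs, word translation probabilities are estimated with EM, with no linguistic rules at all. The three-sentence corpus and the numbers are made up.

```python
from collections import defaultdict

# Tiny aligned corpus (made up): each pair is (source sentence, target sentence).
corpus = [
    ("la maison".split(),  "the house".split()),
    ("la fleur".split(),   "the flower".split()),
    ("une maison".split(), "a house".split()),
]

# Initialize t(target | source) uniformly.
src_vocab = {s for src, _ in corpus for s in src}
tgt_vocab = {t for _, tgt in corpus for t in tgt}
t = {(tg, s): 1.0 / len(tgt_vocab) for s in src_vocab for tg in tgt_vocab}

for _ in range(20):                      # EM iterations
    count = defaultdict(float)           # expected co-occurrence counts c(target, source)
    total = defaultdict(float)           # expected counts per source word
    for src, tgt in corpus:
        for tg in tgt:
            norm = sum(t[(tg, s)] for s in src)
            for s in src:
                delta = t[(tg, s)] / norm
                count[(tg, s)] += delta
                total[s] += delta
    for (tg, s), c in count.items():     # M-step: re-normalize
        t[(tg, s)] = c / total[s]

print(round(t[("house", "maison")], 2))  # high (close to 1) after training
```

An n-gram language model over the target side would then be combined with these translation probabilities to pick fluent output; that part is omitted here.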
8. Encyclopedia?
Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia...
…There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.
Doug Lenat, 1985
11. YAGO
“Yet Another Great Ontology”, 2007, MPI
10M entities, 120M facts
http://en.wikipedia.org/wiki/Albert_Einstein
(AlbertEinstein, bornInYear, 1879)
(AlbertEinstein, hasWonPrize, NobelPrize)
(AlbertEinstein, isA, Physicist)
Uses the curated WordNet ontology and extends it with Wikipedia entities
E.g. Albert Einstein is a Person
13. YAGO
Knowledge acquisition:
Work started in 2006
2007: 1M entities, 5M facts
2012: 10M entities, 120M facts
Now adding places
Data export
Query over SPARQL
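A small sketch of storing and querying such triples, using the example facts from the YAGO slide above. The http://example.org/ namespace and the Python rdflib library are used for illustration only; they are not YAGO's actual schema or tooling.

```python
from rdflib import Graph, Namespace, Literal

# Illustrative namespace, not YAGO's real URI scheme.
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.AlbertEinstein, EX.bornInYear, Literal(1879)))
g.add((EX.AlbertEinstein, EX.hasWonPrize, EX.NobelPrize))
g.add((EX.AlbertEinstein, EX.isA, EX.Physicist))

# SPARQL: who has won the Nobel Prize, and when were they born?
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?year WHERE {
    ?person ex:hasWonPrize ex:NobelPrize .
    ?person ex:bornInYear ?year .
}
"""
for person, year in g.query(query):
    print(person, year)   # http://example.org/AlbertEinstein 1879
```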
14. DBpedia
Created an ontology from scratch
Crowdsourced the rule definition and mining
More coverage, but a less coherent model and structure
2.3M entities, 400M facts
Uses the YAGO ontology as part of its resources
Data export and SPARQL queries
15. ESA
Explicit Semantic Analysis
Prof. Shaul Markovitch, Dr. Evgeniy
Gabrilovich and yours truly
The name is a pun on Latent Semantic
Analysis (LSA) – a quick context recap
follows…
16. Latent Semantic Analysis
Technique to find “hidden” semantic relations
between groups of terms in documents
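A minimal sketch of the LSA idea on a made-up term-document matrix: a truncated SVD projects terms into a low-dimensional "latent" space, where terms that never co-occur in the same document can still end up close because they occur in similar contexts.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents); data is made up.
terms = ["car", "automobile", "engine", "flower"]
X = np.array([
    [2, 0, 0],   # car        (only in doc 0)
    [0, 2, 0],   # automobile (only in doc 1)
    [1, 1, 0],   # engine     (in docs 0 and 1)
    [0, 0, 3],   # flower     (only in doc 2)
], dtype=float)

# Truncated SVD: keep the k strongest latent dimensions.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :k] * S[:k]      # terms embedded in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" never share a document (raw cosine is 0), yet their
# latent vectors are almost identical because both co-occur with "engine".
print(cosine(term_vectors[0], term_vectors[1]))
```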
17. ESA
Wikipedia articles are clear, coherent
and universal semantic concepts
Panther
Article words are associated with the concept (TF.IDF):
• Cat [0.92]
• Leopard [0.84]
• Roar [0.77]
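A toy sketch of the ESA representation on a made-up three-article "Wikipedia": a word (or longer text) is mapped to a vector of TF.IDF weights over the concepts, and relatedness is the cosine between such vectors. The article texts and resulting numbers are invented; real ESA uses the full Wikipedia and an inverted index.

```python
import math
from collections import Counter

# A made-up miniature "Wikipedia": concept name -> article text.
articles = {
    "Panther": "panther big cat leopard jaguar roar predator",
    "Cat":     "cat feline pet purr whiskers predator",
    "Guitar":  "guitar strings music chord strum rock",
}

concepts = list(articles)
tokenized = {c: articles[c].split() for c in concepts}
df = Counter(w for toks in tokenized.values() for w in set(toks))  # document frequency

def concept_vector(text):
    """ESA-style interpretation vector: TF.IDF weight of the text's words in each article."""
    words = text.lower().split()
    vec = []
    for c in concepts:
        toks = tokenized[c]
        score = 0.0
        for w in words:
            tf = toks.count(w) / len(toks)
            idf = math.log(len(concepts) / df[w]) if df[w] else 0.0
            score += tf * idf
        vec.append(score)
    return vec

def relatedness(a, b):
    va, vb = concept_vector(a), concept_vector(b)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

print(relatedness("leopard", "cat"))     # > 0: both activate the Panther concept
print(relatedness("leopard", "chord"))   # 0: no shared concepts
```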
23. References
Cyc:
Lenat et al., CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985
Cycorp: http://www.cyc.com/
YAGO:
Suchanek et al., YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
YAGO at the Max Planck Institute: http://www.mpi-inf.mpg.de/yago-naga/yago/
ESA:
E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006
E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007
Egozi et al., Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011
Others:
Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007
Erdmann et al., An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008
Potthast et al., A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008
Editor's Notes
Lenat actually explained that Cyc would solve the bottleneck by shifting it from the entry process itself to deciding what data to enter. Compared to entering specific rules, entering facts and generalized rules is certainly better, but it is still manual.
Fast forward 20 years…
There were quite a few efforts to use this wealth of information; I'll speak about one that was quite impressive in its breadth and comparable to Cyc.