Lenat actually explained that Cyc would solve the bottleneck by moving it to the decision of what data to enter, rather than the entry process itself. Compared to entering specific rules, entering facts and generalized rules is certainly better, but it is still manual.
Fast forward 20 years…
There were quite a few efforts to use this wealth of information; I'll talk about one that was quite impressive in its breadth and comparable to Cyc.
Extracting Meaning from Wikipedia
Doug Lenat: "Intelligence is 10 million rules…"

Cyc, 1984:

(#$genls #$Tree-ThePlant #$Plant)
(#$implies
  (#$and
    (#$isa ?OBJ ?SUBSET)
    (#$genls ?SUBSET ?SUPERSET))
  (#$isa ?OBJ ?SUPERSET))

…an oak is a plant

Predicted to complete in 10 years.
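The two CycL assertions above encode type membership (#$isa) and subset relations (#$genls), and the rule lets membership propagate up the subset chain. A minimal sketch of that inference in Python follows; the knowledge base and the entity name Oak-TheTree are invented for illustration, not taken from Cyc:

```python
# Toy knowledge base illustrating the Cyc rule above:
# if X isa S, and S genls P (S is a subset of P), then X isa P.
genls = {"Tree-ThePlant": "Plant", "Plant": "Organism"}  # subset -> superset
isa = {"Oak-TheTree": "Tree-ThePlant"}                   # instance -> type

def all_types(obj):
    """Return every collection obj belongs to, climbing the genls chain."""
    types = []
    t = isa.get(obj)
    while t is not None:
        types.append(t)
        t = genls.get(t)  # apply the implies-rule transitively
    return types

print(all_types("Oak-TheTree"))  # ['Tree-ThePlant', 'Plant', 'Organism']
```

This is exactly the "an oak is a plant" conclusion: it is never asserted directly, only derived.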
Cyc Today

Can make impressive inferences, such as:
• You have to be awake to eat
• You cannot remember events that have not happened yet
• If you cut a lump of peanut butter in half, each half is also a lump of peanut butter; if you cut a table in half, neither half is a table
• When people die, they stay dead

But after 30 years and 700 man-years, only 2M+ rules…
What went wrong?
Machine Translation

Rule-Based Machine Translation (1970s):
• Dictionary for both languages
• Rules representing language structure
• Parsing sentences to find structure
• Mapping between structures

Built by human experts, accumulating rules over time.
Rules end up conflicting and ambiguous.

Example: "Boy eats apple" must be mapped between a subject-verb-object structure and an object-verb-subject structure.
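The pipeline above (dictionary lookup, then a structural mapping rule) can be sketched in a few lines. The source-language words below are invented placeholders for a hypothetical object-verb-subject language; real RBMT systems needed thousands of such rules, which is where the conflicts arose:

```python
# Toy rule-based translation: a bilingual dictionary plus one structure rule
# mapping object-verb-subject (source) to subject-verb-object (English).
# The source-language words are made up for the example.
dictionary = {"apfelx": "apple", "essenx": "eats", "jungex": "boy"}

def translate_ovs_to_svo(sentence):
    # Step 1: dictionary lookup for each word
    words = [dictionary[w] for w in sentence.split()]
    # Step 2: structural rule, OVS (obj, verb, subj) -> SVO (subj, verb, obj)
    obj, verb, subj = words
    return f"{subj} {verb} {obj}"

print(translate_ovs_to_svo("apfelx essenx jungex"))  # boy eats apple
```

A single rule works here; ambiguity appears as soon as two rules can both match a parsed sentence.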
Machine Translation

Statistical Translation (1990s):
• Massive bilingual corpora
• Corpus alignment
• Calculate the probability that a word in the 1st language matches a word in the 2nd language
• Use n-grams to build models that take context into account

Franz Och

Built by data scientists, no linguists needed.
Improves as more data gets added.
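The word-matching probability in the bullets above can be approximated very crudely by counting co-occurrences across aligned sentence pairs and normalizing. This is only a sketch of the idea, far simpler than the actual alignment models used in statistical MT, and the tiny corpus is invented:

```python
from collections import Counter, defaultdict

# Sketch: estimate p(target word | source word) from aligned sentence pairs
# by counting co-occurrences. The three-pair "corpus" is illustrative only.
corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("maison bleue", "blue house"),
]

cooc = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1  # s and t appeared in the same aligned pair

def p(target, source):
    counts = cooc[source]
    return counts[target] / sum(counts.values())

print(p("house", "maison"))  # 0.5: "house" wins over "the" and "blue"
```

The key property is the last bullet of the slide in miniature: adding more aligned pairs sharpens the probabilities with no linguist in the loop.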
Encyclopedia?

"Asymptotic goal: Enter 'the world's most general knowledge,' down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia… There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles… A conservative estimate for the data enterers' rate is one paragraph per day; this would make their total effort about 150 man-years."

Doug Lenat, 1985
YAGO

"Yet Another Great Ontology", 2007, MPI
10M entities, 120M facts

http://en.wikipedia.org/wiki/Albert_Einstein
(AlbertEinstein, bornInYear, 1879)
(AlbertEinstein, hasWonPrize, NobelPrize)
(AlbertEinstein, isA, Physicist)

Uses the WordNet curated ontology, and expands it into Wikipedia entities.
E.g. Albert Einstein is a Person.
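YAGO stores knowledge as (subject, predicate, object) triples like the ones above. A minimal sketch of such a triple store, with a wildcard query loosely in the spirit of a SPARQL pattern match (the subClassOf fact is added here to show the "Einstein is a Person" step; it is not copied from YAGO's actual schema):

```python
# Minimal triple store over (subject, predicate, object) facts,
# echoing the YAGO examples above; None acts as a wildcard in query().
facts = {
    ("AlbertEinstein", "bornInYear", "1879"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
    ("Physicist", "subClassOf", "Person"),  # illustrative ontology link
}

def query(s=None, p=None, o=None):
    """Return all facts matching the pattern; None matches anything."""
    return sorted((fs, fp, fo) for fs, fp, fo in facts
                  if (s is None or fs == s)
                  and (p is None or fp == p)
                  and (o is None or fo == o))

print(query(s="AlbertEinstein", p="isA"))
# [('AlbertEinstein', 'isA', 'Physicist')]
```

Chaining the isA result through subClassOf yields "Albert Einstein is a Person", the WordNet-backed inference the slide mentions.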
YAGO

Knowledge acquisition:
• Work started in 2006
• 2007: 1M entities, 5M facts
• 2012: 10M entities, 120M facts
• Now adding places

Data export
Query over SPARQL
DBpedia

• Created an ontology from scratch
• Crowdsourced the rule definition and mining
• More coverage, but a less coherent model and structure
• 2.3M entities, 400M facts
• Uses the YAGO ontology as part of its resources
• Data export, and SPARQL queries
ESA

Explicit Semantic Analysis
Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly
The name is a pun on Latent Semantic Analysis (LSA); a quick context recap follows…
Latent Semantic Analysis

A technique to find "hidden" semantic relations between groups of terms in documents
ESA

Wikipedia articles are clear, coherent and universal semantic concepts.
Article words are associated with the concept (TF.IDF), e.g. for Panthera:
Cat [0.92]
Leopard [0.84]
Roar [0.77]
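The TF.IDF association above can be sketched directly: term frequency of the word inside the article, times an inverse-document-frequency penalty for words common across many articles. The miniature "articles" below are invented stand-ins, and the resulting numbers are not the slide's weights:

```python
import math

# Sketch of ESA's word-to-concept weighting via TF-IDF.
# Each "article" here is a tiny invented bag of words.
articles = {
    "Panthera": "cat leopard roar cat",
    "Mouse_(computing)": "button click cat cursor",
    "John_Steinbeck": "novel author",
}

def tfidf(word, article):
    words = articles[article].split()
    tf = words.count(word) / len(words)               # term frequency
    df = sum(1 for text in articles.values() if word in text.split())
    idf = math.log(len(articles) / df) if df else 0.0  # rarity across articles
    return tf * idf

# "cat" is strongly associated with Panthera, weakly with Mouse_(computing)
print(tfidf("cat", "Panthera") > tfidf("cat", "Mouse_(computing)"))  # True
```

In real ESA these weights come from the full Wikipedia corpus, so each word ends up scored against millions of concepts.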
ESA

The semantics of a word is the vector of its associations with Wikipedia concepts, e.g. for "cat":
Cat [0.95]
Panthera [0.92]
Jane Fonda [0.07]
ESA

The semantics of a text fragment is the average vector (centroid) of the semantics of its words, e.g.:

"button": Button [0.93], Dick Button [0.84], Mouse (computing) [0.81], Game Controller [0.32]
"mouse": Mouse (rodent) [0.91], Mouse (computing) [0.84], Mickey Mouse [0.81], John Steinbeck [0.17]
"mouse button": Mouse (computing) [0.95], Drag-and-drop [0.91], Game Controller [0.64], Mouse (rodent) [0.56]
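The centroid computation is simple once each word has a sparse concept vector. The sketch below averages the per-word vectors for "mouse button", using weights echoing the slide (the exact centroid values are just what this toy input produces, not the slide's numbers, which come from the full ESA model):

```python
from collections import defaultdict

# Sketch of ESA fragment semantics: the centroid of the words' sparse
# concept vectors. Per-word weights are illustrative, from the slide.
word_vectors = {
    "mouse":  {"Mouse_(rodent)": 0.91, "Mouse_(computing)": 0.84,
               "Mickey_Mouse": 0.81, "John_Steinbeck": 0.17},
    "button": {"Button": 0.93, "Dick_Button": 0.84,
               "Mouse_(computing)": 0.81, "Game_Controller": 0.32},
}

def centroid(text):
    """Average the sparse concept vectors of the words in text."""
    total = defaultdict(float)
    words = text.split()
    for w in words:
        for concept, weight in word_vectors[w].items():
            total[concept] += weight / len(words)
    return dict(total)

vec = centroid("mouse button")
# "Mouse_(computing)" is reinforced by both words, so it rises to the top,
# disambiguating "mouse" toward the computing sense.
print(max(vec, key=vec.get))  # Mouse_(computing)
```

This is the disambiguation effect visible on the slide: concepts supported by only one word (Mickey Mouse, Dick Button) are diluted, while the shared concept dominates.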
Uses of ESA

• Text Categorization
• Semantic Relatedness
• Information Retrieval
More semantic projects

• Word-sense disambiguation
• Multi-lingual dictionary from language links
• Cross-lingual search (Cross-Lingual-ESA)
• WikiData
References

Cyc:
• Lenat et al., "CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks," AI Magazine Vol. 6 No. 4, 1985
• Cycorp: http://www.cyc.com/

YAGO:
• Suchanek et al., "YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia," WWW 2007
• YAGO at the Max-Planck-Institut: http://www.mpi-inf.mpg.de/yago-naga/yago/

ESA:
• E. Gabrilovich and S. Markovitch, "Enhancing Text Categorization with Encyclopedic Knowledge," AAAI 2006
• E. Gabrilovich and S. Markovitch, "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis," IJCAI 2007
• Egozi et al., "Concept-Based Information Retrieval using Explicit Semantic Analysis," TOIS, 2011

Others:
• Rada Mihalcea, "Using Wikipedia for Automatic Word Sense Disambiguation," Proceedings of NAACL-HLT, 2007
• Erdmann et al., "An Approach for Extracting Bilingual Terminology from Wikipedia," LNCS Vol. 4947, 2008
• Potthast et al., "A Wikipedia-Based Multilingual Retrieval Model," Advances in Information Retrieval, 2008