Extracting Meaning from Wikipedia


How Wikipedia serves as a fantastic source for extracting semantic world knowledge, and how this is another example of the power of big data overcoming the knowledge acquisition bottleneck...

Published in: Technology, Education
  • Lenat actually explained that Cyc would solve the bottleneck by moving it to the decision of what data to enter, rather than the entry process itself. Compared to entering specific rules, entering facts and generalized rules is certainly better, but still manual.
  • Fast forward 20 years…
  • There were quite a few efforts to use this wealth of information; I’ll speak about one that was quite impressive in its breadth and comparable to Cyc.

    2. Doug Lenat: “Intelligence is 10 million rules…” Cyc, 1984.
       (#$genls #$Tree-ThePlant #$Plant)
       (#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET))
       …an oak is a plant. Predicted to complete in 10 years.
    3. Cyc Today. Can make impressive inferences, such as:
       • You have to be awake to eat
       • You cannot remember events that have not happened yet
       • If you cut a lump of peanut butter in half, each half is also a lump of peanut butter; if you cut a table in half, neither half is a table
       • When people die, they stay dead
       But after 30 years and 700 man-years, only 2M+ rules… What went wrong?
    4. Knowledge Acquisition
    5. Machine Translation. Rule-Based Machine Translation (1970s):
       • Dictionary for both languages
       • Rules representing language structure
       • Parsing sentences to find structure
       • Mapping between structures
       Built by human experts, accumulating rules over time. Rules end up conflicting and ambiguous.
       Subject-verb-object (“Boy eats apple”) must be mapped to object-verb-subject in the target language.
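The rule-based pipeline on this slide (dictionary lookup, parse to structure, map structures) can be sketched in a few lines. The two-word lexicon and the strict-SVO “parser” below are hypothetical toy assumptions, not how a real 1970s RBMT system was built:

```python
# Toy rule-based translation sketch: dictionary + one structure-mapping rule.
# The lexicon entries are illustrative placeholders.
LEXICON = {"boy": "junge", "eats": "isst", "apple": "apfel"}

def translate_svo(sentence, target_order=("S", "V", "O")):
    """Translate a strict subject-verb-object sentence word by word,
    then reorder according to the target language's structure rule."""
    words = sentence.lower().split()
    roles = dict(zip(("S", "V", "O"), words))               # "parse": assume SVO input
    translated = {r: LEXICON[w] for r, w in roles.items()}  # dictionary lookup
    return " ".join(translated[r] for r in target_order)    # structure mapping

print(translate_svo("Boy eats apple", target_order=("O", "V", "S")))
# → "apfel isst junge"
```

Real systems needed thousands of such rules, which is exactly where the conflicts and ambiguity the slide mentions crept in.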
    6. Machine Translation. Statistical Translation (1990s), Franz Och:
       • Massive bilingual corpora
       • Corpus alignment
       • Calculate the probability for a word in the 1st language to match a word in the 2nd language
       • Use n-grams to build models that take context into account
       Built by data scientists, no linguists needed. Improves as more data gets added.
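The statistical idea, estimate probabilities from counts instead of writing rules, can be shown with a minimal n-gram model. The two-sentence corpus is a toy stand-in for the massive corpora the slide describes:

```python
from collections import Counter

# Minimal bigram language model: probabilities come from corpus counts,
# not hand-written grammar rules. Corpus below is a toy example.
corpus = "the boy eats the apple . the boy eats the pear .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    """P(word | prev): maximum-likelihood estimate from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("boy", "the"))
# → 0.5  (2 of the 4 occurrences of "the" are followed by "boy")
```

Adding more sentences to `corpus` refines the estimates automatically, which is the “improves as more data gets added” property.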
    7. Encyclopedia? “Asymptotic goal: Enter ‘the world’s most general knowledge,’ down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia… There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles… A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.” Doug Lenat, 1985
    8. Wikipedia
    9. Un+Structured Data
    10. YAGO: “Yet Another Great Ontology”, 2007, MPI. 10M entities, 120M facts.
        http://en.wikipedia.org/wiki/Albert_Einstein
        (AlbertEinstein, bornInYear, 1879)
        (AlbertEinstein, hasWonPrize, NobelPrize)
        (AlbertEinstein, isA, Physicist)
        Uses the WordNet curated ontology, and expands it into Wikipedia entities. E.g., Albert Einstein is a Person.
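The (subject, predicate, object) triples on this slide can be held in a tiny in-memory store with pattern matching. This is just a sketch of the data model; real YAGO ships as RDF, and the `subclassOf` edge below is an illustrative WordNet-style addition:

```python
# Toy triple store in the YAGO (subject, predicate, object) style.
facts = {
    ("AlbertEinstein", "bornInYear", "1879"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
    ("Physicist", "subclassOf", "Person"),  # illustrative ontology edge
}

def query(subject=None, predicate=None, obj=None):
    """Return all facts matching the pattern; None acts as a wildcard."""
    return [(s, p, o) for (s, p, o) in facts
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

print(query("AlbertEinstein", "isA"))
# → [('AlbertEinstein', 'isA', 'Physicist')]
```

Chasing `isA` then `subclassOf` edges is what lets the system conclude that Albert Einstein is a Person.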
    11. YAGO
    12. YAGO knowledge acquisition:
        • Work started in 2006
        • 2007: 1M entities, 5M facts
        • 2012: 10M entities, 120M facts
        • Now adding places
        • Data export; query over SPARQL
    13. DBpedia:
        • Created an ontology from scratch
        • Crowdsourced the rule definition and mining
        • More coverage, but a less coherent model and structure
        • 2.3M entities, 400M facts
        • Uses the YAGO ontology as part of its resources
        • Data export, and SPARQL queries
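Querying such a dataset over SPARQL typically means sending the query as an HTTP parameter to a public endpoint. The endpoint URL and prefix below follow DBpedia's published conventions, but treat them as assumptions to verify against the live service; the sketch only builds the request URL and does not hit the network:

```python
import urllib.parse

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia endpoint (assumed)

def build_request_url(entity):
    """Build a GET URL for a SPARQL query listing facts about an entity."""
    query = f"""
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?p ?o WHERE {{ dbr:{entity} ?p ?o }} LIMIT 10
    """
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/json"})

url = build_request_url("Albert_Einstein")
print(url.startswith("https://dbpedia.org/sparql?query="))
# → True
```

Fetching that URL (e.g. with `urllib.request`) would return the matching (predicate, object) pairs as JSON.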
    14. ESA: Explicit Semantic Analysis. Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly. The name is a pun on Latent Semantic Analysis (LSA); a quick context recap follows…
    15. Latent Semantic Analysis: a technique to find “hidden” semantic relations between groups of terms in documents.
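For context, LSA's “hidden” relations come from a truncated SVD of a term-document matrix: terms that co-occur across documents end up close in the reduced space. The 3x3 count matrix below is a made-up toy example:

```python
import numpy as np

# LSA sketch: factor a term-document count matrix with SVD and keep the
# top-k singular values. Rows = terms, columns = documents (toy data).
terms = ["cat", "feline", "dog"]
X = np.array([[2.0, 0.0, 1.0],   # "cat"
              [1.0, 0.0, 1.0],   # "feline" co-occurs with "cat"
              [0.0, 2.0, 0.0]])  # "dog" appears in its own document

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # term coordinates in the latent space

def sim(a, b):
    """Cosine similarity of two terms in the reduced space."""
    va, vb = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "cat" and "feline" never share a literal string, yet land close together
print(sim("cat", "feline") > sim("cat", "dog"))
# → True
```

ESA's twist, on the next slides, is to make the dimensions explicit Wikipedia concepts instead of these opaque latent directions.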
    16. ESA: Wikipedia articles are clear, coherent and universal semantic concepts. Article words are associated with the concept (TF.IDF).
        Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
    17. ESA: The semantics of a word is the vector of its associations with Wikipedia concepts.
        Cat: Cat [0.95], Panthera [0.92], Jane Fonda [0.07]
    18. ESA: The semantics of a text fragment is the average vector (centroid) of the semantics of its words.
        button: Dick Button [0.84], Button [0.93], Game Controller [0.32], Mouse (computing) [0.81]
        mouse: Mouse (computing) [0.84], Mouse (rodent) [0.91], John Steinbeck [0.17], Mickey Mouse [0.81]
        mouse button: Drag-and-drop [0.91], Mouse (computing) [0.95], Mouse (rodent) [0.56], Game Controller [0.64]
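The centroid idea can be sketched directly, using a cut-down version of the slide's own word-to-concept weights (the numbers are the slide's illustrative values, not real TF.IDF scores):

```python
import math

# ESA sketch: a word's meaning is its association vector over Wikipedia
# concepts; a text fragment's meaning is the centroid of its word vectors.
WORD_CONCEPTS = {
    "mouse":  {"Mouse (rodent)": 0.91, "Mouse (computing)": 0.84,
               "Mickey Mouse": 0.81},
    "button": {"Button": 0.93, "Dick Button": 0.84,
               "Mouse (computing)": 0.81},
}

def text_vector(text):
    """Centroid of the concept vectors of the text's words."""
    words = text.lower().split()
    centroid = {}
    for w in words:
        for concept, weight in WORD_CONCEPTS.get(w, {}).items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid

def cosine(u, v):
    """Cosine similarity of two sparse concept vectors."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

vec = text_vector("mouse button")
# Disambiguation emerges: "Mouse (computing)" gets contributions from
# BOTH words, so it dominates the centroid of "mouse button".
print(max(vec, key=vec.get))
# → Mouse (computing)
```

The same `cosine` on two text vectors is what powers the semantic-relatedness use case on the next slide.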
    19. Uses of ESA:
        • Text Categorization
        • Semantic Relatedness
        • Information Retrieval
    20. More semantic projects:
        • Word-sense disambiguation
        • Multi-lingual dictionary from language links
        • Cross-lingual search (Cross-Lingual ESA)
        • WikiData
    21. Questions?
    22. References:
        • Cyc: Lenat et al., “CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks”, AI Magazine Vol. 6 No. 4, 1985
        • Cycorp: http://www.cyc.com/
        • YAGO: Suchanek et al., “YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia”, WWW 2007
        • YAGO at the Max-Planck-Institut: http://www.mpi-inf.mpg.de/yago-naga/yago/
        • ESA: E. Gabrilovich and S. Markovitch, “Enhancing Text Categorization with Encyclopedic Knowledge”, AAAI 2006
        • E. Gabrilovich and S. Markovitch, “Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis”, IJCAI 2007
        • Egozi et al., “Concept-Based Information Retrieval using Explicit Semantic Analysis”, TOIS, 2011
        • Others: Rada Mihalcea, “Using Wikipedia for Automatic Word Sense Disambiguation”, Proceedings of NAACL-HLT, 2007
        • Erdmann et al., “An Approach for Extracting Bilingual Terminology from Wikipedia”, LNCS Vol. 4947, 2008
        • Potthast et al., “A Wikipedia-Based Multilingual Retrieval Model”, Advances in Information Retrieval, 2008