Extracting Meaning from Wikipedia


How Wikipedia serves as a fantastic source for extracting semantic world knowledge, and how this is another example of the power of big data overcoming the knowledge acquisition bottleneck...

Published in: Technology, Education

  2. Doug Lenat
     “Intelligence is 10 million rules…”
     Cyc, 1984
     (#$genls #$Tree-ThePlant #$Plant)
     (#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET))
     … an oak is a plant
     Predicted to complete in 10 years.
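
The rule on this slide can be tried out in a few lines. The following is a minimal sketch, not Cyc's inference engine: it assumes facts are plain Python tuples, and the individual `OakInMyYard` and the `Oak` collection are hypothetical names added for illustration.

```python
from collections import defaultdict

def infer_isa(isa_facts, genls_facts):
    """Close isa facts under the slide's rule:
    isa(obj, sub) and genls(sub, sup) implies isa(obj, sup)."""
    supers = defaultdict(set)
    for sub, sup in genls_facts:
        supers[sub].add(sup)

    inferred = set(isa_facts)
    frontier = list(isa_facts)
    while frontier:
        obj, cls = frontier.pop()
        for sup in supers.get(cls, ()):
            fact = (obj, sup)
            if fact not in inferred:
                inferred.add(fact)
                frontier.append(fact)  # the new fact may trigger the rule again
    return inferred

# (#$genls #$Tree-ThePlant #$Plant) comes from the slide;
# "Oak" and "OakInMyYard" are hypothetical.
genls = {("Oak", "Tree-ThePlant"), ("Tree-ThePlant", "Plant")}
isa = {("OakInMyYard", "Oak")}

facts = infer_isa(isa, genls)
print(("OakInMyYard", "Plant") in facts)
```

Each new fact is pushed back onto the frontier, so the rule chains through as many `genls` links as exist.
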
  3. Cyc Today
     Can make impressive inferences, such as:
     • You have to be awake to eat
     • You cannot remember events that have not happened yet
     • If you cut a lump of peanut butter in half, each half is also a lump of peanut butter; if you cut a table in half, neither half is a table
     • When people die, they stay dead
     But after 30 years and 700 man-years, only 2M+ rules…
     What went wrong?
  4. Knowledge Acquisition
  5. Machine Translation
     Rule-Based Machine Translation (1970s):
     • Dictionary for both languages
     • Rules representing language structure
     • Parsing sentences to find structure
     • Mapping between structures
     Built by human experts, accumulating rules over time.
     Rules end up conflicting and ambiguous.
     [Figure labels: Subject-verb-object; “Boy eats apple”; Object-verb-subject]
  6. Machine Translation
     Statistical Translation (1990s):
     • Massive bilingual corpora
     • Corpus alignment
     • Calculate the probability for a word in the 1st language to match a word in the 2nd language
     • Use n-grams to build models that take context into account
     Franz Och
     Built by data scientists, no linguists needed.
     Improves as more data gets added.
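
To make the "probability for a word to match a word" bullet concrete, here is a minimal sketch using IBM Model 1, a classic word-alignment model trained with expectation-maximization. The choice of model, the three-sentence toy corpus, and the function name `ibm_model1` are assumptions made for illustration; the slide does not say which model was used, and this sketch omits the NULL word and the n-gram language model.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word)
    from sentence-aligned pairs, using EM as in IBM Model 1 (no NULL word)."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float, {(tw, sw): c / total[sw]
                                for (tw, sw), c in count.items()})
    return t

# Hypothetical toy corpus: (English sentence, German sentence)
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

t = ibm_model1(corpus)
for german in ("das", "haus"):
    print(german, round(t[(german, "house")], 3))
```

Even though "house" appears only once, next to both "das" and "haus", the model learns to prefer "haus": "das" is better explained by "the", which appears in a second sentence. This is the sense in which more data improves the model without hand-written rules.
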
  7. Encyclopedia?
     Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia...
     …There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.
     Doug Lenat, 1985
  8. Wikipedia
  9. Un+Structured Data
  10. YAGO
      • “Yet Another Great Ontology”, 2007, MPI
      • 10M entities, 120M facts
      • (AlbertEinstein, bornInYear, 1879)
      • (AlbertEinstein, hasWonPrize, NobelPrize)
      • (AlbertEinstein, isA, Physicist)
      • Uses the WordNet curated ontology, and expands it into Wikipedia entities
      • E.g. Albert Einstein is a Person
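
The facts on this slide are subject-predicate-object triples. Below is a minimal sketch of how such triples can be stored and queried in memory, and how "Albert Einstein is a Person" follows from a class hierarchy. It is not YAGO's actual storage or API: the `subClassOf` predicate name and the `Physicist subClassOf Person` fact are assumptions standing in for the WordNet hierarchy.

```python
TRIPLES = [
    ("AlbertEinstein", "bornInYear", "1879"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
    # Assumption: stands in for the WordNet class hierarchy.
    ("Physicist", "subClassOf", "Person"),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to) for ts, tp, to in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]

def is_a(triples, entity, cls):
    """True if entity has an isA class that equals cls or reaches cls
    through subClassOf links."""
    pending = [o for _, _, o in match(triples, s=entity, p="isA")]
    seen = set()
    while pending:
        current = pending.pop()
        if current == cls:
            return True
        if current in seen:
            continue
        seen.add(current)
        pending.extend(o for _, _, o in match(triples, s=current, p="subClassOf"))
    return False

print(match(TRIPLES, s="AlbertEinstein", p="hasWonPrize"))
print(is_a(TRIPLES, "AlbertEinstein", "Person"))
```

The wildcard pattern in `match` mirrors, in a very reduced form, what a triple pattern does in a query language over such data.
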
  11. YAGO
  12. YAGO
      • Knowledge acquisition:
        Work started in 2006
        2007: 1M entities, 5M facts
        2012: 10M entities, 120M facts
        Now adding places
      • Data export
      • Query over SPARQL
  13. DBpedia
      • Created an ontology from scratch
      • Crowdsourced the rule definition and mining
      • More coverage, but less coherent model and structure
      • 2.3M entities, 400M facts
      • Uses YAGO ontology as part of resources
      • Data export, and SPARQL queries
  14. ESA
      • Explicit Semantic Analysis
      • Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly
      • The name is a pun on Latent Semantic Analysis (LSA); a quick context recap follows…
  15. Latent Semantic Analysis
      • Technique to find “hidden” semantic relations between groups of terms in documents
  16. ESA
      Wikipedia articles are clear, coherent and universal semantic concepts
      Article words are associated with the concept (TF.IDF)
      Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
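
The association between article words and concepts can be sketched as a small TF-IDF index that is then read "sideways": for each word, which concepts is it associated with, and how strongly? The code below is a minimal sketch under stated assumptions. The three toy article texts are invented, and the weighting (relative term frequency times log inverse document frequency, whitespace tokenization, no pruning) is a simplification that may differ from the published ESA setup.

```python
import math
from collections import Counter, defaultdict

def build_esa_index(articles):
    """articles: {concept_title: text}.
    Returns {word: {concept_title: tfidf_weight}}."""
    tokenized = {concept: text.lower().split()
                 for concept, text in articles.items()}

    # Document frequency: in how many articles does each word occur?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    n = len(articles)
    index = defaultdict(dict)
    for concept, words in tokenized.items():
        tf = Counter(words)
        for word, count in tf.items():
            idf = math.log(n / df[word])
            if idf > 0:  # words found in every article carry no signal
                index[word][concept] = (count / len(words)) * idf
    return index

# Hypothetical miniature "Wikipedia"
articles = {
    "Panthera": "panthera is a genus of cat including leopard lion tiger that roar",
    "Cat": "the cat is a small domestic animal cat",
    "Jane Fonda": "jane fonda is an actress",
}

index = build_esa_index(articles)
for concept, weight in sorted(index["cat"].items(), key=lambda kv: -kv[1]):
    print(concept, round(weight, 3))
```

Looking up `index["cat"]` gives the word's vector over concepts, which is the representation the next slide describes.
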
  17. ESA
      Cat: Panthera [0.92], Cat [0.95], Jane Fonda [0.07]
      The semantics of a word is the vector of its associations with Wikipedia concepts
  18. ESA
      button: Dick Button [0.84], Button [0.93], Game Controller [0.32], Mouse (computing) [0.81]
      mouse: Mouse (computing) [0.84], Mouse (rodent) [0.91], John Steinbeck [0.17], Mickey Mouse [0.81]
      mouse button: Drag-and-drop [0.91], Mouse (computing) [0.95], Mouse (rodent) [0.56], Game Controller [0.64]
      The semantics of a text fragment is the average vector (centroid) of the semantics of its words
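
The centroid step can be sketched directly with sparse dictionaries. The word vectors below reuse the scores shown for "button" and "mouse" on this slide; the function name `centroid` is an assumption. Note that a plain average of these two truncated vectors does not reproduce the slide's "mouse button" numbers (for example, Drag-and-drop is in neither word vector), so the slide's figures are best read as illustrative.

```python
def centroid(vectors):
    """Average a list of sparse concept vectors ({concept: weight});
    a concept missing from a vector counts as weight 0."""
    concepts = set().union(*vectors)
    return {
        concept: sum(v.get(concept, 0.0) for v in vectors) / len(vectors)
        for concept in concepts
    }

button = {
    "Dick Button": 0.84,
    "Button": 0.93,
    "Game Controller": 0.32,
    "Mouse (computing)": 0.81,
}
mouse = {
    "Mouse (computing)": 0.84,
    "Mouse (rodent)": 0.91,
    "John Steinbeck": 0.17,
    "Mickey Mouse": 0.81,
}

mouse_button = centroid([button, mouse])
top_concept = max(mouse_button, key=mouse_button.get)
print(top_concept)
```

Concepts shared by both words, here Mouse (computing), keep a high weight after averaging, while concepts tied to only one sense of one word are halved. This is how the combined vector disambiguates toward the computing sense.
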
  19. Uses of ESA
      • Text Categorization
      • Semantic Relatedness
      • Information Retrieval
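
For the semantic relatedness use case, two texts can be compared by the cosine of the angle between their concept vectors. The following is a minimal sketch; the vectors reuse the illustrative scores from the earlier ESA slides, and the helper name `cosine` is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors ({concept: weight})."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is unrelated to everything
    return dot / (norm_u * norm_v)

button = {"Dick Button": 0.84, "Button": 0.93,
          "Game Controller": 0.32, "Mouse (computing)": 0.81}
mouse = {"Mouse (computing)": 0.84, "Mouse (rodent)": 0.91,
         "John Steinbeck": 0.17, "Mickey Mouse": 0.81}
cat = {"Panthera": 0.92, "Cat": 0.95, "Jane Fonda": 0.07}

print(round(cosine(mouse, button), 3))
print(round(cosine(mouse, cat), 3))
```

With these truncated vectors, "mouse" and "cat" share no concept and score zero; with full vectors over all of Wikipedia they would likely share some concepts and score above zero.
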
  20. More semantic projects
      • Word-sense disambiguation
      • Multi-lingual dictionary from language links
      • Cross-lingual search (Cross-Lingual-ESA)
      • WikiData
  21. Questions?
  22. References
      Cyc:
      • Lenat et al, CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985
      • Cycorp:
      YAGO:
      • Suchanek et al, YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
      • YAGO on Max-Planck Institut:
      ESA:
      • E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006
      • E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007
      • Egozi et al, Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011
      Others:
      • Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007
      • Erdmann et al, An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008
      • Potthast et al, A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008