
Extracting Meaning from Wikipedia

  1. KNOWLEDGE EXTRACTION FROM WIKIPEDIA (Ofer Egozi)
  2. Doug Lenat, Cyc, 1984: “Intelligence is 10 million rules…” Example CycL assertions: (#$genls #$Tree-ThePlant #$Plant) and (#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET)), from which it follows that an oak is a plant. The project was predicted to be complete in 10 years.
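The second assertion on this slide is Cyc's generic transitivity rule: anything that is an instance of a class is also an instance of that class's superclasses. A minimal sketch of that inference over a toy fact base (the Oak-Tree class and the oak1 instance are made-up illustrations; this is neither Cyc's engine nor CycL syntax):

```python
# Toy illustration of the rule on this slide:
# if (isa ?OBJ ?SUBSET) and (genls ?SUBSET ?SUPERSET) then (isa ?OBJ ?SUPERSET).

genls = {"Oak-Tree": "Tree-ThePlant", "Tree-ThePlant": "Plant"}  # subclass-of facts
isa = {"oak1": "Oak-Tree"}                                       # instance-of facts

def classes_of(obj):
    """Every class obj belongs to, chaining genls links transitively."""
    result = []
    cls = isa.get(obj)
    while cls is not None:
        result.append(cls)
        cls = genls.get(cls)
    return result

print(classes_of("oak1"))  # ['Oak-Tree', 'Tree-ThePlant', 'Plant'], i.e. an oak is a plant
```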
  3. Cyc Today: Can make impressive inferences, such as: • You have to be awake to eat • You cannot remember events that have not happened yet • If you cut a lump of peanut butter in half, each half is also a lump of peanut butter; if you cut a table in half, neither half is a table • When people die, they stay dead. But after 30 years and 700 man-years, only 2M+ rules… What went wrong?
  4. Knowledge Acquisition
  5. Machine Translation: Rule-Based Machine Translation (1970s): • Dictionary for both languages • Rules representing language structure • Parsing sentences to find structure • Mapping between structures. Built by human experts, accumulating rules over time; rules end up conflicting and ambiguous. Example: the Hebrew sentence תפוח אוכל ילד (“a boy eats an apple”, in object-verb-subject order) must be mapped to the English structure “Boy eats apple” (subject-verb-object).
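A minimal sketch of the dictionary-plus-structure-mapping idea, using transliterated words from the slide's Hebrew example and a single hand-written reordering rule (toy code for illustration; real rule-based systems of that era used full parsers and large, hand-maintained rule sets):

```python
# Rule-based transfer in miniature: look words up in a bilingual dictionary,
# assign grammatical roles by position, then emit them in the target language's order.
dictionary = {"tapuach": "apple", "okhel": "eats", "yeled": "boy"}  # transliterated Hebrew

def translate(tokens, source_order, target_order):
    roles = dict(zip(source_order, tokens))                      # "parse": assign roles
    words = {role: dictionary[w] for role, w in roles.items()}   # lexical transfer
    return " ".join(words[r] for r in target_order)              # structural transfer

# The slide's example: object-verb-subject in the source, subject-verb-object in English.
print(translate(["tapuach", "okhel", "yeled"], ("O", "V", "S"), ("S", "V", "O")))  # boy eats apple
```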
  6. Machine Translation: Statistical Translation (1990s): • Massive bilingual corpora • Corpus alignment • Calculate the probability that a word in the 1st language translates to a word in the 2nd language • Use n-gram models to take context into account. (Franz Och) Built by data scientists, no linguists needed; improves as more data gets added.
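A rough sketch of how word-translation probabilities can be estimated from a sentence-aligned corpus, in the spirit of the classic IBM word-alignment models (the three-sentence English/German corpus is made up; production systems add n-gram language models and vastly more data):

```python
from collections import defaultdict

# Tiny sentence-aligned corpus (English, German) -- illustrative only.
corpus = [("the house".split(), "das haus".split()),
          ("the book".split(), "das buch".split()),
          ("a book".split(), "ein buch".split())]

t = defaultdict(lambda: 0.25)   # t[(f, e)] = P(foreign word f | English word e), start uniform

for _ in range(10):                       # EM iterations
    count, total = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:              # E-step: expected alignment counts
                count[(f, e)] += t[(f, e)] / norm
                total[e] += t[(f, e)] / norm
    for (f, e), c in count.items():       # M-step: re-estimate probabilities
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 2))     # tends toward 1.0: "haus" aligns with "house"
```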
  7. Encyclopedia? Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia... …There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years. Doug Lenat, 1985
  8. Wikipedia
  9. Un+Structured Data
  10. YAGO: “Yet Another Great Ontology”, 2007, MPI • 10M entities, 120M facts • From http://en.wikipedia.org/wiki/Albert_Einstein it extracts facts such as (AlbertEinstein, bornInYear, 1879), (AlbertEinstein, hasWonPrize, NobelPrize), (AlbertEinstein, isA, Physicist) • Uses the curated WordNet ontology and expands it with Wikipedia entities, e.g. Albert Einstein is a Person
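The facts above are plain subject-predicate-object triples. A minimal sketch of storing and pattern-matching such triples in memory (values copied from the slide; this is not YAGO's actual storage format or API):

```python
# Facts as (subject, predicate, object) triples, as listed on the slide.
facts = {
    ("AlbertEinstein", "bornInYear", "1879"),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(s, p, o) for (s, p, o) in facts
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]

print(match(predicate="hasWonPrize"))  # [('AlbertEinstein', 'hasWonPrize', 'NobelPrize')]
```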
  11. YAGO
  12. YAGO: Knowledge acquisition: • Work started in 2006 • 2007: 1M entities, 5M facts • 2012: 10M entities, 120M facts • Now adding places. Access: data export, and queries over SPARQL.
  13. DBpedia: • Created an ontology from scratch • Crowdsourced the rule definition and mining • More coverage, but a less coherent model and structure • 2.3M entities, 400M facts • Uses the YAGO ontology as part of its resources • Data export, and SPARQL queries
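Both projects expose their triples through SPARQL. A hedged sketch of querying DBpedia's public endpoint over HTTP (the endpoint URL and the dbo:birthDate property are believed correct at the time of writing but should be treated as assumptions; this is not an official client):

```python
import requests

# Ask DBpedia for Albert Einstein's birth date.
query = """
SELECT ?birthDate WHERE {
  <http://dbpedia.org/resource/Albert_Einstein>
      <http://dbpedia.org/ontology/birthDate> ?birthDate .
}
"""
resp = requests.get("https://dbpedia.org/sparql",
                    params={"query": query, "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["birthDate"]["value"])  # expected: Einstein's date of birth
```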
  14. ESA: Explicit Semantic Analysis • Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly • The name is a pun on Latent Semantic Analysis (LSA); a quick context recap follows…
  15. Latent Semantic Analysis: A technique to find “hidden” semantic relations between groups of terms in documents
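A minimal sketch of LSA with scikit-learn, assuming a made-up four-document corpus: TF-IDF vectors are projected by truncated SVD into a small number of latent dimensions, where documents that share related vocabulary land close together:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a kitten played with the cat",
        "stock markets fell sharply today",
        "investors worried as shares fell"]

X = TfidfVectorizer().fit_transform(docs)          # documents x terms, sparse TF-IDF
Z = TruncatedSVD(n_components=2).fit_transform(X)  # project into 2 latent dimensions

print(cosine_similarity(Z[:1], Z[1:]))  # document 0 ends up closest to document 1
```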
  16. ESA: Wikipedia articles are clear, coherent and universal semantic concepts. Example concept: Panthera; article words are associated with the concept by TF.IDF weight: Cat [0.92], Leopard [0.84], Roar [0.77]
  17. ESA: The semantics of a word is the vector of its associations with Wikipedia concepts. Example for the word “Cat”: Cat [0.95], Panthera [0.92], Jane Fonda [0.07]
  18. ESA: The semantics of a text fragment is the average vector (centroid) of the semantics of its words. Examples: “button”: Button [0.93], Dick Button [0.84], Mouse (computing) [0.81], Game Controller [0.32]; “mouse”: Mouse (rodent) [0.91], Mouse (computing) [0.84], Mickey Mouse [0.81], John Steinbeck [0.17]; “mouse button”: Mouse (computing) [0.95], Drag-and-drop [0.91], Game Controller [0.64], Mouse (rodent) [0.56]
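A minimal sketch of the ESA representation on slides 16-18, hard-coding two word-to-concept vectors over a subset of the slide's concepts and its illustrative weights (a real interpreter derives these vectors from TF.IDF scores over the full Wikipedia article collection):

```python
import numpy as np

concepts = ["Button", "Dick Button", "Game Controller",
            "Mouse (computing)", "Mouse (rodent)", "Mickey Mouse"]

# Concept-association vectors for two words, weights taken from the slide.
word_vectors = {
    "button": np.array([0.93, 0.84, 0.32, 0.81, 0.00, 0.00]),
    "mouse":  np.array([0.00, 0.00, 0.00, 0.84, 0.91, 0.81]),
}

def interpret(text):
    """ESA semantics of a text fragment: centroid of its words' concept vectors."""
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def relatedness(a, b):
    """Semantic relatedness as cosine similarity between ESA vectors."""
    va, vb = interpret(a), interpret(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

v = interpret("mouse button")
print(sorted(zip(concepts, v), key=lambda x: -x[1])[:3])  # led by Mouse (computing)
print(round(relatedness("mouse", "button"), 2))           # > 0: both activate Mouse (computing)
```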
  19. Uses of ESA: • Text Categorization • Semantic Relatedness • Information Retrieval
  20. More semantic projects: • Word-sense disambiguation • Multi-lingual dictionary from language links • Cross-lingual search (Cross-Lingual-ESA) • WikiData
  21. Questions?
  22. References: Cyc: Lenat et al., CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985; Cycorp: http://www.cyc.com/ • YAGO: Suchanek et al., YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007; YAGO at the Max-Planck-Institut: http://www.mpi-inf.mpg.de/yago-naga/yago/ • ESA: E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006; E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007; Egozi et al., Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011 • Others: Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007; Erdmann et al., An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008; Potthast et al., A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008

Editor's Notes

  • Lenat actually explained that Cyc would solve the bottleneck by moving it to the decision of what data to enter, rather than the entry process itself. Compared to entering specific rules, entering facts and generalized rules is certainly better, but still manual.
  • Fast forward 20 years…
  • There were quite a few efforts to use this wealth of information; I’ll speak about one that was quite impressive in its breadth and comparable to Cyc.
