
Semantic Text Processing Powered by Wikipedia


A technical overview of our Wikipedia-based Semantic Text Analysis Technology

Published in: Technology, Business


  1. Semantic Text Processing Powered by Wikipedia
     Maxim Grinev, [email_address]
  2. Technology Overview
     - Next-generation text analysis bootstrapped by Wikipedia
     - Wikipedia is a new enabling resource for NLP:
       - Comprehensive coverage (6M terms versus 65K in Britannica)
       - Continuously kept up to date
       - Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, infoboxes)
     - New algorithms:
       - Advanced NLP: word sense disambiguation, keyword extraction, topic inference
       - Automatic ontology management: organizing concepts into thematically grouped tag clouds
       - Semantic search: concept-based similarity search, smart faceted navigation
       - Improved recommendations: semantic document similarity
     - Zero-cost deployment and customization: no machine-learning techniques that require human labor, no "cold start"
  3. Basic Technique: Semantic Relatedness of Terms
     - We analyze the Wikipedia link structure to compute the semantic relatedness of Wikipedia terms
     - We use a Dice measure with weighted links (bidirectional links, direct links, "see also" links, etc.)
     Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. "Accuracy Estimate and Optimization Techniques for SimRank Computation", VLDB 2008
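The weighted Dice measure on the slide can be sketched as follows. This is a minimal illustration, assuming each article's outgoing links are tagged with a link type; the specific weight values and the example link sets are invented for the demo, not taken from the paper.

```python
# Illustrative link-type weights -- assumed values, not the paper's.
LINK_WEIGHTS = {"bidirectional": 1.0, "direct": 0.7, "see_also": 0.5}

def weighted_dice(links_a, links_b):
    """Weighted Dice relatedness of two terms; links_a/links_b map a
    linked article title to that link's type."""
    weight_a = sum(LINK_WEIGHTS[t] for t in links_a.values())
    weight_b = sum(LINK_WEIGHTS[t] for t in links_b.values())
    shared = set(links_a) & set(links_b)
    # Count each shared link once, with the heavier of its two weights.
    overlap = sum(max(LINK_WEIGHTS[links_a[p]], LINK_WEIGHTS[links_b[p]])
                  for p in shared)
    if weight_a + weight_b == 0:
        return 0.0
    return 2.0 * overlap / (weight_a + weight_b)

# Toy link sets for two hypothetical articles.
apple = {"IPhone": "bidirectional", "Steve Jobs": "direct", "NeXT": "see_also"}
microsoft = {"IPhone": "direct", "Windows": "bidirectional", "Steve Jobs": "direct"}
print(round(weighted_dice(apple, microsoft), 3))  # -> 0.739
```

The plain Dice coefficient 2|A∩B|/(|A|+|B|) is recovered by setting all link weights to 1.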
  4. Terms Detection and Disambiguation
     - Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
     - We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
     - Example: Platform is mentioned in the context of implementation, open-source, web server, HTTP
     Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation", SYRCoDIS 2008
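One way to read the slide's disambiguation idea as code: among a term's candidate senses (gathered from its disambiguation page), pick the sense most related to the unambiguous terms around it. A minimal sketch, assuming a `relatedness` function such as the weighted Dice measure from the previous slide; here it is replaced by a toy lookup table with made-up scores.

```python
# Toy relatedness scores -- illustrative stand-in for the Wikipedia-based measure.
RELATEDNESS = {
    ("International Business Machines", "Web server"): 0.80,
    ("International Business Machines", "Open source"): 0.60,
    ("International Brotherhood of Magicians", "Web server"): 0.05,
    ("International Brotherhood of Magicians", "Open source"): 0.10,
}

def relatedness(a, b):
    return RELATEDNESS.get((a, b), 0.0)

def disambiguate(candidates, context_terms):
    """Return the candidate sense with the highest mean relatedness
    to the surrounding (already unambiguous) context terms."""
    def score(sense):
        return sum(relatedness(sense, c) for c in context_terms) / len(context_terms)
    return max(candidates, key=score)

sense = disambiguate(
    ["International Business Machines", "International Brotherhood of Magicians"],
    ["Web server", "Open source"],
)
print(sense)  # -> International Business Machines
```

In a technical context ("web server", "open source"), the computing sense of "IBM" wins; in a context about stage magic, the same procedure would prefer the other sense.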
  5. Keywords Extraction
     - Build a document semantic graph using semantic relatedness between the Wikipedia terms detected in the document
     - Discover the community structure of the document semantic graph
       - Community: a densely interconnected group of nodes in a graph
       - Girvan-Newman algorithm for detecting community structure in networks
     - Select the "best" communities:
       - Dense communities contain key terms
       - Sparse communities contain unimportant terms and possible disambiguation mistakes
     Maria Grineva, Maxim Grinev, Dmitry Lizorkin. "Extracting Key Terms From Noisy and Multitheme Documents", WWW 2009: 18th International World Wide Web Conference
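The "select the best communities" step can be sketched as a density filter. This assumes the communities were already produced by Girvan-Newman over the document's semantic graph; the edge set, the communities, and the 0.5 threshold below are all illustrative, not the paper's values.

```python
def density(community, edges):
    """Fraction of possible node pairs inside the community that are linked."""
    n = len(community)
    if n < 2:
        return 0.0
    inside = sum(1 for a, b in edges if a in community and b in community)
    return inside / (n * (n - 1) / 2)

# Toy semantic graph: a tight "Apple/iTunes" theme plus a sparser group.
edges = {("Apple Inc.", "ITunes"), ("Apple Inc.", "IPod"),
         ("Apple Inc.", "Steve Jobs"), ("ITunes", "IPod"),
         ("ITunes", "Steve Jobs"), ("IPod", "Steve Jobs"),
         ("Blindness", "Accessibility")}

# Assumed output of Girvan-Newman on this graph.
communities = [
    {"Apple Inc.", "ITunes", "IPod", "Steve Jobs"},   # dense: key terms
    {"Blindness", "Accessibility", "Screen reader"},  # sparse: likely noise
]

key_terms = [c for c in communities if density(c, edges) >= 0.5]
print(key_terms)  # keeps only the dense Apple/iTunes community
```

The dense community (all 6 of 6 possible edges present, density 1.0) survives; the sparse one (1 of 3, density 0.33) is dropped as noise or a disambiguation mistake.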
  6. Keywords Extraction (Example)
     Semantic graph built from the news article "Apple to Make ITunes More Accessible For the Blind"
  7. Advantages of the Keywords Extraction Method
     - No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
     - Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages
     - Thematically grouped key terms. Significantly improves further inference of document topics
     - High accuracy. Evaluated using human judgments
  8. Other Methods
     - General topic inference for a document
       - Using spreading activation over the Wikipedia categories graph
       - Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing
     - Building thematically grouped tag clouds for many documents
       - Girvan-Newman algorithm to split terms into thematic groups
       - Topic inference for each group
     - Document classification
       - Semantic similarity is used to identify indirect relationships between terms (e.g. a document about collaborative filtering is classified under recommender systems)
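The spreading-activation step can be sketched as follows: activate the detected terms, propagate activation up the category graph with decay, and report the most activated category. The category edges, decay factor, and step count are illustrative assumptions; the real system walks Wikipedia's full category graph.

```python
# Toy fragment of the Wikipedia category graph -- assumed, for illustration.
CATEGORY_PARENTS = {
    "Amazon EC2": ["Cloud infrastructure"],
    "Microsoft Azure": ["Cloud platforms"],
    "Google MapReduce": ["Distributed computing"],
    "Cloud infrastructure": ["Cloud computing"],
    "Cloud platforms": ["Cloud computing"],
    "Distributed computing": ["Cloud computing"],
}

def infer_topic(terms, decay=0.5, steps=2):
    """Spread activation from the seed terms up the category graph,
    attenuating by `decay` per hop; return the top-scoring category."""
    activation = {t: 1.0 for t in terms}
    frontier = dict(activation)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for parent in CATEGORY_PARENTS.get(node, []):
                passed = energy * decay
                activation[parent] = activation.get(parent, 0.0) + passed
                next_frontier[parent] = next_frontier.get(parent, 0.0) + passed
        frontier = next_frontier
    # Report the best category, excluding the seed terms themselves.
    categories = {n: a for n, a in activation.items() if n not in terms}
    return max(categories, key=categories.get)

print(infer_topic(["Amazon EC2", "Microsoft Azure", "Google MapReduce"]))
# -> Cloud computing
```

The three seeds each activate a different intermediate category, but their activation converges on the shared ancestor "Cloud computing", reproducing the slide's example.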
  9. Semantic Search & Navigation
     - Search by concept:
       - Advantages of query and in-document term disambiguation
       - Result: documents about the concept and related concepts, ordered by relevance (keywordness)
     - Smart faceted navigation: query-relevant facets using semantic relatedness
     - Concept tips to grasp the result documents
       - Each document in the result is accompanied by concept tips that explain how the document is relevant to the query
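One possible reading of "query-relevant facets using semantic relatedness" in code: score each candidate facet concept by its relatedness to the query concept and keep only the top-scoring ones. The candidate facets, scores, threshold, and limit below are all invented for the sketch.

```python
# Toy relatedness of candidate facet concepts to an assumed query
# concept "Cloud computing" -- illustrative values only.
RELATEDNESS_TO_QUERY = {
    "Virtualization": 0.82,
    "Amazon Web Services": 0.78,
    "Pricing models": 0.55,
    "Stage magic": 0.02,
}

def relevant_facets(candidates, threshold=0.3, limit=3):
    """Rank candidate facets by relatedness to the query concept and
    keep at most `limit` facets above the relevance threshold."""
    ranked = sorted(candidates, key=RELATEDNESS_TO_QUERY.get, reverse=True)
    return [f for f in ranked if RELATEDNESS_TO_QUERY[f] >= threshold][:limit]

print(relevant_facets(list(RELATEDNESS_TO_QUERY)))
# -> ['Virtualization', 'Amazon Web Services', 'Pricing models']
```

An unrelated concept such as "Stage magic" never surfaces as a facet, which is the point of making facet selection query-relevant rather than static.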
  10. Facets Generation
  11. Facets Generation (cont.)
  12. Facets Generation (cont.)
  13. Facets Generation (cont.)
  14. Thank You!