Semantic Text Processing Powered by Wikipedia


A technical overview of our Wikipedia-based Semantic Text Analysis Technology

Published in: Technology, Business
  • We've developed a new technology for semantic text analysis and semantic search. The main idea behind our technology is that we use knowledge extracted from Wikipedia to facilitate text analysis. By now Wikipedia has grown into the biggest database of concepts and their relationships that has ever existed. Wikipedia is great for a number of reasons: 1) comprehensive coverage (it contains very general concepts such as car, computer, and government, as well as a lot of niche concepts such as new small startup companies or people known only in some communities); 2) it is continuously brought up-to-date (it is often updated within minutes of an announcement); 3) it is well-structured (it has redirects, which capture synonyms — for example, Ivan the Terrible redirects to Ivan IV of Russia — and it has disambiguation pages, which capture homonyms by listing the different meanings of a term: IBM may stand for International Business Machines or International Brotherhood of Magicians). Using Wikipedia as a big knowledge base allows us to significantly improve a number of existing techniques and to develop new techniques that were not possible before, such as advanced NLP. This is just a list of techniques; I will explain how it all works.
  • Betweenness – how much an edge lies “in between” different communities; modularity – a partition is a good one if there are many edges within communities and only a few between them.
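The modularity notion in the note above can be made concrete. Below is a minimal sketch of Newman's modularity Q for an undirected graph in plain Python; the toy graph (two triangles joined by a bridge edge) is invented for illustration and is not from the actual system:

```python
def modularity(edges, communities):
    """Newman modularity Q: for each community, the fraction of edges
    inside it minus the fraction expected from node degrees alone."""
    m = len(edges)                      # total number of edges
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    comm_of = {n: i for i, group in enumerate(communities) for n in group}
    intra = [0] * len(communities)      # edges inside each community
    for u, v in edges:
        if comm_of[u] == comm_of[v]:
            intra[comm_of[u]] += 1
    deg_sum = [0] * len(communities)    # sum of degrees per community
    for n, d in degree.items():
        deg_sum[comm_of[n]] += d
    return sum(e / m - (d / (2 * m)) ** 2
               for e, d in zip(intra, deg_sum))

# Two triangles joined by a single bridge edge (3, 4):
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_good = modularity(edges, [{1, 2, 3}, {4, 5, 6}])   # split at the bridge
q_triv = modularity(edges, [{1, 2, 3, 4, 5, 6}])     # everything together
```

Splitting at the bridge gives Q ≈ 0.357 while the trivial one-community partition gives Q = 0, matching the intuition that a good partition has many edges within communities and only a few between them.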

    1. Semantic Text Processing Powered by Wikipedia Maxim Grinev [email_address]
    2. Technology Overview <ul><li>Next Generation Text Analysis bootstrapped by Wikipedia </li></ul><ul><li>Wikipedia is a new enabling resource for NLP </li></ul><ul><ul><li>Comprehensive coverage ( 6M terms versus 65K in Britannica ) </li></ul></ul><ul><ul><li>Continuously brought up-to-date </li></ul></ul><ul><ul><li>Rich structure ( cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes ) </li></ul></ul><ul><li>New Algorithms: </li></ul><ul><ul><li>Advanced NLP: Word Sense Disambiguation, Keywords Extraction, Topic Inference </li></ul></ul><ul><ul><li>Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds </li></ul></ul><ul><ul><li>Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation </li></ul></ul><ul><ul><li>Improved Recommendations: Semantic Document Similarity </li></ul></ul><ul><li>Zero-cost deployment and customization: no machine-learning techniques that require human labor, no “cold start” </li></ul>
    3. <ul><li>We analyse the Wikipedia link structure to compute semantic relatedness of Wikipedia terms </li></ul><ul><li>We use the Dice measure with weighted links (bi-directional links, direct links, “see also” links, etc.) </li></ul>Basic Technique: Semantic Relatedness of Terms Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov, “Accuracy Estimate and Optimization Techniques for SimRank Computation”, VLDB 2008
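The Dice measure on the slide above can be sketched directly over two articles' link neighbourhoods. The link sets and weights below are invented for illustration; in the actual system, weights would be assigned per link type (bi-directional vs. direct vs. “see also”):

```python
def dice_relatedness(links_a, links_b, weight=None):
    """Weighted Dice coefficient of two articles' link sets.
    `weight` maps a linked article to its importance (default 1.0)."""
    w = weight or {}
    a, b = set(links_a), set(links_b)
    total = sum(w.get(l, 1.0) for l in a) + sum(w.get(l, 1.0) for l in b)
    if total == 0:
        return 0.0                       # two articles with no links at all
    return 2 * sum(w.get(l, 1.0) for l in a & b) / total

# Unweighted toy example: two of the links are shared.
r = dice_relatedness({"Car", "Engine", "Road"}, {"Engine", "Road", "Wheel"})
```

Here r = 2·2 / (3+3) = 2/3; passing a `weight` dict lets “see also” or bi-directional links count for more than ordinary ones.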
    4. Terms Detection and Disambiguation <ul><li>Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians </li></ul><ul><li>We use Wikipedia redirects (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text </li></ul><ul><li>Example: Platform is mentioned in the context of implementation , open-source , web-server , HTTP </li></ul><ul><li>Denis Turdakov, Pavel Velikhov, “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”, SYRCoDIS, 2008 </li></ul>
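The two mechanisms on this slide can be sketched as follows, under invented toy data: a redirect map stands in for Wikipedia's redirect pages (synonyms), and a small relatedness table stands in for the link-based measure of the previous slide. Detection resolves a surface form to a canonical title; disambiguation picks the candidate meaning most related to the surrounding context terms:

```python
def canonicalize(term, redirects):
    """Follow redirect pages to the canonical article title (synonyms)."""
    seen = set()
    while term in redirects and term not in seen:
        seen.add(term)                   # guard against redirect cycles
        term = redirects[term]
    return term

def disambiguate(candidates, context, relatedness):
    """Pick the candidate meaning (from a disambiguation page) with the
    highest total relatedness to the context terms (homonyms)."""
    return max(candidates,
               key=lambda m: sum(relatedness(m, c) for c in context))

redirects = {"Ivan the Terrible": "Ivan IV of Russia"}
rel = {("International Business Machines", "computer"): 0.9,
       ("International Business Machines", "software"): 0.8,
       ("International Brotherhood of Magicians", "computer"): 0.1,
       ("International Brotherhood of Magicians", "software"): 0.1}
meaning = disambiguate(["International Business Machines",
                        "International Brotherhood of Magicians"],
                       ["computer", "software"],
                       lambda m, c: rel.get((m, c), 0.0))
```

In a computing context, "IBM" resolves to International Business Machines rather than the magicians' society, which is exactly the slide's example.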
    5. Keywords Extraction <ul><li>Build the document semantic graph using semantic relatedness between Wikipedia terms detected in the doc </li></ul><ul><li>Discover the community structure of the document semantic graph </li></ul><ul><ul><li>Community – a densely interconnected group of nodes in a graph </li></ul></ul><ul><ul><li>Girvan-Newman algorithm for detecting community structure in networks </li></ul></ul><ul><li>Select the “best” communities: </li></ul><ul><ul><li>Dense communities contain key terms </li></ul></ul><ul><ul><li>Sparse communities contain unimportant terms and possible disambiguation mistakes </li></ul></ul>Maria Grineva, Maxim Grinev, Dmitry Lizorkin, “Extracting Key Terms From Noisy and Multitheme Documents”, WWW 2009: 18th International World Wide Web Conference
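A simplified end-to-end sketch of this pipeline. For brevity it thresholds relatedness and uses connected components as a cheap stand-in for Girvan-Newman community detection, then ranks communities by internal edge density, since per the slide the dense groups carry the key terms. All term names and scores below are invented:

```python
def key_term_communities(terms, rel, threshold=0.3):
    """Build the document semantic graph, split it into communities,
    and return the communities densest-first."""
    adj = {t: set() for t in terms}
    for i, a in enumerate(terms):                 # semantic graph: edge when
        for b in terms[i + 1:]:                   # two terms are related enough
            if rel(a, b) >= threshold:
                adj[a].add(b)
                adj[b].add(a)
    seen, groups = set(), []
    for t in terms:                               # connected components as a
        if t in seen:                             # stand-in for Girvan-Newman
            continue
        stack, comp = [t], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        groups.append(comp)
    def density(c):                               # fraction of possible
        if len(c) < 2:                            # intra-community edges
            return 0.0
        intra = sum(len(adj[a] & c) for a in c) / 2
        return intra / (len(c) * (len(c) - 1) / 2)
    return sorted(groups, key=density, reverse=True)

scores = {("Apple Inc.", "iTunes"): 0.8, ("Apple Inc.", "Blindness"): 0.1,
          ("iTunes", "Blindness"): 0.1, ("Apple Inc.", "Screen reader"): 0.1,
          ("iTunes", "Screen reader"): 0.1, ("Blindness", "Screen reader"): 0.7}
rel = lambda a, b: scores.get((a, b), scores.get((b, a), 0.0))
groups = key_term_communities(
    ["Apple Inc.", "iTunes", "Blindness", "Screen reader", "Lawsuit"], rel)
```

The two thematic pairs form dense communities and rank first, while the unrelated term "Lawsuit" ends up in a sparse singleton community at the bottom, illustrating how noise and disambiguation mistakes get filtered out.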
    6. Keywords Extraction (Example) Semantic graph built from a news article &quot;Apple to Make iTunes More Accessible for the Blind&quot;
    7. Advantages of the Keywords Extraction Method <ul><li>No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia </li></ul><ul><li>Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages </li></ul><ul><li>Thematically grouped key terms. Significantly improves subsequent inference of document topics </li></ul><ul><li>High accuracy. Evaluated using human judgments </li></ul>
    8. Other Methods <ul><li>General Topic Inference for a doc </li></ul><ul><ul><li>using spreading activation over the Wikipedia category graph </li></ul></ul><ul><ul><li>Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing </li></ul></ul><ul><li>Building Thematically Grouped Tag Clouds for many docs </li></ul><ul><ul><li>Girvan-Newman algorithm to split terms into thematic groups </li></ul></ul><ul><ul><li>Topic inference for each group </li></ul></ul><ul><li>Document classification </li></ul><ul><ul><li>Semantic similarity is used to identify indirect relationships between terms (e.g. a doc about collaborative filtering is classified under recommender system ) </li></ul></ul>
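The spreading-activation step can be sketched over a toy category graph. The `parents` map, depth, and decay factor below are invented for illustration; the real graph is Wikipedia's category hierarchy:

```python
def infer_topics(seed_terms, parents, depth=2, decay=0.5):
    """Spread activation from detected terms up the category graph;
    categories reached from several seeds accumulate the most activation."""
    activation = {t: 1.0 for t in seed_terms}
    frontier = dict(activation)
    for _ in range(depth):
        nxt = {}
        for node, a in frontier.items():          # push a decayed share of
            for cat in parents.get(node, ()):     # activation to each parent
                nxt[cat] = nxt.get(cat, 0.0) + a * decay
        for cat, a in nxt.items():
            activation[cat] = activation.get(cat, 0.0) + a
        frontier = nxt
    cats = [c for c in activation if c not in set(seed_terms)]
    return sorted(cats, key=activation.get, reverse=True)

parents = {"Amazon EC2": ["Cloud computing"],
           "Microsoft Azure": ["Cloud computing"],
           "Google MapReduce": ["Distributed computing"],
           "Cloud computing": ["Computing"],
           "Distributed computing": ["Computing"]}
topics = infer_topics(["Amazon EC2", "Microsoft Azure", "Google MapReduce"],
                      parents)
```

Because two of the three seeds activate it directly, "Cloud computing" accumulates the highest score, reproducing the slide's example inference.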
    9. Semantic Search & Navigation <ul><li>Search by Concept : </li></ul><ul><ul><li>Takes advantage of disambiguating both query terms and in-doc terms </li></ul></ul><ul><ul><li>Result: documents about the concept and related concepts, ordered by relevance (keywordness) </li></ul></ul><ul><li>Smart Faceted Navigation : query-relevant facets computed using semantic relatedness </li></ul><ul><li>Concept-tips to grasp the result documents </li></ul><ul><ul><li>Each document in the result is accompanied by concept-tips that explain how the document is relevant to the query </li></ul></ul>
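Concept-based ranking can be sketched as follows, assuming each document has already been reduced to its disambiguated key concepts with keywordness scores; all document names, concepts, and scores below are invented:

```python
def search_by_concept(query_concept, docs, rel):
    """Rank documents by how related their key concepts are to the query
    concept, weighting each concept by its keywordness score."""
    def score(concepts):
        return sum(kw * rel(query_concept, c) for c, kw in concepts.items())
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

# Toy relatedness: a concept is most related to itself, then to neighbours.
rel_scores = {("Cloud computing", "Amazon EC2"): 0.8,
              ("Cloud computing", "Cloud computing"): 1.0,
              ("Cloud computing", "Opera"): 0.0}
rel = lambda a, b: rel_scores.get((a, b), 0.0)
docs = {"ec2-guide": {"Amazon EC2": 0.9},
        "cloud-intro": {"Cloud computing": 0.9},
        "opera-review": {"Opera": 0.9}}
ranking = search_by_concept("Cloud computing", docs, rel)
```

Note that "ec2-guide" ranks above the unrelated document even though it never mentions the query concept itself, which is the point of searching by concept rather than by keyword match.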
    10. Facets Generation
    11. Facets Generation (cont.)
    12. Facets Generation (cont.)
    13. Facets Generation (cont.)
    14. Thank You!