Semantic Text Processing Powered by Wikipedia Maxim Grinev [email_address]
Technology Overview
Next-generation text analysis bootstrapped by Wikipedia

Wikipedia is a new enabling resource for NLP:
- Comprehensive coverage (6M terms versus 65K in Britannica)
- Continuously brought up to date
- Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)

New algorithms:
- Advanced NLP: Word Sense Disambiguation, Keywords Extraction, Topic Inference
- Automatic Ontology Management: organizing concepts into thematically grouped tag clouds
- Semantic Search: concept-based similarity search, Smart Faceted Navigation
- Improved Recommendations: semantic document similarity
- Zero-cost deployment and customization: no machine learning techniques that require human labor, no "cold start"
Basic Technique: Semantic Relatedness of Terms
We analyse the Wikipedia link structure to compute the semantic relatedness of Wikipedia terms.
We use a Dice measure with weighted links (bi-directional links, direct links, "see also" links, etc.).
Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008
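A minimal sketch of this relatedness computation, assuming each term's Wikipedia article is represented by the set of articles it links to, annotated with a link type; the link types and weights below are illustrative assumptions, not the exact scheme from the cited paper.

```python
# Sketch: Dice-style relatedness over weighted Wikipedia links.
# Link sets and per-link-type weights are illustrative assumptions.

LINK_WEIGHTS = {"bidirectional": 2.0, "direct": 1.0, "see_also": 1.5}

def weighted_dice(links_a, links_b):
    """links_a, links_b: dicts mapping linked article title -> link type."""
    def weight(links, article):
        return LINK_WEIGHTS.get(links[article], 1.0)

    common = set(links_a) & set(links_b)
    overlap = sum(weight(links_a, a) + weight(links_b, a) for a in common)
    total = (sum(weight(links_a, a) for a in links_a)
             + sum(weight(links_b, b) for b in links_b))
    return overlap / total if total else 0.0

# Toy example with made-up link sets
ibm = {"Computer": "bidirectional", "Mainframe": "direct", "Watson": "see_also"}
apple = {"Computer": "bidirectional", "IPhone": "direct", "Watson": "direct"}
print(weighted_dice(ibm, apple))  # relatedness score in [0, 1]
```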
Term Detection and Disambiguation
Example: IBM may stand for International Business Machines Corp. or the International Brotherhood of Magicians.
We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text.
Example: Platform is mentioned in the context of implementation, open-source, web-server, HTTP.
Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation". SYRCoDIS 2008
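A sketch of how such context-based disambiguation could look, assuming a hypothetical relatedness(a, b) function (for instance the weighted Dice measure sketched above) and candidate senses taken from a disambiguation page.

```python
# Sketch: pick the candidate sense most related to the surrounding context
# terms. `relatedness` is an assumed callable, e.g. the measure sketched above.

def disambiguate(candidates, context_terms, relatedness):
    """candidates: Wikipedia article titles listed on a disambiguation page;
    context_terms: already-resolved Wikipedia terms found near the mention."""
    def score(candidate):
        if not context_terms:
            return 0.0
        return sum(relatedness(candidate, t) for t in context_terms) / len(context_terms)

    return max(candidates, key=score)

# Toy usage: "platform" surrounded by implementation/open-source/web-server/HTTP
# terms should resolve to the computing sense rather than, say, an oil platform.
senses = ["Computing platform", "Oil platform", "Railway platform"]
context = ["Open-source software", "Web server", "HTTP"]
# best = disambiguate(senses, context, relatedness)
```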
Keywords Extraction
Build a document semantic graph using semantic relatedness between the Wikipedia terms detected in the document.
Discover the community structure of the document semantic graph:
- Community – a densely interconnected group of nodes in a graph
- Girvan-Newman algorithm for detecting community structure in networks
Select the "best" communities:
- Dense communities contain key terms
- Sparse communities contain unimportant terms and possible disambiguation mistakes
Maria Grineva, Maxim Grinev, Dmitry Lizorkin. Extracting Key Terms From Noisy and Multitheme Documents. WWW 2009: 18th International World Wide Web Conference
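The pipeline above might be sketched as follows with NetworkX's Girvan-Newman implementation; the edge threshold, the number of communities, and the density-based community scoring are illustrative assumptions, not the published parameters.

```python
# Sketch: build a document semantic graph and keep terms from dense communities.
# Requires networkx; thresholds and scoring are illustrative assumptions.
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

def extract_key_terms(terms, relatedness, edge_threshold=0.3, n_communities=4):
    # 1. Document semantic graph: nodes are detected Wikipedia terms,
    #    edges connect sufficiently related term pairs.
    g = nx.Graph()
    g.add_nodes_from(terms)
    for a, b in itertools.combinations(terms, 2):
        w = relatedness(a, b)
        if w >= edge_threshold:
            g.add_edge(a, b, weight=w)

    # 2. Community structure via Girvan-Newman; take the n-community level
    #    (assumes the graph is connected).
    levels = girvan_newman(g)
    communities = next(itertools.islice(levels, n_communities - 2, None))

    # 3. Rank communities: dense (and larger) ones hold the key terms,
    #    sparse ones hold noise and disambiguation mistakes.
    def community_score(nodes):
        return nx.density(g.subgraph(nodes)) * len(nodes)

    best = max(communities, key=community_score)
    return sorted(best)
```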
Keywords Extraction (Example)
Semantic graph built from a news article: "Apple to Make iTunes More Accessible For the Blind"
Advantages of the Keywords Extraction Method
- No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia.
- Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages.
- Thematically grouped key terms. Significantly improve further inference of document topics.
- High accuracy. Evaluated using human judgments.
Other Methods
General topic inference for a document using spreading activation over the Wikipedia categories graph:
- Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing
Building thematically grouped tag clouds for many documents:
- Girvan-Newman algorithm to split terms into thematic groups
- Topic inference for each group
Document classification:
- Semantic similarity is used to identify indirect relationships between terms (e.g. a document about collaborative filtering is classified under recommender systems)
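A sketch of spreading activation over the category graph, assuming a hypothetical parent_categories(node) lookup into the Wikipedia category hierarchy; the decay factor, depth, and top-n cut-off are illustrative parameters.

```python
# Sketch: infer general topics by spreading activation from detected terms
# up the Wikipedia category graph. `parent_categories` is an assumed lookup
# (term or category -> its Wikipedia parent categories).
from collections import defaultdict

def infer_topics(key_terms, parent_categories, decay=0.5, depth=3, top_n=3):
    activation = defaultdict(float)
    frontier = {term: 1.0 for term in key_terms}

    for _ in range(depth):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for category in parent_categories(node):
                passed = energy * decay          # energy decays at each hop
                activation[category] += passed
                next_frontier[category] += passed
        frontier = next_frontier

    # The most activated categories are the inferred general topics,
    # e.g. {Amazon EC2, Microsoft Azure, Google MapReduce} -> "Cloud computing".
    return sorted(activation, key=activation.get, reverse=True)[:top_n]
```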
Semantic Search & Navigation
Search by Concept: benefits from disambiguation of the query and of in-document terms.
Result: documents about the concept and related concepts, ordered by relevance ("keywordness").
Smart Faceted Navigation: query-relevant facets built using semantic relatedness.
Concept-tips to grasp the result documents: each document in the result is accompanied by concept-tips that explain how the document is relevant to the query.
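One plausible reading of facet generation and concept-tips, assuming each result document carries its extracted key terms and the relatedness measure from the earlier slides; the actual algorithm behind the "Facets Generation" slides below is not spelled out in the text.

```python
# Sketch: query-relevant facets and per-document concept-tips.
# `relatedness`, the threshold, and the scoring are illustrative assumptions.

def facets_for_query(query_concept, result_docs, relatedness, min_rel=0.3, top_n=10):
    """result_docs: list of (doc_id, key_terms). Facets are key terms that
    recur across results and are semantically related to the query concept."""
    counts = {}
    for _, terms in result_docs:
        for t in set(terms):
            if relatedness(query_concept, t) >= min_rel:
                counts[t] = counts.get(t, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top_n]

def concept_tips(query_concept, key_terms, relatedness, top_n=3):
    """Key terms of one document that best explain why it matches the query."""
    return sorted(key_terms, key=lambda t: relatedness(query_concept, t),
                  reverse=True)[:top_n]
```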
Facets Generation
Facets Generation (cont.)
Facets Generation (cont.)
Facets Generation (cont.)
Thank You!


Editor's Notes

  • #3 We've developed a new technology for semantic text analysis and semantic search. The main idea behind our technology is that we use knowledge extracted from Wikipedia to facilitate text analysis. By now, Wikipedia has grown into the biggest database of concepts and their relationships that has ever existed. Wikipedia is great for a number of reasons: 1) Comprehensive coverage (it contains very general concepts such as car, computer, government, etc., and a lot of niche concepts such as new small startup companies or people known only in some communities). 2) Continuously brought up to date (it is often updated just minutes after announcements). 3) It is well structured (it has redirects, which are synonyms – Ivan the Terrible is redirected to Ivan IV of Russia – and disambiguation pages, which list the different meanings of a term, i.e. homonyms – IBM may stand for International Business Machines or the International Brotherhood of Magicians). Using Wikipedia as a big knowledge base allows us to significantly improve a number of existing techniques and to develop new techniques that were not possible before. The slide lists the techniques that we developed (Advanced NLP, etc.); I will explain how it all works.
  • #6 Betweenness – how much an edge lies "in between" different communities. Modularity – a partition is a good one if there are many edges within communities and only a few between them.
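For reference, the standard Newman modularity (not spelled out on the slide) can be written as:

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A_{ij} is the adjacency matrix of the graph, k_i is the degree of node i, m is the total number of edges, and \delta(c_i, c_j) = 1 when nodes i and j are assigned to the same community (0 otherwise).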