
Extracting Key Terms From Noisy and Multi-theme Documents


Published in: Technology, Education

  1. Extracting Key Terms From Noisy and Multi-theme Documents
     Maria Grineva, Maxim Grinev and Dmitry Lizorkin
     Institute for System Programming of RAS
  2. Outline
     • Key terms extraction: traditional approaches and applications
     • Using Wikipedia as a knowledge base for Natural Language Processing
     • Main techniques of our approach:
         • Wikipedia-based semantic relatedness
         • Network analysis algorithm to detect community structure in networks
     • Our method
     • Experimental evaluation
  3. Key Terms Extraction
     • Basic step for various NLP tasks:
         • document classification
         • document clustering
         • text summarization
         • inferring a more general topic of a text document
     • Core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match
         • Web pages are typically noisy (sidebars/menus, comments, future announcements, etc.)
         • Multi-theme Web pages must be handled (portal home pages, etc.)
  4. Approaches to Key Terms Extraction
     • Based on statistical learning:
         • use features such as a frequency criterion (TFxIDF model), keyphrase frequency, and the distance between terms normalized by the number of words in the document (KEA)
         • compute statistical features over the Wikipedia corpus (Wikify!)
         • require a training set
     • Based on analyzing syntactic or semantic term relatedness within a document:
         • compute semantic relatedness between terms (using, for example, Wikipedia)
         • model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)
         • no training set required
  5. Using Wikipedia as a Knowledge Base for Natural Language Processing
     • Wikipedia: a free, open encyclopedia
         • Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)
         • It is kept up to date by millions of editors around the world
         • It has a huge network of cross-references between articles and a large number of categories, redirect pages and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks
  6. Basic Techniques of Our Method: Semantic Relatedness of Terms
     • Semantic relatedness assigns a score to a pair of terms that represents the strength of relatedness between them
     • We use Wikipedia to compute the semantic relatedness of terms
     • We use semantic relatedness to model a document as a graph of terms
  7. Basic Techniques of Our Method: Semantic Relatedness of Terms
     • Wikipedia-based semantic relatedness of two terms can be computed using:
         • the links found within their corresponding Wikipedia articles
         • the Wikipedia category structure
         • the articles' textual content
     • We use the Dice measure for Wikipedia-based semantic relatedness
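As an illustration of the Dice measure named on this slide, here is a minimal Python sketch that scores two terms by the overlap of the link sets of their Wikipedia articles. The slides do not give the exact formula or any link weighting, so the unweighted set-based form, the function name `dice_relatedness` and the sample link sets are all assumptions made for this example.

```python
def dice_relatedness(links_a, links_b):
    """Dice coefficient between two sets of linked article titles.

    Unweighted variant: 2 * |A intersect B| / (|A| + |B|).
    """
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # no links at all: treat the terms as unrelated
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical link sets, invented for illustration only.
links_apple = {"Steve Jobs", "iTunes Store", "Macintosh", "iPod"}
links_itunes = {"Apple Inc.", "iPod", "Steve Jobs", "Music"}

score = dice_relatedness(links_apple, links_itunes)
print(score)  # two shared links out of 4 + 4 gives 0.5
```

The score is symmetric and lies between 0 (no shared links) and 1 (identical link sets), which makes it convenient as an edge weight in a graph of terms.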
  8. Basic Techniques of Our Method: Detecting Community Structure in Networks
     • We discover communities of terms in a document graph
     • Community: a densely interconnected group of nodes in a network
     • Girvan-Newman algorithm for detecting community structure in networks:
         • betweenness: how much an edge lies "in between" different communities
         • modularity: a partition is good if there are many edges within communities and only a few between them
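The modularity criterion described on this slide can be sketched in a few lines of Python. This is only an illustration for an unweighted, undirected graph using the standard Newman-Girvan modularity formula; it does not implement the edge-betweenness removal step of the algorithm, and the function name and the toy graph are assumptions made for this example.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an unweighted graph.

    Q = sum over communities of (inner_edges / m) - (degree_sum / (2m))^2,
    where m is the total number of edges.
    """
    m = len(edges)
    community_of = {node: i for i, c in enumerate(communities) for node in c}
    inner = [0] * len(communities)   # edges with both ends in the community
    degree = [0] * len(communities)  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inner[cu] += 1
    return sum(inner[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Toy graph: two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

q_good = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_bad = modularity(edges, [{"a", "b"}, {"c", "d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(q_good, q_bad, q_single)
```

Splitting the toy graph along the bridging edge scores higher than the other two partitions, which matches the intuition on the slide: many edges inside communities and few between them.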
  9. Our Method
     • Candidate terms extraction
     • Word sense disambiguation
     • Building semantic graph
     • Discovering community structure of the semantic graph
     • Selecting valuable communities
  10. Our Method: Candidate Terms Extraction
     • Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning
     • Parse the input document and extract all possible n-grams
     • For each n-gram (and its morphological variations), provide a set of Wikipedia article titles
         • "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
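The candidate extraction step can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the tokenizer, the maximum n-gram length and the tiny `TITLE_INDEX` lookup table (standing in for a real index from surface forms and their morphological variations to Wikipedia article titles) are all assumptions made for this example.

```python
import re


def extract_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, lowercased."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


# Hypothetical index: surface form -> candidate Wikipedia article titles.
# A real system would build this from article titles, redirects and
# morphological normalization; these entries are invented.
TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "drinks": ["Drink", "Drinking"],
    "drinking": ["Drink", "Drinking"],
}


def candidate_articles(text, index, max_n=3):
    """Map every n-gram found in the index to its candidate article titles."""
    candidates = {}
    for ngram in extract_ngrams(text, max_n):
        if ngram in index:
            candidates[ngram] = index[ngram]
    return candidates


print(candidate_articles("Soft drinks and drinking water", TITLE_INDEX))
```

N-grams that match nothing in the index are simply dropped, so the output contains only terms that have at least one candidate article to pass on to word sense disambiguation.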
  11. Our Method: Word Sense Disambiguation
     • Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
     • Use Wikipedia disambiguation and redirect pages to obtain candidate meanings of ambiguous terms
     • Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation". SYRCoDIS, 2008
  12. Our Method: Building Semantic Graph
     • Goal: build a semantic graph of the document using semantic relatedness between terms
     Figure: semantic graph built from the news article "Apple to Make ITunes More Accessible For the Blind"
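A rough Python sketch of the graph-building step: each disambiguated term becomes a vertex, and two terms are connected by an edge weighted with their semantic relatedness. The unweighted Dice score over link sets, the `threshold` parameter and the sample link sets are assumptions made for this example; the slides do not state how weak edges are filtered.

```python
from itertools import combinations


def build_semantic_graph(term_links, threshold=0.0):
    """Build a weighted graph as a dict {(term_a, term_b): weight}.

    term_links maps each term to the set of articles its Wikipedia
    article links to. Pairs scoring at or below `threshold` get no edge.
    """
    graph = {}
    for a, b in combinations(sorted(term_links), 2):
        links_a, links_b = term_links[a], term_links[b]
        total = len(links_a) + len(links_b)
        # Unweighted Dice coefficient as the relatedness score.
        weight = 2.0 * len(links_a & links_b) / total if total else 0.0
        if weight > threshold:
            graph[(a, b)] = weight
    return graph


# Hypothetical link sets, invented for illustration only.
term_links = {
    "Apple Inc.": {"iPod", "iTunes Store", "Steve Jobs"},
    "iTunes": {"iPod", "iTunes Store", "Music"},
    "Blindness": {"Braille", "Visual impairment"},
}

graph = build_semantic_graph(term_links)
print(graph)
```

In this toy input only "Apple Inc." and "iTunes" share links, so the graph has a single edge and "Blindness" stays isolated.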
  13. Our Method: Detecting Community Structure of the Semantic Graph
  14. Our Method: Selecting Valuable Communities
     • Goal: rank term communities so that:
         • the highest-ranked communities contain key terms
         • the lowest-ranked communities contain unimportant terms and possible disambiguation mistakes
     • Use:
         • density of a community: the sum of the inner edges of the community divided by the number of vertices in the community
         • informativeness: the sum of the keyphraseness measure (a Wikipedia-based TFxIDF analogue) over the community's terms
     • Community rank: density * informativeness
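The ranking formula on this slide can be illustrated with a short Python sketch. It reads "sum of inner edges" as the sum of inner edge weights, which is an assumption, and the function name, the edge weights and the keyphraseness values are invented for this example.

```python
def community_rank(members, edge_weights, keyphraseness):
    """Rank of one community: density * informativeness.

    density         = sum of weights of edges inside the community
                      divided by the number of vertices in it
    informativeness = sum of keyphraseness over the community's terms
    """
    members = set(members)
    inner = sum(w for (u, v), w in edge_weights.items()
                if u in members and v in members)
    density = inner / len(members)
    informativeness = sum(keyphraseness.get(t, 0.0) for t in members)
    return density * informativeness


# Hypothetical graph and keyphraseness scores.
edge_weights = {("a", "b"): 0.5, ("b", "c"): 0.5, ("c", "d"): 0.1}
keyphraseness = {"a": 0.4, "b": 0.3, "c": 0.3, "d": 0.05}

rank_abc = community_rank({"a", "b", "c"}, edge_weights, keyphraseness)
rank_d = community_rank({"d"}, edge_weights, keyphraseness)
print(rank_abc, rank_d)
```

A single-term community such as {"d"} has no inner edges, so its density and therefore its rank is zero, which pushes isolated noise terms to the bottom of the ranking.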
  15. Our Method: Selecting Valuable Communities
     • In 73% of web pages, a decline in community scores separates key-term communities from unimportant ones
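One possible way to exploit such a decline is to sort the community scores and cut at the largest gap between neighbouring scores. The slide does not say how the decline is detected, so the largest-drop rule, the function name and the sample scores below are assumptions made purely for illustration.

```python
def split_at_largest_drop(scores):
    """Split scores into (kept, discarded) at the largest drop.

    Scores are sorted in descending order; the cut is placed after
    the position where consecutive scores differ the most.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked, []
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1
    return ranked[:cut], ranked[cut:]


# Hypothetical community ranks.
kept, discarded = split_at_largest_drop([0.2, 0.9, 0.75, 0.1, 0.8])
print(kept, discarded)
```

With these sample values the largest gap lies between 0.75 and 0.2, so the three top-scoring communities are kept and the remaining two are discarded.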
  16. Advantages of the Method
     • No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
     • Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages
     • Thematically grouped key terms. These significantly improve further inference of document topics using, for example, spreading activation over the Wikipedia category graph
     • High accuracy. Evaluated using human judgments (covered later in this presentation)
  17. Experimental Evaluation on Noise-free Dataset
     • Classical: TFxIDF, Yahoo! Terms Extractor
     • Wikipedia-based: Wikify!, TextRank
     • Evaluation on a noise-free dataset (blog posts) using human judgment
  18. Experimental Evaluation on Web Pages
     • Comparison to other methods
     • Performance of our method on different kinds of Web pages
  19. Experimental Evaluation on Web Pages
     • Multi-theme stability evaluated on compound Web pages (popular news sites, portal home pages, etc.)
  20. Thank You! Any Questions?
     Email: [email_address] [email_address] [email_address]