Generating Links by Mining Quotations

  1. Generating Links by Mining Quotations
     Okan Kolak and Bill N. Schilit
     Presentation by Dustin Smith, The University of Texas at Austin, School of Information
  2. Outline
     • Introduction
     • Challenges
     • Algorithm
       • Phase 1: Generating the Shingle Table
       • Phase 2: Extracting Shared Sequences
       • Phase 3: Sequence Grouping
       • Filtering and Ranking
     • User Interface
     • Evaluation
     INF384H 10/24/2011
  3. Introduction: What is the goal and why?
     • An engaging user interface in Google Books
     • Richer hypertext for scanned books
     • Achieving these goals at scale for large sets of books
       • Via MapReduce
  4. Challenges
     • Mining quality quotations from millions of books in a scalable and efficient manner.
     • Filtering out misleading quotations and ranking the good quotations by quality.
     • Incorporating the proposed link structure online in a way that is clear and effective for users.
  5. Algorithm: Phase 1
     Generation of shingle tables:
     • Pass the text through the shingler
     • The text is parsed, normalized, and output as a stream of overlapping shingles
     • Generate a shingle table
  6. Algorithm: Phase 1 (cont)
     • Each book is passed through the shingler.
     • A k-shingle is a run of k consecutive words of text.
       • Example: the 2-shingles for the text “a lucky dog” are “a lucky” and “lucky dog”.
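The shingling step can be illustrated with a minimal sketch. The function name `shingles` and the whitespace tokenization are assumptions made for illustration; the slides do not say how the shingler is implemented.

```python
def shingles(text: str, k: int) -> list[str]:
    """Return the overlapping k-word shingles of `text`, in order of appearance."""
    words = text.split()  # assumption: tokens are whitespace-separated words
    # A text with fewer than k words yields no shingles at all.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


print(shingles("a lucky dog", 2))  # ['a lucky', 'lucky dog']
```

A text of n words produces n - k + 1 shingles, so the shingle stream is roughly as long as the book itself.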
  7. Algorithm: Phase 1 (cont)
     Prior to shingling, the text is parsed and normalized. Possible normalizations:
     • Lowercasing
     • Removing punctuation and accents
     • Stemming
     • Removing stop-words
     • Collapsing numbers to single tokens
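A rough sketch of most of these normalizations follows. The stop-word list, the `<num>` placeholder, and the function name `normalize` are hypothetical, stemming is left out, and the ASCII-only token pattern is a simplification that drops non-Latin text.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def normalize(text: str) -> list[str]:
    """Lowercase, strip accents and punctuation, collapse numbers, drop stop-words."""
    text = text.lower()
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Collapse every run of digits to one placeholder token.
    text = re.sub(r"\d+", " <num> ", text)
    # Keep placeholders and alphabetic words; punctuation falls away here.
    tokens = re.findall(r"<num>|[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


print(normalize("The Café opened in 1999, and closed."))
# ['cafe', 'opened', 'in', '<num>', 'closed']
```

Normalizing before shingling means two books that quote the same passage with different capitalization or punctuation still produce identical shingles.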
  8. Algorithm: Phase 1 (cont)
     Shingle Tables

     Key               Shingle info    Shingle info
     Shingle key(1)    <B,i>           <B,i>
     Shingle key(2)    <B,i>           <B,i>

     • Shingle key: a unique shingle footprint
     • B: ID of the book in which the shingle occurs
     • i: index of the shingle within book B
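The table layout can be sketched as a mapping from shingle key to a list of <B,i> pairs. This is an in-memory illustration only: the truncated SHA-1 footprint and the names `shingle_key` and `build_shingle_table` are assumptions, and a dictionary stands in for the large sorting phase the presentation describes.

```python
import hashlib
from collections import defaultdict


def shingle_key(words: list[str]) -> str:
    """Footprint of one shingle (assumption: a truncated SHA-1 of its words)."""
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()[:16]


def build_shingle_table(books: dict[str, list[str]], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map each shingle key to every (book ID, shingle index) where it occurs."""
    table = defaultdict(list)
    for book_id, words in books.items():
        for i in range(len(words) - k + 1):
            table[shingle_key(words[i:i + k])].append((book_id, i))
    return table


books = {
    "B1": "a lucky dog ran".split(),
    "B2": "the lucky dog sat".split(),
}
table = build_shingle_table(books, k=2)
print(table[shingle_key(["lucky", "dog"])])  # [('B1', 1), ('B2', 1)]
```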
  9. Algorithm: Phase 1 (cont)
     Shingle Tables
     • Building the table requires a single linear pass and a very large sorting phase.
     • The authors observe that quotes shorter than 8 words are not significant quotations, so they set the shingle length to 8 words.
  10. Algorithm: Phase 2
      • Involves extracting shingles that are shared between books
      • Books are processed one at a time
        • Current book = “Source book”
        • All other books = “Target books”
  11. Algorithm: Phase 2 (cont)
      Process for a single book:
      • Generate a list of the shingles in the order that they appear
      • Take each shingle and use the shingle table to find all other occurrences
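One way to turn these per-shingle lookups into shared sequences is sketched below. This is an illustration written for this transcript, not the authors' pseudo-code: the function name, the tuple layout of the result, and the rule for extending a run (the next source shingle must match the next shingle position in the same target book) are all assumptions.

```python
def shared_sequences(source_id, source_keys, table):
    """Find maximal runs of consecutive shingles shared with other books.

    source_keys: shingle keys of the source book, in order of appearance.
    table: shingle key -> list of (book ID, shingle index).
    Returns (source start, target book, target start, run length in shingles).
    """
    active = {}   # (target book, last matched target index) -> (src start, tgt start)
    results = []
    for i, key in enumerate(source_keys):
        next_active = {}
        for book, j in table.get(key, []):
            if book == source_id:
                continue  # ignore the source book's own occurrences
            # Extend the run that ended at (book, j - 1), or start a new one.
            next_active[(book, j)] = active.pop((book, j - 1), (i, j))
        # Runs that were not extended ended at source position i - 1.
        for (book, _), (src_start, tgt_start) in active.items():
            results.append((src_start, book, tgt_start, i - src_start))
        active = next_active
    # Flush runs that reach the end of the source book.
    for (book, _), (src_start, tgt_start) in active.items():
        results.append((src_start, book, tgt_start, len(source_keys) - src_start))
    return results


table = {
    "a": [("S", 0), ("T", 5)],
    "b": [("S", 1), ("T", 6)],
    "c": [("S", 2)],
    "d": [("S", 3), ("T", 0)],
}
print(shared_sequences("S", ["a", "b", "c", "d"], table))
# [(0, 'T', 5, 2), (3, 'T', 0, 1)]
```

Here book S shares a two-shingle run with book T (positions 5 and 6 in T), followed later by an isolated one-shingle match.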
  12. Algorithm: Phase 2 (cont)
      Pseudo-code for Phase 2: [Figure 1, not reproduced in this transcript]
  13. Algorithm: Phase 2 (cont)
      MapReduce adaptation:
      • Mapper:
        • Start with the shingle table as input to the Mapper
        • Use the equivalent method for looking up all shingle buckets for a given book’s shingles
        • Emit (source book ID, relevant shingle bucket)
      • Reducer:
        • Input: (source book ID, list of relevant shingle buckets)
        • Use the algorithm from the previous slide (Figure 1) with a few modifications
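The data flow of this adaptation can be imitated in plain Python, without any MapReduce framework. Everything here is a hedged sketch: `mapper`, `shuffle`, and `reducer` are hypothetical names, and the reducer only lists which target books share a shingle with the source book rather than running the full Phase 2 algorithm.

```python
from collections import defaultdict


def mapper(key, bucket):
    """For one shingle bucket, emit (source book ID, bucket) per book it contains."""
    for book_id in sorted({book for book, _ in bucket}):
        yield book_id, (key, bucket)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for book_id, value in pairs:
        grouped[book_id].append(value)
    return grouped


def reducer(source_id, buckets):
    """Placeholder reducer: list the target books that share any shingle."""
    targets = {book for _, bucket in buckets for book, _ in bucket if book != source_id}
    return source_id, sorted(targets)


table = {
    "k1": [("A", 0), ("B", 3)],
    "k2": [("A", 1)],
    "k3": [("B", 4), ("C", 0)],
}
emitted = [pair for key, bucket in table.items() for pair in mapper(key, bucket)]
grouped = shuffle(emitted)
print([reducer(book, grouped[book]) for book in sorted(grouped)])
# [('A', ['B']), ('B', ['A', 'C']), ('C', ['B'])]
```

Each reducer call sees only the buckets relevant to one source book, which is what lets the books be processed independently.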
  14. Algorithm: Phase 2 (cont)
      One notable issue:
      • Common shingles that are shared by many books greatly increase overhead.
      • These are often insignificant quotes and should be discarded.
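Discarding overly common shingles could look like the sketch below. The cutoff `max_books` and the function name `prune_common` are assumptions; the slides do not state how "shared by many books" is decided.

```python
def prune_common(table, max_books):
    """Drop shingle buckets that occur in more than `max_books` distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book for book, _ in bucket}) <= max_books
    }


table = {
    "rare": [("A", 0), ("B", 7)],
    "common": [("A", 3), ("B", 1), ("C", 9), ("D", 2)],
}
print(sorted(prune_common(table, max_books=3)))  # ['rare']
```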
  15. Algorithm: Phase 3
      Sequence Grouping: Why?
  16. Algorithm: Phase 3 (cont)
      Sequence Grouping: How does it work?
  17. Filtering and Ranking
      The authors identify several kinds of shared passages that are not quotations: copyright sentences, legal boilerplate, publisher addresses, bibliography citations, and titles of other books by the author or publisher.
      • These are not desirable or quality quotations.
      • They need to be filtered out.
  18. Filtering and Ranking (cont)
      Filtering:
      • Quotations on “low content” pages
      • Unusual characteristic filtering
        • Too many digits or special characters, repeated tokens, etc.
      • Book edition filtering
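The unusual-characteristic filter might be approximated as follows. Both thresholds and the function name `looks_unusual` are hypothetical; the slides name the signals (digits, special characters, repeated tokens) but not how they are measured.

```python
def looks_unusual(passage: str, max_nonalpha_ratio: float = 0.3,
                  max_repeat_ratio: float = 0.5) -> bool:
    """Flag passages dominated by digits/special characters or repeated tokens."""
    chars = [c for c in passage if not c.isspace()]
    if not chars:
        return True  # nothing to quote
    # Share of non-whitespace characters that are not letters.
    nonalpha = sum(1 for c in chars if not c.isalpha())
    if nonalpha / len(chars) > max_nonalpha_ratio:
        return True
    # Share of tokens that repeat an earlier token.
    tokens = passage.lower().split()
    repeats = len(tokens) - len(set(tokens))
    return repeats / len(tokens) > max_repeat_ratio


print(looks_unusual("All human beings are born free and equal in dignity and rights"))  # False
print(looks_unusual("ISBN 0-123-45678-9 (c) 1999 12 34"))  # True
```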
  19. Filtering and Ranking (cont)
      Ranking:
      Some quotes are more interesting than others, e.g. “The unemployment rate is the percentage of the labor force that is unemployed” vs. “All human beings are born free and equal in dignity and rights…”
      • This is difficult to distinguish automatically.
  20. Filtering and Ranking (cont)
      Scoring method for ranking. Basically:
      • Passages that are too short or too long receive low scores.
      • The optimal length lies in the middle ground, and a piecewise function is used to represent this scoring.
      • What counts as “too short” and “too long” is determined by “experimental tuning”.
      • The same scoring method is used for frequency.
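A piecewise score of this shape can be sketched as a trapezoid. The linear ramps, the function name `piecewise_score`, and the breakpoint values in the example are assumptions; the slides say only that the function is piecewise and that the cutoffs were tuned experimentally.

```python
def piecewise_score(x: float, low: float, opt_low: float,
                    opt_high: float, high: float) -> float:
    """Score 0 outside (low, high), 1 on [opt_low, opt_high], linear in between."""
    if x <= low or x >= high:
        return 0.0
    if x < opt_low:
        return (x - low) / (opt_low - low)    # ramp up toward the optimal range
    if x <= opt_high:
        return 1.0
    return (high - x) / (high - opt_high)     # ramp down past the optimal range


# Hypothetical breakpoints, in words: 8, 20, 60, 200.
for length in (8, 14, 40, 130, 500):
    print(length, piecewise_score(length, 8, 20, 60, 200))
# 8 0.0
# 14 0.5
# 40 1.0
# 130 0.5
# 500 0.0
```

Because the slides state that frequency is scored the same way, the same function could be reused with a different set of breakpoints for the number of books that contain a passage.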
  21. User Interface
      How to present this concept of general links between books?
      • “Popular Passages”, not “Quotations”
      • Display issues:
        • Long quotes containing shorter, more familiar quotes
        • Quote order variations
      Skyline vectors are used to address these issues, and they do so effectively.
      • Basically, the “best” quotes are chosen for presentation to the user.
  22. User Interface (cont)
      Navigation within books
      • Goals:
        • Provide a general feel for the book
        • Provide an interface in which the user can quickly navigate to important passages within the book
  23. User Interface (cont)
      Navigation between books
  24. Evaluation
      • Manual labeling to determine accuracy
      • Passive user study over a 30-day period
      • Analysis of the distribution of link types within Google’s scanned books
  25. Evaluation (cont)
      Manual labeling:
      • Sampled 120 passages from low scores and 120 from high scores (to avoid precision bias).
      • Used a Likert scale of 1 to 5, with 1-2 meaning good, 3 meaning neutral, and 4-5 meaning bad.
      • Inter-annotator agreement was 88.5% (± 3.5% to account for neutral labels).
      • 88% were marked good.
  26. Evaluation (cont)
      User study:
      • Consisted of monitoring user activity in Google Books.
        • Specifically, whether users navigated via popular passages (Quotations), via links to other editions of the book (Editions), to other similar books within a cluster (Related), or to books that cite the current book (Cited By)
      • Results
  27. Evaluation (cont)
  28. Evaluation (cont)
      Coverage:
      • What is the distribution of these link types in scanned books?
  29. Related Work & Future Work
      Related Work
      • Automatic Hypertext
      • Plagiarism Detection
      Future Work
      • Improved Ranking
      • Incremental Processing
      • Primary Source Identification
      • Attribution
  30. Questions + Discussion
      The End. Questions & discussion. Go Rangers!
