Generating Links by Mining Quotations



  2. Outline
     • Introduction
     • Challenges
     • Algorithm
        • Phase 1: Generating the Shingle Table
        • Phase 2: Extracting Shared Sequences
        • Phase 3: Sequence Grouping
     • Filtering and Ranking
     • User Interface
     • Evaluation
     INF384H 10/24/2011
  3. Introduction
     What is the goal and why?
     • Engaging user interface in Google Books
     • Richer hypertext for scanned books
     • Achieving these goals at scale for large sets of books
        • Via MapReduce
  4. Challenges
     • Mining quality quotations from millions of books in a scalable and efficient manner.
     • Filtering out misleading quotations and ranking the good quotations by quality.
     • Incorporating the proposed link structure online in a way that is clear and effective for users.
  5. Algorithm: Phase 1
     Generation of shingle tables:
     • Pass text through the shingler
     • Text is parsed, normalized, and output as a stream of overlapping shingles
     • Generate a shingle table
  6. Algorithm: Phase 1 (cont)
     • Each book is passed through the shingler.
     • A k-shingle is a sequence of k consecutive words of text.
     • Ex.: the 2-shingles of the text “a lucky dog” are “a lucky” and “lucky dog”.
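To make the shingle definition concrete, here is a minimal sketch of a word-level shingler. The function name `shingles` is an illustrative choice, not something from the slides.

```python
def shingles(words, k):
    """Return every overlapping k-word shingle of a token list, in order."""
    # A text with fewer than k words yields no shingles at all.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


print(shingles("a lucky dog".split(), 2))  # ['a lucky', 'lucky dog']
```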
  7. Algorithm: Phase 1 (cont)
     Prior to shingling, the text is parsed and normalized. Possible normalizations:
     • Lowercasing
     • Removing punctuation and accents
     • Stemming
     • Removing stop-words
     • Collapsing numbers to single tokens
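A rough sketch of what such a normalization pass could look like, using only the Python standard library. The stop-word list and the `<num>` placeholder are assumptions made for illustration, and stemming is left out because it needs a stemmer that the slides do not name.

```python
import re
import unicodedata

# Illustrative stop-word list only; the slides do not say which list is used.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and"})


def normalize(text, drop_stop_words=False):
    """Lowercase, strip accents and punctuation, and collapse numbers."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces so that words stay separated.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Collapse every all-digit token to one placeholder token (assumed name).
    tokens = ["<num>" if tok.isdigit() else tok for tok in tokens]
    if drop_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return tokens


print(normalize("The Café, est. 1999!"))  # ['the', 'cafe', 'est', '<num>']
```

The output of `normalize` is a token list that can be fed directly to a shingler.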
  8. Algorithm: Phase 1 (cont)
     Shingle Tables

     Key              Shingle info   Shingle info
     Shingle key(1)   <B,i>          <B,i>
     Shingle key(2)   <B,i>          <B,i>

     • Shingle key: a unique shingle footprint
     • B: ID of the book in which the shingle occurs
     • i: index of the shingle within book B
  9. Algorithm: Phase 1 (cont)
     Shingle Tables
     • Building the table requires a single linear pass and a very large sorting phase.
     • The authors observe that quotes shorter than 8 words are not significant quotations, so they set the shingle length to 8 words.
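A small in-memory sketch of a shingle table. It assumes the books fit in memory and uses the shingle's word tuple directly as the key; at scale the slides describe a large sort instead, and a compact footprint would presumably stand in for the tuple. The name `build_shingle_table` is hypothetical.

```python
from collections import defaultdict


def build_shingle_table(books, k=8):
    """Map each shingle key to the list of <B, i> pairs where it occurs.

    books: dict of book ID -> list of normalized tokens.
    """
    table = defaultdict(list)
    for book_id, tokens in books.items():
        # One linear pass over each book; grouping by key replaces the sort.
        for i in range(len(tokens) - k + 1):
            key = tuple(tokens[i:i + k])  # assumption: tuple used as the key
            table[key].append((book_id, i))
    return table


books = {"A": "a b c d".split(), "B": "x b c y".split()}
table = build_shingle_table(books, k=2)
print(table[("b", "c")])  # [('A', 1), ('B', 1)]
```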
  10. Algorithm: Phase 2
     • Involves extracting shingles that are shared between books
     • Books are processed one at a time
        • Current book = “Source book”
        • All other books = “Target books”
  11. Algorithm: Phase 2 (cont)
     Process for a single book:
     • Generate a list of shingles in the order that they appear
     • Take each shingle and use the shingle table to find all other occurrences
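The per-book process can be sketched as follows. This is not the pseudo-code from the slides (that figure is not part of this transcript); it is one plausible way to merge consecutive shared shingles into longer shared sequences, and all names are illustrative.

```python
from collections import defaultdict


def build_table(books, k):
    """Map each k-word shingle to the (book_id, index) pairs where it occurs."""
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            table[tuple(tokens[i:i + k])].append((book_id, i))
    return table


def shared_sequences(source_id, books, table, k):
    """Return (source_start, target_book, target_start, n_shingles) runs."""
    tokens = books[source_id]
    open_runs = {}  # (target_book, index of last matched shingle) -> run
    finished = []
    for s in range(len(tokens) - k + 1):  # source shingles, in order
        key = tuple(tokens[s:s + k])
        still_open = {}
        for book, i in table.get(key, []):
            if book == source_id:
                continue  # only target books are of interest
            # Extend a run that ended at the previous target shingle,
            # otherwise start a new run here.
            run = open_runs.pop((book, i - 1), None)
            if run is None:
                run = [s, book, i, 0]
            run[3] += 1
            still_open[(book, i)] = run
        finished.extend(open_runs.values())  # runs that could not be extended
        open_runs = still_open
    finished.extend(open_runs.values())
    return [tuple(run) for run in finished]


books = {"A": "q a b c d z".split(), "B": "x a b c d y".split()}
table = build_table(books, k=2)
print(shared_sequences("A", books, table, k=2))  # [(1, 'B', 1, 3)]
```

In the usage example, three consecutive 2-word shingles match, which corresponds to the four shared words “a b c d”.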
  12. Algorithm: Phase 2 (cont)
     Pseudo-code for Phase 2: [Figure 1, not included in this transcript]
  13. Algorithm: Phase 2 (cont)
     MapReduce adaptation:
     • Mapper:
        • Start with the shingle table as input to the Mapper
        • Use the equivalent method for looking up all shingle buckets for a given book’s shingles
        • Emit (source book ID, relevant shingle bucket)
     • Reducer:
        • Input: (source book ID, list of relevant shingle buckets)
        • Use the algorithm from the previous slide (Figure 1) with a few modifications
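The map and reduce steps can be simulated in plain Python to show the data flow. This is a single-process sketch, not a real MapReduce job: `shuffle` stands in for the framework's grouping step, skipping single-book buckets is an added assumption, and the reducer only rebuilds the ordered input that the Phase 2 algorithm would consume.

```python
from collections import defaultdict


def mapper(shingle_key, bucket):
    """Emit (source book ID, shingle bucket) once per book in the bucket."""
    book_ids = sorted({book for book, _ in bucket})
    if len(book_ids) < 2:
        return  # assumption: a bucket with one book holds nothing shared
    for book_id in book_ids:
        yield book_id, (shingle_key, bucket)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(source_id, buckets):
    """Order the source book's shingles and list their target occurrences."""
    by_position = {}
    for _key, bucket in buckets:
        targets = [(book, i) for book, i in bucket if book != source_id]
        for book, i in bucket:
            if book == source_id:
                by_position[i] = targets
    return [(pos, by_position[pos]) for pos in sorted(by_position)]


table = {("a", "b"): [("A", 1), ("B", 1)], ("q", "a"): [("A", 0)]}
pairs = [p for key, bucket in table.items() for p in mapper(key, bucket)]
grouped = shuffle(pairs)
print(reducer("A", grouped["A"]))  # [(1, [('B', 1)])]
```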
  14. Algorithm: Phase 2 (cont)
     One notable issue:
     • Common shingles that are shared by many books will greatly increase overhead.
     • These are often insignificant quotes and should be discarded.
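One simple way to discard such shingles is to drop any bucket that spans too many distinct books. The cutoff `max_books` is a hypothetical tuning parameter; the slides do not give a value.

```python
def drop_common_shingles(table, max_books):
    """Keep only the buckets whose shingle occurs in at most max_books books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book for book, _ in bucket}) <= max_books
    }


table = {
    ("all", "rights", "reserved"): [("A", 0), ("B", 0), ("C", 0), ("D", 0)],
    ("born", "free", "and"): [("A", 7), ("C", 12)],
}
print(list(drop_common_shingles(table, max_books=3)))
# [('born', 'free', 'and')]
```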
  15. Algorithm: Phase 3
     Sequence Grouping: Why?
  16. Algorithm: Phase 3 (cont)
     Sequence Grouping: How does it work?
  17. Filtering and Ranking
     The authors identify certain repeated phrases as noise: copyright sentences, legal boilerplate, publisher addresses, bibliography citations, and titles of other books by the author or publisher.
     • These are not desirable or quality quotations.
     • They need to be filtered out.
  18. Filtering and Ranking (cont)
     Filtering:
     • Quotations on “low content” pages
     • Unusual characteristic filtering
        • Too many digits or special characters, repeated tokens, etc.
     • Book edition filtering
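As an illustration of the “unusual characteristic” filter, the sketch below flags passages with too many digits or special characters, or with one token repeated too often. Both thresholds and the function name are assumptions; the slides do not specify the actual rules.

```python
from collections import Counter


def looks_unusual(passage, max_nonalpha_ratio=0.3, max_repeat_ratio=0.5):
    """Flag passages dominated by digits, symbols, or one repeated token."""
    tokens = passage.split()
    if not tokens:
        return True
    chars = [c for c in passage if not c.isspace()]
    # Share of characters that are digits or special characters.
    nonalpha = sum(1 for c in chars if not c.isalpha())
    if nonalpha / len(chars) > max_nonalpha_ratio:
        return True
    # Share of the passage taken up by its most frequent token.
    top_count = Counter(tok.lower() for tok in tokens).most_common(1)[0][1]
    return top_count / len(tokens) > max_repeat_ratio


print(looks_unusual("Tel. 555-0100 ext. 42, fax 555-0199"))  # True
print(looks_unusual("All human beings are born free and equal"))  # False
```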
  19. Filtering and Ranking (cont)
     Ranking:
     Some quotes are more interesting than others, e.g.:
     “The unemployment rate is the percentage of the labor force that is unemployed” vs. “All human beings are born free and equal in dignity and rights…”
     • This is difficult to distinguish automatically
  20. Filtering and Ranking (cont)
     Scoring method for ranking
     Basically:
     Passages that are too short or too long receive low scores. The optimal length lies in the middle ground, and a piecewise function is used to represent this scoring.
     • What defines “too short” and “too long” is determined by “experimental tuning”
     • The same scoring method is used for frequency
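A piecewise score of this kind might look like the trapezoid below: zero outside the acceptable range, a linear ramp on each side, and a flat top over the optimal range. The shape and all four breakpoints are assumptions for illustration; the slides only say the real values come from experimental tuning.

```python
def piecewise_score(x, too_low, best_low, best_high, too_high):
    """Trapezoid-shaped score in [0, 1] for a length or a frequency x."""
    if x <= too_low or x >= too_high:
        return 0.0  # too short or too long
    if x < best_low:
        return (x - too_low) / (best_low - too_low)  # ramp up
    if x <= best_high:
        return 1.0  # optimal middle ground
    return (too_high - x) / (too_high - best_high)  # ramp down


# Hypothetical breakpoints, measured in words.
for length in (5, 14, 40, 130, 300):
    print(length, piecewise_score(length, 8, 20, 60, 200))
```

Because the function takes only a number and four breakpoints, the same code could score frequency with a different set of breakpoints.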
  21. User Interface
     How to present this concept of general links between books?
     • “Popular Passages”, not “Quotations”
     Display issues:
     • Long quotes containing shorter, more familiar quotes
     • Quote order variations
     Skyline vectors are used to address these issues, and do so effectively.
     • Basically, the “best” quotes are chosen for presentation to the user
  22. User Interface (cont)
     Navigation within books
     • Goals:
        • Provide a general feel for the book
        • Provide an interface in which the user can quickly navigate to important passages within the book
  23. User Interface (cont)
     Navigation between books
  24. Evaluation
     • Manual labeling to determine accuracy
     • Passive user study over a 30-day period
     • Analysis of the distribution of link types within Google’s scanned books
  25. Evaluation (cont)
     Manual labeling:
     • Sampled 120 passages from low scores and 120 from high scores (to avoid precision bias).
     • Used a Likert scale of 1 to 5, with 1-2 meaning good, 3 meaning neutral, and 4-5 meaning bad.
     • Inter-annotator agreement was 88.5% (± 3.5% to account for neutral labels)
     • 88% were marked good
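The slide does not say how agreement was computed. As one plausible reading, the sketch below maps each Likert score to good, neutral, or bad and reports simple percent agreement between two annotators. The scores in the usage example are made up to show the arithmetic and are not data from the study.

```python
def label(score):
    """Map a 1-5 Likert score to the slide's three categories."""
    if score in (1, 2):
        return "good"
    if score == 3:
        return "neutral"
    if score in (4, 5):
        return "bad"
    raise ValueError(f"not a Likert score: {score!r}")


def percent_agreement(scores_a, scores_b):
    """Percentage of passages on which two annotators give the same label."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    same = sum(label(a) == label(b) for a, b in zip(scores_a, scores_b))
    return 100.0 * same / len(scores_a)


# Made-up scores for four passages.
print(percent_agreement([1, 2, 3, 5], [2, 1, 4, 4]))  # 75.0
```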
  26. Evaluation (cont)
     User study:
     • Consisted of monitoring user activity in Google Books.
        • Specifically, whether users navigated via popular passages (Quotations); via links to other editions of the book (Editions); to other similar books within a cluster (Related); or to books that cite the current book (Cited By)
     • Results
  27. Evaluation (cont)
  28. Evaluation (cont)
     Coverage:
     • What is the distribution of these link types in scanned books?
  29. Related Work & Future Work
     Related Work
     • Automatic Hypertext
     • Plagiarism Detection
     Future Work
     • Improved Ranking
     • Incremental Processing
     • Primary Source Identification
     • Attribution
  30. Questions + Discussion
     The End. Questions & discussion. Go Rangers!