Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)
Transcript

  • 1. Data-Intensive Computing for Text Analysis. CS395T / INF385T / LIN386M, University of Texas at Austin, Fall 2011. Lecture 6, September 29, 2011. Jason Baldridge (Department of Linguistics, jasonbaldridge at gmail dot com) and Matt Lease (School of Information, ml at ischool dot utexas dot edu), University of Texas at Austin.
  • 2. Acknowledgments. Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park. Some figures courtesy of the following excellent Hadoop books (order yours today!): Chuck Lam’s Hadoop In Action (2010); Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010).
  • 3. Today’s Agenda: Automatic Spelling Correction – Review: Information Retrieval (IR) (Boolean Search, Vector Space Modeling, Inverted Indexing in MapReduce) – Probabilistic modeling via noisy channel; Index Compression – Order inversion in MapReduce; In-class exercise; Hadoop: Pipelined & Chained jobs
  • 4. Automatic Spelling Correction
  • 5. Automatic Spelling Correction. Three main stages: error detection; candidate generation; candidate ranking / choose best candidate. Usage cases: flagging possible misspellings / spell checker; suggesting possible corrections; automatically correcting (inferred) misspellings (“as you type” correction, web queries, real-time closed captioning, …)
  • 6. Types of spelling errors. Unknown words: “She is their favorite acress in town.” Can be identified using a dictionary… but could be a valid word not in the dictionary. The dictionary could be automatically constructed from large corpora: filter out rare words (misspellings, or valid but unlikely)… why filter out rare words that are valid? Unknown words violating phonotactics: e.g. “There isn’t enough room in this tonw for the both of us.” Given a dictionary, could automatically construct an “n-gram dictionary” of all character n-grams known in the language; e.g. English words don’t end with “nw”, so flag tonw. Incorrect homophone: “She drove their.” Valid word, wrong usage; infer appropriateness from context. Typing errors reflecting kayout of leyboard.
  • 7. Candidate generation. How do we generate possible corrections for acress? Inspiration: how do people do it? People may suggest words like actress, across, access, acres, caress, and cress: what do these have in common? What about “blam” and “zigzag”? Two standard strategies for candidate generation. Minimum edit distance: generate all candidates within 1+ edit step(s); possible edit operations are insertion, deletion, substitution, transposition, …; filter through a dictionary; see Peter Norvig’s post: http://norvig.com/spell-correct.html. Character ngrams: see next slide…
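The edit-distance strategy is easy to sketch. Below is a minimal Java sketch of Norvig-style candidate generation, assuming a small in-memory dictionary; all class and variable names here are ours, for illustration only. It enumerates every string one edit away from the typo and keeps those found in the dictionary.

    import java.util.*;

    public class EditCandidates {
        static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

        // All strings within one insertion, deletion, substitution, or transposition.
        static Set<String> edits1(String w) {
            Set<String> out = new HashSet<>();
            for (int i = 0; i <= w.length(); i++) {
                if (i < w.length())                       // deletion
                    out.add(w.substring(0, i) + w.substring(i + 1));
                if (i < w.length() - 1)                   // transposition of adjacent chars
                    out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
                for (char c : ALPHABET.toCharArray()) {
                    if (i < w.length())                   // substitution
                        out.add(w.substring(0, i) + c + w.substring(i + 1));
                    out.add(w.substring(0, i) + c + w.substring(i));   // insertion
                }
            }
            return out;
        }

        // Keep only candidates that appear in the dictionary.
        static Set<String> candidates(String typo, Set<String> dictionary) {
            Set<String> result = new HashSet<>();
            for (String e : edits1(typo))
                if (dictionary.contains(e)) result.add(e);
            return result;
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<>(Arrays.asList(
                    "actress", "across", "access", "acres", "caress", "cress"));
            System.out.println(candidates("acress", dict));   // all six words are one edit away
        }
    }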
  • 8. Character ngram Spelling Correction. Information Retrieval (IR) model: query = typo word; document collection = dictionary (i.e. set of valid words); representation: a word is a set of character ngrams. Let’s use n=3 (trigrams), with # to mark word start/end. Examples: across: [#ac, acr, cro, oss, ss#]; acress: [#ac, acr, cre, res, ess, ss#]; actress: [#ac, act, ctr, tre, res, ess, ss#]; blam: [#bl, bla, lam, am#]; mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]. Uhm, IR model??? Review…
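A minimal sketch of the boundary-padded character-ngram representation used in these examples; the class and method names are ours, not from the slides.

    import java.util.*;

    public class CharNgrams {
        // Character n-grams of a word, padded with '#' at start and end.
        static List<String> ngrams(String word, int n) {
            String padded = "#" + word + "#";
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= padded.length(); i++)
                grams.add(padded.substring(i, i + n));
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(ngrams("acress", 3));   // [#ac, acr, cre, res, ess, ss#]
            System.out.println(ngrams("actress", 3));  // [#ac, act, ctr, tre, res, ess, ss#]
        }
    }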
  • 9. Abstract IR Architecture (figure): queries (online) and documents (offline) each pass through a representation function; a comparison function matches the query representation against the document representations stored in the index and produces the results.
  • 10. Document → Boolean “Bag of Words” Representation (example): a news story (“McDonalds slims down spuds: Fast-food chain to reduce certain types of fat in its french fries with new cooking oil…”, CNN/Money) is reduced to the set of words it contains: McDonalds, fat, fries, new, french, Company, Said, nutrition, …
  • 11. Boolean Retrieval (figure): a toy collection, Doc 1 through Doc 4, whose documents contain the terms dogs, dolphins, and football.
  • 12. Inverted Index: Boolean Retrieval (example). Doc 1: “one fish, two fish”; Doc 2: “red fish, blue fish”; Doc 3: “cat in the hat”; Doc 4: “green eggs and ham”. Postings: blue → [2]; cat → [3]; egg → [4]; fish → [1, 2]; green → [4]; ham → [4]; hat → [3]; one → [1]; red → [2]; two → [1]
  • 13. Inverted Indexing via MapReduce (figure). Doc 1: “one fish, two fish”; Doc 2: “red fish, blue fish”; Doc 3: “cat in the hat”. Map emits (term, docid) pairs: one→1, two→1, fish→1; red→2, blue→2, fish→2; cat→3, hat→3. Shuffle and sort: aggregate values by keys. Reduce produces postings: cat→[3], blue→[2], fish→[1,2], hat→[3], one→[1], two→[1], red→[2]
  • 14. Inverted Indexing in MapReduce
    class Mapper
      procedure Map(docid n, doc d)
        H = new Set
        for all term t in doc d do
          H.add(t)
        for all term t in H do
          Emit(term t, n)

    class Reducer
      procedure Reduce(term t, Iterator<integer> docids [n1, n2, …])
        List P = docids.values()
        Emit(term t, P)
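For concreteness, here is a rough Java rendering of this pseudocode against the old Hadoop API used later in the lecture. The class names, the whitespace tokenizer, and the space-separated postings output are our simplifications for illustration, not part of the original pseudocode.

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Input is assumed to be (docid, document text); with TextInputFormat the
    // byte offset serves as a stand-in docid for a small demo.
    public class InvertedIndex {

        public static class IndexMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, LongWritable> {
            public void map(LongWritable docid, Text doc,
                            OutputCollector<Text, LongWritable> output, Reporter reporter)
                    throws IOException {
                Set<String> seen = new HashSet<String>();            // H = new Set
                for (String term : doc.toString().split("\\s+"))
                    seen.add(term.toLowerCase());
                for (String term : seen)                             // emit each term once per doc
                    output.collect(new Text(term), docid);
            }
        }

        public static class IndexReducer extends MapReduceBase
                implements Reducer<Text, LongWritable, Text, Text> {
            public void reduce(Text term, Iterator<LongWritable> docids,
                               OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                StringBuilder postings = new StringBuilder();        // buffers all docids:
                while (docids.hasNext())                             // the scalability issue
                    postings.append(docids.next().get()).append(' ');// noted on the next slide
                output.collect(term, new Text(postings.toString().trim()));
            }
        }
    }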
  • 15. Scalability Bottleneck. Desired output format: <term, [doc1, doc2, …]>; just emitting each <term, docID> pair won’t produce this. How do we produce this without buffering? Side-effect: write directly to HDFS instead of emitting. Complications? Persistent data must be cleaned up if the reducer is restarted…
  • 16. Using the Inverted Index. Boolean Retrieval: to execute a Boolean query, build the query syntax tree, e.g. (blue AND fish) OR ham. For each clause, look up the postings: blue → [2], fish → [1, 2]. Traverse the postings and apply the Boolean operator. Efficiency analysis: start with the shortest postings list first; postings traversal is linear (if postings are sorted). Oops… we didn’t actually do this in building our index…
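The linear postings traversal mentioned above is the usual merge-style intersection of two sorted lists; a minimal sketch, assuming postings are sorted arrays of docIDs (names ours):

    import java.util.*;

    public class PostingsOps {
        // AND of two sorted postings lists: advance the pointer with the smaller docID.
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> result = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
                else if (a[i] < b[j]) i++;
                else j++;
            }
            return result;
        }

        public static void main(String[] args) {
            int[] blue = {2};
            int[] fish = {1, 2};
            System.out.println(intersect(blue, fish));   // [2] -> docs matching "blue AND fish"
        }
    }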
  • 17. Inverted Indexing in MapReduce
    class Mapper
      procedure Map(docid n, doc d)
        H = new Set
        for all term t in doc d do
          H.add(t)
        for all term t in H do
          Emit(term t, n)

    class Reducer
      procedure Reduce(term t, Iterator<integer> docids [n1, n2, …])
        List P = docids.values()
        Emit(term t, P)
  • 18. Inverted Indexing in MapReduce: try 2
    class Mapper
      procedure Map(docid n, doc d)
        H = new Set
        for all term t in doc d do
          H.add(t)
        for all term t in H do
          Emit(term t, n)

    class Reducer
      procedure Reduce(term t, Iterator<integer> docids [n1, n2, …])
        List P = docids.values()
        Sort(P)                       // e.g. fish → [1, 2]
        Emit(term t, P)
  • 19. (Another) Scalability Bottleneck. The reducer buffers all docIDs associated with a term (to sort them); what if a term occurs in many documents? Secondary sorting: use a composite key, a partition function, and a key comparator. Side-effect: write directly to HDFS as before…
  • 20. Inverted index for spelling correction. Like search, spelling correction must be fast: how can we quickly identify candidate corrections? Inverted index: map each character ngram to the list of all words containing it. #ac → {act, across, actress, acquire, …}; acr → {across, acrimony, macro, …}; cre → {crest, acre, acres, …}; res → {arrest, rest, rescue, restaurant, …}; ess → {less, lesson, necessary, actress, …}; ss# → {less, mess, moss, across, actress, …}. How do we build the inverted index in MapReduce?
  • 21. Exercise Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus
  • 22. Exercise: Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus.
    Map(String docid, String text):
      for each word w in text:
        for each trigram t in w:
          Emit(t, w)
    Reduce(String trigram, Iterator<Text> values):
      Emit(trigram, values.toSet)
  Also other alternatives, e.g. in-mapper combining, pairs. Is MapReduce even necessary for this? Dictionary vs. token frequency.
  • 23. Spelling correction as Boolean search. Given the inverted index, how do we find the set of possible corrections? Compute the union of all words indexed by any of the typo’s character ngrams = Boolean search; e.g. query “acress” → “#ac OR acr OR cre OR res OR ess OR ss#”. Are all corrections equally likely / good?
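Since a dictionary is tiny compared to a document collection, both the trigram index and the Boolean OR query fit comfortably in memory; a sketch without MapReduce (class and method names are ours):

    import java.util.*;

    public class TrigramIndex {
        final Map<String, Set<String>> index = new HashMap<>();

        // Index a dictionary word under each of its boundary-padded trigrams.
        void add(String word) {
            String padded = "#" + word + "#";
            for (int i = 0; i + 3 <= padded.length(); i++)
                index.computeIfAbsent(padded.substring(i, i + 3), k -> new HashSet<>()).add(word);
        }

        // Union of the postings of every trigram in the typo = the Boolean OR query above.
        Set<String> candidates(String typo) {
            Set<String> union = new HashSet<>();
            String padded = "#" + typo + "#";
            for (int i = 0; i + 3 <= padded.length(); i++)
                union.addAll(index.getOrDefault(padded.substring(i, i + 3),
                                                Collections.<String>emptySet()));
            return union;
        }

        public static void main(String[] args) {
            TrigramIndex idx = new TrigramIndex();
            for (String w : Arrays.asList("across", "actress", "access", "acres", "caress", "cress"))
                idx.add(w);
            System.out.println(idx.candidates("acress"));   // all words sharing a trigram with the typo
        }
    }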
  • 24. Ranked Information Retrieval. Order documents by probability of relevance: estimate the relevance of each document to the query, then rank documents by relevance. How do we estimate relevance? The vector space paradigm: approximate relevance by vector similarity (e.g. cosine); represent queries and documents as vectors; rank documents by vector similarity to the query.
  • 25. Vector Space Model (figure: documents d1–d5 plotted as vectors over terms t1–t3, with angles θ and φ between them). Assumption: documents that are “close” in vector space “talk about” the same things. Retrieve documents based on how close the document vector is to the query vector (i.e., similarity ~ “closeness”).
  • 26. Similarity Metric. Use the “angle” between the vectors:
    \cos\theta = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j|\,|\vec{d}_k|}
    sim(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j|\,|\vec{d}_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}
  Given pre-normalized vectors, just compute the inner product: sim(d_j, d_k) = \vec{d}_j \cdot \vec{d}_k = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}
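A minimal sketch of cosine similarity over sparse vectors represented as maps (names ours), applied to the trigram vectors from the earlier example:

    import java.util.*;

    public class Cosine {
        // Cosine similarity between two sparse vectors (component -> weight).
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                normA += e.getValue() * e.getValue();
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
            }
            for (double w : b.values()) normB += w * w;
            return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            Map<String, Double> acress = new HashMap<>(), actress = new HashMap<>();
            for (String g : Arrays.asList("#ac", "acr", "cre", "res", "ess", "ss#")) acress.put(g, 1.0);
            for (String g : Arrays.asList("#ac", "act", "ctr", "tre", "res", "ess", "ss#")) actress.put(g, 1.0);
            System.out.printf("%.3f%n", cosine(acress, actress));   // 4 shared trigrams -> ~0.617
        }
    }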
  • 27. Boolean Character ngram correction. Boolean Information Retrieval (IR) model: query = typo word; document collection = dictionary (i.e. set of valid words); representation: a word is a set of character ngrams. Let’s use n=3 (trigrams), with # to mark word start/end. Examples: across: [#ac, acr, cro, oss, ss#]; acress: [#ac, acr, cre, res, ess, ss#]; actress: [#ac, act, ctr, tre, res, ess, ss#]; blam: [#bl, bla, lam, am#]; mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]
  • 28. Ranked Character ngram correction. Vector space Information Retrieval (IR) model: query = typo word; document collection = dictionary (i.e. set of valid words); representation: a word is a vector of character ngram values; rank candidate corrections according to vector similarity (cosine). Trigram examples: across: [#ac, acr, cro, oss, ss#]; acress: [#ac, acr, cre, res, ess, ss#]; actress: [#ac, act, ctr, tre, res, ess, ss#]; blam: [#bl, bla, lam, am#]; mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
  • 29. Spelling Correction in Vector Space (figure: words plotted as vectors in character-ngram space). Assumption: words that are “close together” in ngram vector space have similar orthography. Therefore, retrieve words from the dictionary based on how close each word is to the typo (i.e., similarity ~ “closeness”).
  • 30. Ranked Character ngram correction. Vector space Information Retrieval (IR) model: query = typo word; document collection = dictionary (i.e. set of valid words); representation: a word is a vector of character ngram values; rank candidate corrections according to vector similarity (cosine). Trigram examples: across: [#ac, acr, cro, oss, ss#]; acress: [#ac, acr, cre, res, ess, ss#]; actress: [#ac, act, ctr, tre, res, ess, ss#]; blam: [#bl, bla, lam, am#]; mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]. The “value” here expresses the relative importance of different vector components for the similarity comparison. We use a simple count here; what else might we do?
  • 31. IR Term Weighting. Term weights consist of two components. Local: how important is the term in this document? Global: how important is the term in the collection? Here’s the intuition: terms that appear often in a document should get high weights; terms that appear in many documents should get low weights. How do we capture this mathematically? Term frequency (local) and inverse document frequency (global).
  • 32. TF.IDF Term Weighting.
    w_{i,j} = tf_{i,j} \cdot \log \frac{N}{n_i}
  where w_{i,j} is the weight assigned to term i in document j; tf_{i,j} is the number of occurrences of term i in document j; N is the number of documents in the entire collection; and n_i is the number of documents containing term i.
  • 33. Inverted Index: TF.IDF (example). Doc 1: “one fish, two fish”; Doc 2: “red fish, blue fish”; Doc 3: “cat in the hat”; Doc 4: “green eggs and ham”. Each term now has a document frequency (df) and each posting carries a term frequency (tf): blue df=1 → [(2, 1)]; cat df=1 → [(3, 1)]; egg df=1 → [(4, 1)]; fish df=2 → [(1, 2), (2, 2)]; green df=1 → [(4, 1)]; ham df=1 → [(4, 1)]; hat df=1 → [(3, 1)]; one df=1 → [(1, 1)]; red df=1 → [(2, 1)]; two df=1 → [(1, 1)]
  • 34. Inverted Indexing via MapReduce (figure, repeated): Map emits (term, docid) pairs; shuffle and sort aggregates values by key; Reduce produces the postings list for each term.
  • 35. Inverted Indexing via MapReduce (2) (figure): now each emitted posting carries a payload, the term frequency; e.g. Map over Doc 1 emits one→(1, 1), two→(1, 1), fish→(1, 2), and Reduce produces fish→[(1, 2), (2, 2)], hat→[(3, 1)], and so on.
  • 36. Inverted Indexing: Pseudo-Code. Further exacerbates the earlier scalability issues…
  • 37. Ranked Character ngram correction. Vector space Information Retrieval (IR) model: query = typo word; document collection = dictionary (i.e. set of valid words); representation: a word is a vector of character ngram values; rank candidate corrections according to vector similarity (cosine). Trigram examples: across: [#ac, acr, cro, oss, ss#]; acress: [#ac, acr, cre, res, ess, ss#]; actress: [#ac, act, ctr, tre, res, ess, ss#]; blam: [#bl, bla, lam, am#]; mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]. The “value” here expresses the relative importance of different vector components for the similarity comparison. What else might we do? TF.IDF for character n-grams?
  • 38. TF.IDF for character n-grams. Think about what makes an ngram more discriminating; e.g. in acquire, acq and cqu are more indicative than qui and ire. Schematically, we want something like acquire: [#ac, acq, cqu, qui, uir, ire, re#]. Possible solution: TF-IDF, where TF is the frequency of the ngram in the word and IDF is the number of words in the vocabulary the ngram occurs in.
  • 39. Correction Beyond Orthography So far we’ve focused on orthography alone The context of a typo also tells us a great deal How can we compare contexts?
  • 40. Correction Beyond Orthography. So far we’ve focused on orthography alone; the context of a typo also tells us a great deal. How can we compare contexts? Idea: use the co-occurrence matrices built during HW2. We have a vector of co-occurrence counts for each word; extract a similar vector for the typo given its immediate context, e.g. “She is their favorite acress in town.” → acress: [she:1, is:1, their:1, favorite:1, in:1, town:1]. Possible enhancement: make the vectors sensitive to word order.
  • 41. Combining evidence. We have orthographic similarity and contextual similarity; we can do a simple weighted combination of the two, e.g. simCombined(d_j, d_k) = \lambda \cdot simOrth(d_j, d_k) + (1 - \lambda) \cdot simContext(d_j, d_k). How to do this more efficiently? Compute the top candidates based on simOrth and take the top k for consideration with simContext, or the other way around. The combined model might also be expressed by a similar probabilistic model…
  • 42. Paradigm: Noisy-Channel Modeling.
    \hat{s} = \arg\max_S P(S \mid O) = \arg\max_S P(S)\, P(O \mid S)
  We want to recover the most likely latent (correct) source word underlying the observed (misspelled) word. P(S): the language model gives a probability distribution over possible (candidate) source words. P(O|S): the channel model gives the probability of each candidate source word being “corrupted” into the observed typo.
  • 43. Noisy Channel Model for correction. We want to rank candidates by P(cand | typo). Using Bayes’ law, the chain rule, an independence assumption, and logs, we have:
    P(cand | typo, context) = P(cand, typo, context) / P(typo, context)
      ∝ P(cand, typo, context)
      = P(typo | cand, context) P(cand, context)
      ≈ P(typo | cand) P(cand, context)
      = P(typo | cand) P(cand | context) P(context)
      ∝ P(typo | cand) P(cand | context)
      → log P(typo | cand) + log P(cand | context)
  • 44. Probabilistic vs. vector space model. Both measure the orthographic & contextual “fit” of the candidate given the typo and its usage context. Noisy channel: rank by log P(typo | cand) + log P(cand | context). IR approach: simCombined(d_j, d_k) = \lambda \cdot simOrth(d_j, d_k) + (1 - \lambda) \cdot simContext(d_j, d_k). Both can benefit from “big” data (i.e. bigger samples): better estimates of probabilities and population frequencies. Usual probabilistic vs. non-probabilistic tradeoffs: principled theory and methodology for modeling and estimation; how do we extend the feature space to include additional information? Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
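A sketch of how the noisy-channel ranking might be wired up in code. The two model interfaces are placeholders for whatever channel and context/language models are actually estimated; nothing here is prescribed by the slides.

    import java.util.*;

    public class NoisyChannelRanker {
        // Placeholder models: in practice these would be estimated from corpora / error logs.
        interface ChannelModel { double logProbTypoGivenCand(String typo, String cand); }
        interface ContextModel { double logProbCandGivenContext(String cand, List<String> context); }

        // Rank candidates by log P(typo|cand) + log P(cand|context), as derived above.
        static List<String> rank(String typo, List<String> context, Collection<String> candidates,
                                 ChannelModel channel, ContextModel lm) {
            List<String> ranked = new ArrayList<>(candidates);
            ranked.sort((a, b) -> Double.compare(
                    channel.logProbTypoGivenCand(typo, b) + lm.logProbCandGivenContext(b, context),
                    channel.logProbTypoGivenCand(typo, a) + lm.logProbCandGivenContext(a, context)));
            return ranked;   // best-scoring candidate first
        }
    }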
  • 45. Index Compression
  • 46. Postings Encoding. Conceptually, postings are (docid, tf) pairs: fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), … In practice: instead of document IDs, encode deltas (or d-gaps) between successive docIDs, but it’s not obvious that this saves space… fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
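A minimal sketch of the d-gap transformation (names ours), reproducing the numbers above:

    import java.util.*;

    public class DGaps {
        // Replace sorted docIDs by differences from the previous docID.
        static int[] toGaps(int[] docids) {
            int[] gaps = new int[docids.length];
            int prev = 0;
            for (int i = 0; i < docids.length; i++) { gaps[i] = docids[i] - prev; prev = docids[i]; }
            return gaps;
        }

        // Recover the original docIDs by a running sum.
        static int[] fromGaps(int[] gaps) {
            int[] docids = new int[gaps.length];
            int prev = 0;
            for (int i = 0; i < gaps.length; i++) { prev += gaps[i]; docids[i] = prev; }
            return docids;
        }

        public static void main(String[] args) {
            int[] postings = {1, 9, 21, 34, 35, 80};                 // the "fish" docIDs above
            System.out.println(Arrays.toString(toGaps(postings)));   // [1, 8, 12, 13, 1, 45]
        }
    }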
  • 47. Overview of Index Compression. Byte-aligned vs. bit-aligned. Non-parameterized bit-aligned: unary codes, γ (gamma) codes, δ (delta) codes. Parameterized bit-aligned: Golomb codes. Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
  • 48. But First... General Data Compression. Run Length Encoding: 7 7 7 8 8 9 = (7, 3), (8, 2), (9, 1). Binary equivalent: 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3; good with sparse binary data. Huffman Coding: optimal when the data is distributed by negative powers of two, e.g. P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8 gives a = 0, b = 10, c = 110, d = 111. Prefix codes: no codeword is the prefix of another codeword. If we read 0, we know it’s an “a”; the following bits are a new codeword. Similarly, 10 is a b (no other codeword starts with 10), etc. The prefix is 1* (i.e. the path to internal nodes is all 1s, with outputs on the leaves).
  • 49. Unary Codes. Encode a number x ≥ 1 as a run of 1s, specifically: x is coded as (x−1) 1s followed by a zero-bit terminator. 1 = 0; 2 = 10; 3 = 110; 4 = 1110; … Great for small numbers, horrible for large numbers; overly biased toward very small gaps.
  • 50. γ codes. x ≥ 1 is coded in two parts: unary length : offset. Start with x in binary and remove the highest-order bit to get the offset; the length is the number of binary digits, encoded in unary; concatenate length + offset. Example: 9 in binary is 1001, so offset = 001, length = 4 (unary 1110), and the γ code = 1110:001. Another example: 7 (111 in binary): offset = 11, length = 3 (110 in unary), so the γ code = 110:11. Analysis: offset = ⌊log x⌋ bits; length = ⌊log x⌋ + 1 bits; total = 2⌊log x⌋ + 1 bits (9 → 7 bits, 7 → 5 bits, …).
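A small sketch of γ encoding as bit strings (helper names are ours; the ':' separator is kept only for readability, matching the slides):

    public class GammaCode {
        static String unary(int x) {                 // x >= 1: (x-1) ones followed by a 0
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i < x; i++) sb.append('1');
            return sb.append('0').toString();
        }

        // Gamma code: unary(length of binary form) followed by the binary form without its leading 1.
        static String gamma(int x) {                 // x >= 1
            String binary = Integer.toBinaryString(x);
            String offset = binary.substring(1);
            return offset.isEmpty() ? unary(binary.length())
                                    : unary(binary.length()) + ":" + offset;
        }

        public static void main(String[] args) {
            System.out.println(gamma(9));   // 1110:001
            System.out.println(gamma(7));   // 110:11
        }
    }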
  • 51. δ codes. As with γ codes, two parts: unary length and offset. The offset is the same as before, but the length is encoded by its γ code. Example: 9 (= 1001 in binary): offset = 001; length = 4 (100 in binary), whose γ code has offset 00 and length 3 (110 in unary), i.e. γ code = 110:00; so the δ code = 110:00:001. Comparison: γ codes are better for smaller numbers, δ codes are better for larger numbers.
  • 52. Golomb Codes. For x ≥ 1 and parameter b, x is encoded in two parts. Part 1: q = ⌊(x − 1)/b⌋, code q + 1 in unary. Part 2: the remainder r = x − qb − 1 (with r < b), coded in truncated binary. Truncated binary defines a prefix code: if b is a power of 2 (the easy case), truncated binary = regular binary; else the first 2^(⌊log b⌋ + 1) − b values are encoded in ⌊log b⌋ bits and the remaining values in ⌊log b⌋ + 1 bits. Let’s see some examples.
  • 53. Golomb Code Examples. b = 3, r = [0:2]: first 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit: 0; remaining 3 − 1 = 2 values in 1 + 1 = 2 bits with prefix 1: 10, 11. b = 5, r = [0:4]: first 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits: 00, 01, 10; remaining 5 − 3 = 2 values in 2 + 1 = 3 bits with prefix 11: 110, 111 (two prefix bits needed since a single leading 1 is already used in “10”). b = 6, r = [0:5]: first 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits: 00, 01; remaining 6 − 2 = 4 values in 2 + 1 = 3 bits with prefix 1: 100, 101, 110, 111.
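A small sketch of Golomb encoding with a truncated-binary remainder, reproducing entries from the table on the next slide (method names are ours; bit strings and the ':' separator are used for readability only):

    public class GolombCode {
        static String unary(int x) {                     // x >= 1
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i < x; i++) sb.append('1');
            return sb.append('0').toString();
        }

        // Truncated binary code for r in [0, b): shorter codewords for the first few values.
        static String truncatedBinary(int r, int b) {
            int k = 31 - Integer.numberOfLeadingZeros(b);    // floor(log2 b)
            int cutoff = (1 << (k + 1)) - b;                 // 2^(floor(log b)+1) - b
            if (r < cutoff) return toBits(r, k);             // first values: k bits
            return toBits(r + cutoff, k + 1);                // remaining values: k+1 bits
        }

        static String toBits(int value, int width) {
            StringBuilder sb = new StringBuilder();
            for (int i = width - 1; i >= 0; i--) sb.append((value >> i) & 1);
            return sb.toString();
        }

        // Golomb code of x >= 1 with parameter b: unary quotient, truncated-binary remainder.
        static String golomb(int x, int b) {
            int q = (x - 1) / b;
            int r = x - q * b - 1;
            return unary(q + 1) + ":" + truncatedBinary(r, b);
        }

        public static void main(String[] args) {
            System.out.println(golomb(9, 3));   // 110:11
            System.out.println(golomb(9, 6));   // 10:100
        }
    }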
  • 54. Comparison of Coding Schemes
     x   Unary        γ          δ           Golomb b=3   Golomb b=6
     1   0            0          0           0:0          0:00
     2   10           10:0       100:0       0:10         0:01
     3   110          10:1       100:1       0:11         0:100
     4   1110         110:00     101:00      10:0         0:101
     5   11110        110:01     101:01      10:10        0:110
     6   111110       110:10     101:10      10:11        0:111
     7   1111110      110:11     101:11      110:0        10:00
     8   11111110     1110:000   11000:000   110:10       10:01
     9   111111110    1110:001   11000:001   110:11       10:100
    10   1111111110   1110:010   11000:010   1110:0       10:101
  See Figure 4.5 in Lin & Dyer, p. 77, for b=5 and b=10. Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 55. Index Compression: Performance. Comparison of index size (bits per pointer), Bible / TREC: Unary 262 / 1918; Binary 15 / 20; γ 6.51 / 6.63; δ 6.23 / 6.38; Golomb 6.09 / 5.84. Use Golomb for d-gaps and γ codes for term frequencies. Optimal b ≈ 0.69 (N/df): a different b for every term! Bible: King James version of the Bible, 31,101 verses (4.3 MB). TREC: TREC disks 1+2, 741,856 docs (2070 MB). Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 56. Where are we without compression? (figure): instead of key = term with unsorted values (docid, tf, [positions]) — fish → (1, 2, [2,4]), (34, 1, [23]), (21, 3, [1,8,22]), (35, 2, [8,41]), (80, 3, [2,9,76]), (9, 1, [9]) — use composite (term, docid) keys with the position list as the value, which arrive in sorted order: (fish, 1) [2,4]; (fish, 9) [9]; (fish, 21) [1,8,22]; (fish, 34) [23]; (fish, 35) [8,41]; (fish, 80) [2,9,76]. How is this different? Let the framework do the sorting; directly write postings to disk; term frequency implicitly stored.
  • 57. Index Compression in MapReduce. We need df to compress the postings for each term. How do we compute df? Count the number of postings in reduce(), then compress. Problem?
  • 58. Order Inversion Pattern. In the mapper: emit “special” key-value pairs to keep track of df. In the reducer: make sure the “special” key-value pairs come first, and process them to determine df. Remember: proper partitioning!
  • 59. Getting the df: Modified Mapper. Input document: Doc 1, “one fish, two fish”. Emit normal key-value pairs: (fish, 1) → [2,4]; (one, 1) → [1]; (two, 1) → [3]. Also emit “special” key-value pairs to keep track of df: (fish, *) → [1]; (one, *) → [1]; (two, *) → [1].
  • 60. Getting the df: Modified Reducer. First, compute the df by summing the contributions from all “special” key-value pairs, e.g. (fish, *) → [63] [82] [27] …, then compress postings incrementally as they arrive: (fish, 1) [2,4]; (fish, 9) [9]; (fish, 21) [1,8,22]; (fish, 34) [23]; (fish, 35) [8,41]; (fish, 80) [2,9,76]; … Important: properly define the sort order to make sure the “special” key-value pairs come first! Write postings directly to disk. Where have we seen this before?
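One way the proper-partitioning requirement might be met in code: a sketch of a custom old-API partitioner for composite keys of the form term + TAB + docid, with TAB + '*' used for the special df pairs. This particular key layout is our assumption for illustration; because '*' sorts before any digit, the default Text sort already delivers the special pair before the term's postings.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Partition on the term alone so that a term's special df pair and all of its
    // postings reach the same reducer, regardless of what follows the tab.
    public class TermPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }

        public int getPartition(Text key, Text value, int numPartitions) {
            String term = key.toString().split("\t", 2)[0];
            return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }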
  • 61. In-class Exercise
  • 62. Exercise: where have all the ngrams gone? For each observed (word) trigram in the collection, output its observed (docID, wordIndex) locations. Input: Doc 1 “one fish two fish”, Doc 2 “one fish two salmon”, Doc 3 “two fish two fish”. Output: one fish two → [(1,1),(2,1)]; fish two fish → [(1,2),(3,2)]; fish two salmon → [(2,2)]; two fish two → [(3,1)]. Possible tools: pairs/stripes? combining? secondary sorting? order inversion? side effects?
  • 63. Exercise: shingling. Given observed (docID, wordIndex) ngram locations, for each document, for each of its ngrams (in order), give the list of the ngram locations for that ngram. Input: one fish two → [(1,1),(2,1)]; fish two fish → [(1,2),(3,2)]; fish two salmon → [(2,2)]; two fish two → [(3,1)]. Output: Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]; Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]; Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]. Possible tools: pairs/stripes? combining? secondary sorting? order inversion? side effects?
  • 64. Exercise: shingling (2). How can we recognize when longer ngrams are aligned across documents? Example: doc 1: a b c d e; doc 2: a b c d f; doc 3: e b c d f; doc 4: a b c d e. Find “a b c d” in docs 1, 2, and 4; “b c d f” in docs 2 and 3; “a b c d e” in docs 1 and 4.
  • 65.
    class Alignment
      int index       // start position in this document
      int length      // sequence length in ngrams
      int otherID     // ID of other document
      int otherIndex  // start position in other document

    typedef Pair<int docID, int position> Ngram;

    class NgramExtender
      Set<Alignment> alignments = empty set
      index = 0;
      NgramExtender(int docID) { _docID = docID }
      close() { foreach Alignment a, emit(_docID, a) }
      AlignNgrams(List<Ngram> ngrams)
        // call this function iteratively in order of ngrams observed in this document
        ...

    @inproceedings{Kolak:2008,
      author = {Kolak, Okan and Schilit, Bill N.},
      title = {Generating links by mining quotations},
      booktitle = {19th ACM conference on Hypertext and hypermedia},
      year = {2008},
      pages = {117--126}
    }
  • 66.
    class Alignment
      int index       // start position in this document
      int length      // sequence length in ngrams
      int otherID     // ID of other document
      int otherIndex  // start position in other document

    typedef Pair<int docID, int position> Ngram;

    class NgramExtender
      Set<Alignment> alignments = empty set
      index = 0;
      NgramExtender(int docID) { _docID = docID }
      close() { foreach Alignment a, emit(_docID, a) }
      AlignNgrams(List<Ngram> ngrams)
        // call this function iteratively in order of ngrams observed in this document
        ++index;
        foreach Alignment a in alignments
          Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
          if (ngrams.contains(next))
            // extend alignment
            a.length += 1;
            ngrams.remove(next)
          else
            // terminate alignment
            emit(_docID, a);
            alignments.remove(a)
        foreach ngram in ngrams
          alignments.add( new Alignment( index, 1, ngram.docID, ngram.otherIndex ) )
  • 67. Sequences of MapReduce Jobs
  • 68. Building more complex MR algorithms. Monolithic single Map + single Reduce: what we’ve done so far; fitting all computation to this model can be difficult and ugly; we generally strive for modularization when possible. What else can we do? Pipeline: [Map Reduce] [Map Reduce] … (multiple sequential jobs). Chaining: [Map+ Reduce Map*] (1 or more Mappers, 1 reducer, 0 or more Mappers). Pipelined Chain: [Map+ Reduce Map*] [Map+ Reduce Map*] … Express arbitrary dependencies between jobs.
  • 69. Modularization and WordCount. General benefits of modularization: re-use for easier/faster development; consistent behavior across applications; easier/faster to maintain/extend for the benefit of many applications. Even basic word count can be broken down: pre-processing (how will we tokenize? perform stemming? remove stopwords?); main computation (count tokenized tokens and group by word); post-processing (transform the values? e.g. log-damping). Let’s separate tokenization into its own module; many other tasks can likely benefit. First approach: pipeline…
  • 70. Pipeline WordCount Modules. Tokenize: Tokenizer Mapper (String -> List[String]; keep the doc ID key; e.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])), no Reducer. Count: Observer Mapper (List[String] -> List[(String, Int)]; e.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”,1), (“10”,1), (“cats”,1), (“sleep”,1)]) and LongSumReducer (sum token counts; e.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)).
  • 71. Pipeline WordCount in Hadoop. Two distinct jobs: tokenize and count; data sharing between jobs via persistent output; can use combiners and partitioners as usual (won’t bother here). Let’s use SequenceFileOutputFormat rather than TextOutputFormat: a sequence of binary key-value pairs, faster / smaller; the tokenization output will stick around unless we delete it. Tokenize job: just a mapper, no reducer (conf.setNumReduceTasks(0) or IdentityReducer); output goes to a directory we specify; the files will be read back in by the counting job; output is an array of tokens, so we need to make a suitable Writable for String arrays. Count job: input types defined by the input SequenceFile (don’t need to be specified); the Mapper is trivial, it observes tokens from the incoming data; Key: (docid) & Value: (array of Strings, encoded as a Writable).
  • 72. Pipeline WordCount (old Hadoop API)
    Configuration conf = new Configuration();
    String tmpDir1to2 = "/tmp/intermediate1to2";

    // Tokenize job
    JobConf tokenizationJob = new JobConf(conf);
    tokenizationJob.setJarByClass(PipelineWordCount.class);
    FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
    tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
    tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
    tokenizationJob.setOutputKeyClass(LongWritable.class);
    tokenizationJob.setOutputValueClass(TextArrayWritable.class);
    tokenizationJob.setNumReduceTasks(0);

    // Count job
    JobConf countingJob = new JobConf(conf);
    countingJob.setJarByClass(PipelineWordCount.class);
    countingJob.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
    FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
    countingJob.setMapperClass(TrivialWordObserver.class);
    countingJob.setReducerClass(MapRedIntSumReducer.class);
    countingJob.setOutputKeyClass(Text.class);
    countingJob.setOutputValueClass(IntWritable.class);
    countingJob.setNumReduceTasks(reduceTasks);

    JobClient.runJob(tokenizationJob);
    JobClient.runJob(countingJob);
  • 73. Pipeline jobs in Hadoop. Old API: JobClient.runJob(..) does not return until the job finishes. New API: use Job rather than JobConf; use job.waitForCompletion instead of JobClient.runJob. Why the old API? In 0.20.2, chaining is only possible under the old API, and we want to re-use the same components for chaining (next…).
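For comparison, a rough sketch of what the two-job pipeline driver might look like under the new API; the mapper/reducer classes are assumed to exist in new-API form and the configuration is abbreviated.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NewApiPipeline {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String tmpDir = "/tmp/intermediate1to2";

            Job tokenize = new Job(conf, "tokenize");
            tokenize.setJarByClass(NewApiPipeline.class);
            FileInputFormat.addInputPath(tokenize, new Path(args[0]));
            FileOutputFormat.setOutputPath(tokenize, new Path(tmpDir));
            tokenize.setNumReduceTasks(0);                 // map-only job
            // tokenize.setMapperClass(...); plus output key/value classes as appropriate

            Job count = new Job(conf, "count");
            count.setJarByClass(NewApiPipeline.class);
            FileInputFormat.addInputPath(count, new Path(tmpDir));
            FileOutputFormat.setOutputPath(count, new Path(args[1]));
            count.setOutputKeyClass(Text.class);
            count.setOutputValueClass(IntWritable.class);
            // count.setMapperClass(...); count.setReducerClass(...);

            // waitForCompletion blocks until the job finishes, so the second job
            // only starts once the first job's output exists.
            if (tokenize.waitForCompletion(true))
                count.waitForCompletion(true);
        }
    }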
  • 74. Chaining in Hadoop (figure: a [Map+ Reduce Map*] chain, Mapper 1 → Mapper 2 → Reducer → Mapper 3, with intermediates passed between stages and only the final Mapper writing persistent output). 1 or more Mappers (can use IdentityMapper); 1 reducer (no reducers: conf.setNumReduceTasks(0)?); 0 or more Mappers. Usual combiners and partitioners. By default, data is passed between Mappers by the usual writing of intermediate data to disk; can always use side-effects… There is a better, built-in way to bypass this and pass (Key,Value) pairs by reference instead (requires different Mapper semantics!).
  • 75. Hadoop: ChainMapper & ChainReducer. Below, JobConf objects (deprecated in Hadoop 0.20.2); there is no undeprecated replacement in 0.20.2. The examples here work for later versions with small changes.
    Configuration conf = new Configuration();
    JobConf job = new JobConf(conf);
    ...
    boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?

    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
        Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);

    JobConf map2Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
        Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);

    JobConf reduceConf = new JobConf(false);
    ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
        ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);

    JobConf map3Conf = new JobConf(false);
    ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
        Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);

    JobClient.runJob(job);
  • 76. Chaining in Hadoop. Let’s continue our running example. Mapper 1: Tokenize. Mapper 2: Observe (count) words. Reducer: same IntSum reducer as always. Mapper 3: log-dampen counts (we didn’t have this in our pipeline example, but we’ll add it here…).
  • 77. Chained Tokenizer + WordCount
    // Set up configuration and intermediate directory location
    Configuration conf = new Configuration();
    JobConf chainJob = new JobConf(conf);
    chainJob.setJobName("Chain job");
    chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
    chainJob.setNumReduceTasks(reduceTasks);
    FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));

    // pass output (Key,Value) pairs to next Mapper by reference?
    boolean passByRef = false;

    JobConf map1 = new JobConf(false); // tokenization
    ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class, LongWritable.class, Text.class,
        LongWritable.class, TextArrayWritable.class, passByRef, map1);

    JobConf map2 = new JobConf(false); // Add token observer job
    ChainMapper.addMapper(chainJob, TrivialWordObserver.class, LongWritable.class, TextArrayWritable.class,
        Text.class, LongWritable.class, passByRef, map2);

    JobConf reduce = new JobConf(false); // Set the int sum reducer
    ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
        Text.class, LongWritable.class, passByRef, reduce);

    JobConf map3 = new JobConf(false); // log-scaling of counts
    ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
        Text.class, FloatWritable.class, passByRef, map3);

    JobClient.runJob(chainJob);
  • 78. Hadoop Chaining: Pass by Reference. Chaining allows a possible optimization: chained mappers run in the same JVM thread, so there is an opportunity to avoid the serialization to/from disk incurred with pipelined jobs; there is also the lesser benefit of avoiding extra object destruction / construction. Gotchas: OutputCollector.collect(K k, V v) promises not to alter the content of k and v, but if Map1 passes (k,v) by reference to Map2 via collect(), Map2 may alter (k,v) and thereby violate the contract. What to do? Option 1: honor the contract (don’t alter the input (k,v) in Map2). Option 2: re-negotiate terms (don’t re-use (k,v) in Map1 after collect()). Document carefully to avoid later changes silently breaking this…
  • 79. Setting Dependencies Between Jobs. JobControl and Job provide the mechanism:
    // create jobconf1 and jobconf2 as appropriate
    // …
    Job job1 = new Job(jobconf1);
    Job job2 = new Job(jobconf2);
    job2.addDependingJob(job1);
    JobControl jbcntrl = new JobControl("jbcntrl");
    jbcntrl.addJob(job1);
    jbcntrl.addJob(job2);
    jbcntrl.run();
  New API: no JobConf, create Job from Configuration, …
  • 80. Higher Level Abstractions. Pig: a language and execution environment for expressing MapReduce data flows (pretty much the standard); see White, Chapter 11. Cascading: another environment with a higher level of abstraction for composing complex data flows; see White, Chapter 16, pp. 539-552. Cascalog: a query language based on Cascading that uses Clojure (a JVM-based LISP variant). Word count in Cascalog is certainly more concise, though you need to grok the syntax: (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))