1. Common MapReduce Patterns
Chris K Wensel
BuzzWords 2011
2. Engineer, Not Academic
• Concurrent, Inc., Founder
  • Cascading support and tools
  • http://concurrentinc.com/
• Cascading, Lead Developer (started Sept 2007)
  • An alternative API to MapReduce
  • http://cascading.org/
• Formerly Hadoop mentoring and training
  • Sun, Apple, HP, LexisNexis, startups, etc.
• Formerly Systems Architect & Consultant
  • Thomson/Reuters, TeleAtlas, startups, etc.
3. Overview
• MapReduce
• Heavy Lifting
• Analytics
• Optimizations
4. MapReduce
• A "divide and conquer" strategy for parallelizing workloads against collections of data
• Map & Reduce are two user-defined functions chained via key-value pairs
• It's really Map -> Group -> Reduce, where Group is built in
5. Keys and Values
• Map translates input keys and values to new keys and values: [K1,V1] -> Map -> [K2,V2]*
• The system groups each unique key with all its values: [K2,V2] -> Group -> [K2,{V2,V2,...}]
• Reduce translates the values of each unique key to new keys and values: [K2,{V2,V2,...}] -> Reduce -> [K3,V3]*
(* = zero or more)
6. Word Count
• Mapper: [0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]
• Group: ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> ["when",{1,1,1,1,1}]
• Reducer: ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
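The same flow as code, as a minimal sketch against the Hadoop Java MapReduce API (the class names and the whitespace tokenizer are illustrative choices, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: [offset, line] -> ["word", 1] for every token in the line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce: ["word", {1,1,1,...}] -> ["word", count]
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}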
7. Divide and Conquer Parallelism
• The 'records' entering the Map and the 'groups' entering the Reduce are independent
• That is, there is no expectation of order or requirement to share state between records/groups
• So arbitrary numbers of Map and Reduce function instances can be created against arbitrary portions of the input data
8. Cluster
[Diagram: a cluster of racks, each rack holding nodes, each node running map and reduce tasks]
• Multiple instances of each Map and Reduce function are distributed throughout the cluster
9. Another View
[Diagram: input splits feed Mapper tasks (Map + Combine); the Shuffle delivers grouped keys to Reducer tasks (Group + Reduce), which write part-0000N files to the output directory; all tasks of a kind run the same code]
• Mappers must complete before Reducers can begin
10. Complex job assemblies
• Real applications are many MapReduce jobs chained together
• Linked by intermediate (usually temporary) files
• Executed in order, by hand, from the 'client' application
[Diagram: Count Job (Map -> Reduce) writes a file that feeds a Sort Job (Map -> Reduce)]
• [k,v] = key and value pair; [k,[v]] = key and associated values collection
11. Real World Apps
[Diagram: dependency graph of a single application compiled into 75 chained jobs]
• 1 app, 75 jobs
• green = map + reduce, purple = map, blue = join/merge, orange = map split
12. Heavy Lifting
• Things we must do because data can be heavy
• These patterns are natural to MapReduce and easy to implement
• But they have some room for composition/aggregation within a Map/Reduce (i.e., Filter + Binning)
• (leading us to think of Hadoop as an ETL framework)
• Patterns: Record Filtering; Parsing, Conversion; Counting, Summing; Unique; Binning; Distributed Tasks
13. Record Filtering
• Think unix 'grep'
• Filtering is discarding unwanted values (or preserving wanted ones)
• Only uses a Map function, no Reducer
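A minimal map-only filtering sketch in Hadoop Java, assuming the pattern arrives through a hypothetical "grep.pattern" configuration key; setting job.setNumReduceTasks(0) makes the job run without a Reducer, so surviving records go straight to output:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "grep.pattern" is a hypothetical configuration key for this sketch.
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Keep only lines that match; everything else is silently discarded.
    if (pattern.matcher(value.toString()).find()) {
      context.write(value, NullWritable.get());
    }
  }
}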
14. Parsing, Conversion
• Think unix 'sed'
• A Map function that takes an input key and/or value and translates it into a new format
• Examples:
  • raw logs to delimited text or an archival-efficient binary format
  • entity extraction
15. Counting, Summing
• The same as SQL aggregation functions
• Simply applying some function to the values collection seen in Reduce
• Other examples: average, max, min, unique
16. Merging
• Where many files of the same type are converted to one output path
• Map side merges
  • One directory with as many part files as Mappers
• Reduce side merges
  • Allows for removing duplicates or deleted items
  • One directory with as many part files as Reducers
• Examples
  • Nutch
  • Normalizing log files (apache, log4j, etc.)
17. Binning
• Where the values associated with unique keys are persisted together
• Typically a directory path based on the key's value
• Must be conscious of total open files; remember, no appends
• Examples:
  • web log files by year/month/day
  • trade data by symbol
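A minimal binning sketch using Hadoop's MultipleOutputs, where each key's values are written under a directory derived from the key; the "<key>/part" layout and the class name are illustrative assumptions, and the job's output format must be a file-based output format for this write() variant:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class BinningReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // e.g. key "2008/08/08" yields output files under .../2008/08/08/part-r-00000
      out.write(key, value, key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close(); // flush all open bin writers
  }
}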
18. Distributed Tasks
• Simply where a Map or Reduce function executes some 'task' based on the input key and value
• Examples:
  • web crawling
  • load testing services
  • rdbms/nosql updates
  • file transfers (S3)
  • image to pdf (NYT on EC2)
19. Basic Analytic Patterns
• Some of these patterns are unnatural to MapReduce
• We think in terms of columns/fields, not key-value pairs
• (leading us to think of Hadoop as an RDBMS)
• Patterns: Group By; Unique; Secondary Sort; Secondary Unique; CoGrouping and Joining
20. Composite Keys/Values
• [K1,V1] becomes a tuple of fields <A1,B1,C1,...>
• It is easier to think in columns/fields
  • e.g. "firstname" & "lastname", not "line"
• Whether a set of columns are Keys or Values is arbitrary
• Keys become a means to piggyback the properties of MapReduce and become an implementation detail
21. Group By
[Diagram: GroupBy on dept_id - 1001: Jim, Mary, Susan; 1002: Fred, Wilma, Ernie, Barny]
• Group By is where Value fields are grouped by Grouping fields
• Above, the Map output key is "dept_id" and the value is "name"
22. Group By
• Mapper (piggyback code): [K1,V1] -> <A1,B1,C1,D1>; user Map emits <A2,B2> -> K2, <C2,D2> -> V2
• Reducer (piggyback code): [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2>,...}>; user Reduce emits <A3,B3> -> K3, <C3,D3> -> V3
• So the K2 key becomes a composite Key of
  • key: [grouping], value: [values]
23. Unique
• Mapper: [0, "when in the course of human events"] -> Map -> ["when",null] ["in",null] [...,null]
• Group: ["when",null] ["when",null] ... -> ["when",{nulls}]
• Reducer: ["when",{nulls}] -> Reduce -> ["when",null]
• Or Distinct (as in SQL)
• Globally finding all the unique values in a dataset
  • Usually finding unique values in a column
• Often used to filter a second dataset using a join
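A minimal distinct-values sketch: the Mapper emits each value as a key with a null payload, and the Reducer emits every unique key exactly once (class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Unique {

  public static class ValueAsKeyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) context.write(new Text(token), NullWritable.get());
      }
    }
  }

  public static class FirstOnlyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      // Grouping already collapsed the duplicates; emit the key once.
      context.write(key, NullWritable.get());
    }
  }
}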
24. Secondary Sort
  Date (group)   Time (sorted value)   Url (remaining value)
  08/08/2008     1:00:00               http://www.example.com/foo
  08/08/2008     1:01:00               http://www.example.com/bar
  08/08/2008     1:01:30               http://www.example.com/baz
• Secondary Sorting is where
  • Some Fields are grouped on, and
  • Some of the remaining Fields are sorted within their grouping
25. Secondary Sort
• Mapper: [K1,V1] -> <A1,B1,C1,D1>; user Map emits <A2,B2><C2> -> K2, <D2> -> V2
• Reducer: [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2>,...}>; user Reduce emits <A3,B3> -> K3, <C3,D3> -> V3
• So the K2 key becomes a composite Key of
  • key: [grouping, secondary], value: [remaining values]
• The trick is for the secondary field to piggyback the Reduce sort yet not take part in the unique-key comparison used for grouping
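A condensed sketch of that piggybacking, assuming a Text composite key of the form "group<TAB>secondary" (the layout and class names are illustrative, not from the slides). The natural Text ordering sorts by group and then by secondary; the partitioner and the grouping comparator look only at the group prefix, so each reduce call sees one group with its values arriving already sorted on the secondary field:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySort {

  static String groupOf(Text compositeKey) {
    String s = compositeKey.toString();
    int tab = s.indexOf('\t');
    return tab < 0 ? s : s.substring(0, tab);
  }

  // Partition on the group part only, so all records of a group
  // reach the same reducer regardless of the secondary field.
  public static class GroupPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      return (groupOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the group part only, so one reduce() call spans all
  // composite keys that share the same group prefix.
  public static class GroupComparator extends WritableComparator {
    public GroupComparator() { super(Text.class, true); }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return groupOf((Text) a).compareTo(groupOf((Text) b));
    }
  }
}

Wired into the job with job.setPartitionerClass(SecondarySort.GroupPartitioner.class) and job.setGroupingComparatorClass(SecondarySort.GroupComparator.class).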
26. Secondary Unique
• Mapper: [0, "when in the course of human events"] -> Map -> [0,"when"] [0,"in"] [0,"the"] [0,...] (assume the Secondary Sorting magic happens here)
• Group: [0,{"in","in","the","when","when",...}]
• Reducer: [0,{"in","in","the","when","when",...}] -> Reduce -> ["in",null] ["the",null] ["when",null]
• Secondary Unique is where the grouping's values are uniqued
  • ...in a "scale free" way
• Perform a Secondary Sort...
• The Reducer then removes duplicates by discarding every value that matches the previous value
  • since values are now ordered, there is no need to maintain a Set of values
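A sketch of the corresponding Reducer, assuming the shuffle has already been configured for secondary sort so each group's values arrive in order; duplicates are dropped by comparing each value against the previous one (class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SecondaryUniqueReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String previous = null;
    for (Text value : values) {
      String current = value.toString();
      // values arrive sorted, so a duplicate is always adjacent to its twin
      if (!current.equals(previous)) {
        context.write(new Text(current), NullWritable.get());
      }
      previous = current;
    }
  }
}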
27. Joining
[Diagram: LHS data (dept_id, name) joined to RHS data (dept_id, dept_name) - 1001: Jim, Mary, Susan with Accounting; 1002: Fred, Wilma, Ernie, Barny with Shipping]
• Where two or more input data sets are 'joined' by a common key
  • Like a SQL join
28. Join Definitions
• Consider the input data [key, value]:
  • LHS = [0,a] [1,b] [2,c]
  • RHS = [0,A] [2,C] [3,D]
• Joins on the key:
  • Inner
    • [0,a,A] [2,c,C]
  • Outer (Left Outer, Right Outer)
    • [0,a,A] [1,b,null] [2,c,C] [3,null,D]
  • Left (Left Outer, Right Inner)
    • [0,a,A] [1,b,null] [2,c,C]
  • Right (Left Inner, Right Outer)
    • [0,a,A] [2,c,C] [3,null,D]
29. CoGrouping
• Before Joining, CoGrouping must happen
• Simply concurrent GroupBy operations on each input data set
30. GroupBy vs CoGroup
[Diagram: GroupBy of the LHS data alone vs CoGroup of the LHS (dept_id, name) and RHS (dept_id, dept_name) together; for each dept_id the CoGroup holds independent collections of unordered values from each input]
31. CoGroup Joined
[Diagram: the CoGrouped LHS and RHS rows joined on dept_id - 1001: Jim, Mary, Susan with Accounting; 1002: Fred, Wilma, Ernie, Barny with Shipping]
• Considering the previous data, a typical Inner Join
32. CoGrouping
• Mappers [n], [n+1]: each input's [K1,V1] -> <A1,B1,C1,D1>; Map emits <A2,B2> -> K2, [n]<C2,D2> -> V2 (values tagged with their input ordinal)
• Reducer: [K2,{V2,V2,...}] -> <A2,B2,{<C2,D2,C2,D2>,...}>; user Reduce emits <A3,B3> -> K3, <C3,D3> -> V3
• Maps must run for each input set in the same Job (n, n+1, etc.)
• CoGrouping must happen against each common key
33. Joining
• Reducer input: [K2,{V2,V2,...}], i.e. <A2,B2,{[n]<C2,D2>,[n+1]<C2,D2>,...}>
• The tagged values are separated into per-input collections: <A2,B2,{<C2,D2>,...},{<C2,D2>,...}>
• The CoGroups must be joined: Join({<C2,D2>,...}, {<C2,D2>,...}) -> <C2,D2,C2,D2>
• Finally the user Reduce can be applied: <A3,B3> -> K3, <C3,D3> -> V3
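A sketch of that reduce-side join step, assuming each Map has tagged its values with a source marker ("L" or "R" here, an illustrative convention, e.g. one tagging Mapper per input wired up via MultipleInputs); the Reducer separates the tags back into the two CoGrouped collections and emits their cross product, i.e. an inner join:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InnerJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> lhs = new ArrayList<>();
    List<String> rhs = new ArrayList<>();

    // CoGroup: split the mixed value stream back into per-source collections.
    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("L\t")) lhs.add(v.substring(2));
      else if (v.startsWith("R\t")) rhs.add(v.substring(2));
    }

    // Join: cross product of the two collections for this key.
    for (String l : lhs) {
      for (String r : rhs) {
        context.write(key, new Text(l + "\t" + r));
      }
    }
  }
}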
34. Optimizations
• Patterns for reducing IO
• Identity Mapper
• Map Side Join
• Combiners
• Partial Aggregates
• Similarity Joins
35. Identity Mapper
[Diagram: a Cascading flow plan in which a job's Map-side operations have been pushed into the previous job's Reduce, leaving only an identity function in the following Map]
• Move Map operations to the previous Reduce
• Replace them with an Identity function
• Assumes the Map operations reduce the data
36. Map Side Joins
• Bypasses the (immediate) need for a Reducer
• Symmetrical
  • Where LHS and RHS are of equivalent size
  • Requires data to be sorted on the key
• Asymmetrical
  • One side is small enough to fit in memory
  • Typically a hashtable lookup
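A minimal asymmetrical map-side join sketch: the small RHS table is loaded into an in-memory hash map in setup() and each LHS record is joined by lookup, so no Reducer is needed. The "depts.txt" file name and the tab-separated layouts are assumptions; the file would typically be shipped to each task, for example via the distributed cache:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private final Map<String, String> rhs = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // "depts.txt" is assumed to be available locally to the task and to
    // contain "dept_id<TAB>dept_name" lines.
    try (BufferedReader in = new BufferedReader(new FileReader("depts.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) rhs.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // LHS lines are assumed to be "dept_id<TAB>name"; emit the inner join.
    String[] parts = value.toString().split("\t", 2);
    String deptName = parts.length == 2 ? rhs.get(parts[0]) : null;
    if (deptName != null) {
      context.write(new Text(parts[0] + "\t" + parts[1] + "\t" + deptName), NullWritable.get());
    }
  }
}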
37. Combiners
[Diagram: the Map output is locally grouped and combined (["when",{1,1}] -> ["when",2]) before the shuffle; the Reducer then reduces the partial counts (["when",{2,1,2}] -> ["when",5]); the Combiner and the Reducer share the same implementation]
• Where Reduce runs Map side, and again Reduce side
• Only works if the Reduce function is commutative and associative
  • commutative: the order of the operands can be changed without changing the result; associative: the grouping of the operations can be changed without changing the result, as long as the operand sequence stays the same
• Reduces bandwidth by trading CPU for IO
  • Serialization/deserialization during local sorting before combining
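Wiring a combiner into a job is a one-line change, shown here as a sketch that reuses the TokenMapper and SumReducer from the word-count sketch above; reusing the reducer as the combiner is only valid because summing is commutative and associative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(WordCount.TokenMapper.class);
    job.setCombinerClass(WordCount.SumReducer.class); // Reduce run Map side, before the shuffle
    job.setReducerClass(WordCount.SumReducer.class);  // and again Reduce side
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}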
38. Partial Aggregates
[Diagram: the Map partially aggregates its own output in memory (["when",1] ["when",1] -> ["when",2]) before emitting; the Reducer sums the partials (["when",{2,1,2}] -> ["when",5]); this also provides an opportunity to promote the functionality of the next Map into this Reduce]
• Supports any aggregate type, while being composable with other aggregates
• Reduces bandwidth by trading Memory for IO
  • Very important for a CPU-constrained cluster
  • Use a bounded LRU to keep memory constant (requires tuning)
39. Partial Aggregates
[Diagram: four tasks each apply a "partial unique" to [a,b,c,a,a,b], emitting [a,b,c,a,b]; an LRU cache of size 2 traces as: a -> {a,_}, b -> {b,a}, c -> {c,b} discards a, a -> {a,c} discards b, a -> {a,c} (hit), b -> {b,a} discards c]
• It is OK that dupes are emitted from a Mapper and across Mappers (or previous Reducers!)
• Final aggregation happens in the Reducer
• The larger the cache, the fewer dupes
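A sketch of map-side partial aggregation with a bounded LRU cache (sometimes called in-mapper combining); the cache size and class names are illustrative. Evicted entries are emitted immediately as partial counts, which is fine because the Reducer performs the final sum:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PartialCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int CACHE_SIZE = 10_000; // requires tuning

  private LinkedHashMap<String, Integer> cache;
  private Context ctx;

  @Override
  protected void setup(Context context) {
    ctx = context;
    // an access-ordered LinkedHashMap acting as a bounded LRU cache
    cache = new LinkedHashMap<String, Integer>(CACHE_SIZE, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
        if (size() > CACHE_SIZE) {
          emit(eldest.getKey(), eldest.getValue()); // evict as a partial count
          return true;
        }
        return false;
      }
    };
  }

  private void emit(String word, int count) {
    try {
      ctx.write(new Text(word), new IntWritable(count));
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) cache.merge(token, 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context) {
    // flush whatever is still cached at the end of the split
    cache.forEach(this::emit);
  }
}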
40. Tradeoffs
• CPU for IO == fault tolerance
• Memory for IO == performance
41. Similarity Join
• Compare all LHS values to all RHS values to find duplicates (or similar values)
• Naive approaches
  • Cross Join (all data through one reducer)
  • In-common features (very common features will bottleneck)
42. Set-Similarity Joining
• "Efficient Parallel Set-Similarity Joins Using MapReduce" - R. Vernica, M. Carey, C. Li
• Only compare candidate pairs
• Candidates share uncommon features
43. Set-Similarity Joining (continued)
[Diagram, steps: 1: records; 2: count tokens; 3: order by least frequent, discard common; 4: uncommon features; 5: candidate pairs; 6: final compare]
• Records 1 and 3 share uncommon features
  • thus they are candidates for a full comparison
44. Set-Similarity Joining (job chain)
[Diagram: the Set-Similarity join as a chain of MapReduce jobs, each linked by intermediate files - Tokenize/Count Job -> Join Tokens/Counts Job -> Sort/Prefix Filter Job -> Self Join Job (match two sets using prefix filtering) -> Unique Pairs Job -> Join LHS Job -> Join RHS / Match Job]
45. Duality
• Note the use of the previous patterns to route data to implement a more efficient algorithm
46. Use a Higher Abstraction
• Command Line
  • Multitool - CLI for parallel sed, grep & joins
• API
  • Cascading - Java Query API and Planner
  • Plume - "approximate clone of FlumeJava"
• Interactive Shell
  • Cascalog - Clojure+Cascading query language (API also)
  • Pig - a text syntax
  • Hive - syntax + infrastructure - SQL "like"
47. References
• Set Similarity
  • http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
  • http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
• MapReduce Text Processing
  • http://www.umiacs.umd.edu/~jimmylin/book.html
• Plume/FlumeJava
  • http://portal.acm.org/citation.cfm?id=1806596.1806638
  • http://github.com/tdunning/Plume/wiki
48. I'm Hiring
• Enterprise Java server and web client
• Language design, compilers, and interpreters
• No Hadoop experience required
• More info
  • http://www.concurrentinc.com/careers/
49. Resources
• Chris K Wensel
  • chris@wensel.net
  • @cwensel
• Cascading & Cascalog
  • http://cascading.org
  • @cascading
• Concurrent, Inc.
  • http://concurrentinc.com
  • @concurrent
50. Appendix
51. Simple Total Sorting
• Where lines in a result file should be sorted
• Must set the number of reducers to 1
  • Sorting in MR is local per Reduce, not global across Reducers
52. Why Sorting Isn't "Total"
[Diagram: keys [aaa,aab,aac] and [zzx,zzy,zzz] are spread across Mappers; each Reducer receives a sorted but arbitrary subset, e.g. [aaa,zzx], [aac,zzz], [aab,zzy]]
• Keys emitted from Map are naturally sorted at a given Reducer
• But are partitioned to Reducers in a random way
• Thus, only one Reducer can be used for a total sort
53. Distributed Total Sort
• To work, the Shuffling phase must be modified with:
  • a custom Partitioner to partition on the distribution of ordered Keys
  • a custom Comparator for comparing Key types
    • Strings work by default
54. Distributed Total Sort - Details
[Diagram: a Trie of sampled key prefixes, e.g. a...z -> ar...ax, za...zo -> ara...ari, axe...axi, zag...zap, zon...zoo -> aran, aria, axis, zone]
• Sample all K2 values and build a balanced distribution for the number of reducers
  • Sample all input keys and divide into partitions
  • Write out the boundaries of the partitions
• Supply a Partitioner that looks up the partition for the current K2 value
  • Read the boundaries into a Trie (pronounced 'try') data structure
• Use an appropriate Comparator for the Key type
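A minimal sketch of the same idea using Hadoop's stock TotalOrderPartitioner and InputSampler (the paths and sampler parameters are illustrative assumptions): the sampler writes partition boundaries from sampled input keys, and the partitioner routes each key to the reducer owning its range, so the concatenated part files are globally sorted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "total sort");
    job.setJarByClass(TotalSort.class);

    // Identity map and reduce: the framework's shuffle does the sorting.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setNumReduceTasks(4); // multiple reducers, yet globally ordered output

    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sample input keys, write partition boundaries, and partition on them.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path(args[1] + "_partitions"));
    InputSampler.writePartitionFile(job, new InputSampler.RandomSampler<>(0.01, 10_000));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}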