Common MapReduce Patterns

     Chris K Wensel

     BuzzWords 2011
Engineer, Not Academic
•   Concurrent, Inc., Founder
     • Cascading support and tools
     • http://concurrentinc.com/

•   Cascading, Lead Developer (started Sept 2007)
     •  An alternative API to MapReduce
     •  http://cascading.org/

•   Formerly Hadoop mentoring and training
     •  Sun - Apple - HP - LexisNexis - startups - etc

•   Formerly Systems Architect & Consultant
     •  Thomson/Reuters - TeleAtlas - startups - etc
Overview

• MapReduce
• Heavy Lifting
• Analytics
• Optimizations
MapReduce
• A “divide and conquer” strategy for parallelizing
  workloads against collections of data

• Map & Reduce are two user defined functions
  chained via Key Value Pairs

• It’s really Map->Group->Reduce where Group is
  built in

Keys and Values

       [K1,V1]            Map      [K2,V2]*

       [K2,V2]            Group    [K2,{V2,V2,....}]

       [K2,{V2,V2,....}]  Reduce   [K3,V3]*

                                   * = zero or more

•   Map translates input keys and values to new keys
    and values

•   The system Groups each unique key with all its
    values

•   Reduce translates the values of each unique key
    to new keys and values
Word Count
 Mapper
  [0, "when in the course of
        human events"]            Map     ["when",1]  ["in",1]  ["the",1]  [...,1]

  ["when",1] ["when",1]
  ["when",1] ["when",1]
  ["when",1]                      Group   ["when",{1,1,1,1,1}]

 Reducer

  ["when",{1,1,1,1,1}]            Reduce  ["when",5]
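For reference, a minimal word count in the Hadoop Java API; the class layout is illustrative, not from the deck:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {
    // Map: [offset, line] -> ["word", 1] for every token in the line
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce: ["word", {1,1,1,1,1}] -> ["word", 5]
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values)
          sum += v.get();
        context.write(key, new IntWritable(sum));
      }
    }
  }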
Divide and Conquer Parallelism

• The ‘records’ entering the Map and the ‘groups’
  entering the Reduce are independent

• That is, there is no expectation of order and no
  requirement to share state between records/groups

• So arbitrary numbers of Map and Reduce function
  instances can be created against arbitrary portions
  of input data
Cluster

  [diagram: a cluster of racks, each rack holding nodes; every node
   runs several map task instances and reduce task instances]

• Multiple instances of each Map and Reduce
  function are distributed throughout the cluster
Another View

  [K1,V1]  Map  [K2,V2]  Combine  Group  [K2,{V2,...}]  Reduce  [K3,V3]
                (the Combine and Reduce stages run the same code)

  [diagram: many Mapper Tasks read input splits (split1, split2, ...)
   from a file; the Shuffle routes their output to Reducer Tasks, which
   write part-00000 ... part-000N into an output directory]

• Mappers must complete before Reducers can begin
Complex Job Assemblies
•   Real applications are many MapReduce jobs chained together

•   Linked by intermediate (usually temporary) files

•   Executed in order, by hand, from the ‘client’ application

      Count Job:  File -> Map -> [ k, v ] -> [ k, [v] ] -> Reduce -> [ k, v ] -> File
      Sort Job:   File -> Map -> [ k, v ] -> [ k, [v] ] -> Reduce -> [ k, v ] -> File

                  [ k, v ]   = key and value pair
                  [ k, [v] ] = key and associated values collection
Real World Apps

  [diagram: the dependency graph of a single real application,
   75 chained MapReduce jobs rendered as [1/75] ... [75/75]]

1 app, 75 jobs

  green  = map + reduce
  purple = map
  blue   = join/merge
  orange = map split
Heavy Lifting
•   Things we must do because data can be heavy

•   These patterns are natural to MapReduce and easy to implement

•   But they leave some room for composition/aggregation within a
    Map/Reduce (e.g., Filter + Binning)

•   (leading us to think of Hadoop as an ETL framework)

•   Record Filtering
•   Parsing, Conversion
•   Counting, Summing
•   Unique
•   Binning
•   Distributed Tasks
Record Filtering

• Think unix ‘grep’
• Filtering is discarding unwanted values (or
  preserving wanted)

• Only uses a Map function, no Reducer

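A map-only sketch of this pattern in Hadoop Java; the "grep.pattern" configuration key is a made-up name:

  import java.io.IOException;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // grep-like filter: keep only the lines matching a regex
  public class GrepMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      // the pattern comes from job configuration; the key name is illustrative
      pattern = Pattern.compile(
          context.getConfiguration().get("grep.pattern", "ERROR"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (pattern.matcher(value.toString()).find())
        context.write(NullWritable.get(), value); // discard non-matching lines
    }
  }
  // Driver note: job.setNumReduceTasks(0) makes this a map-only job,
  // so mapper output is written straight to the output files.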
Parsing, Conversion
•   Think unix ‘sed’

•   A Map function that takes an input key and/or value and
    translates it into a new format

•   Examples:

    •   raw logs to delimited text or an archive-efficient binary format

    •   entity extraction

Counting, Summing

• The same as SQL aggregation functions
• Simply applying some function to the values
  collection seen in Reduce

• Other examples:
 • average, max, min, unique
Merging
•   Where many files of the same type are converted to one
    output path
•   Map side merges
    •   One directory with as many part files as Mappers
•   Reduce side merges
    •   Allows for removing duplicates or deleted items
    •   One directory with as many part files as Reducers
•   Examples
    •   Nutch
    •   Normalizing log files (apache, log4j, etc)
Binning
•   Where the values associated w/ unique keys are
    persisted together
•   Typically a directory path based on the key’s value
•   Must be conscious of total open files; remember,
    HDFS has no appends
•   Examples:
    •   web log files by year/month/day
    •   trade data by symbol
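One way to implement binning with the Hadoop Java API is MultipleOutputs; the date-shaped key and path layout below are assumptions:

  import java.io.IOException;

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  // bin records into one directory per key, e.g. year/month/day
  public class BinningReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
      out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values)
        // third argument is a base path relative to the job output directory
        out.write(NullWritable.get(), value, key.toString() + "/part");
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      out.close(); // close every open bin file
    }
  }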
Distributed Tasks
•   Simply where a Map or Reduce function executes some
    ‘task’ based on the input key and value.
•   Examples:
    •   web crawling
    •   load testing services
    •   rdbms/nosql updates
    •   file transfers (S3)
    •   image to pdf (NYT on EC2)
Basic Analytic Patterns
•   Some of these patterns are unnatural to MapReduce

•   We think in terms of columns/fields, not key value
    pairs

•   (leading us to think of Hadoop as an RDBMS)

    •   Group By
    •   Unique
    •   Secondary Sort
    •   Secondary Unique
    •   CoGrouping and Joining
Composite Keys/Values
              [K1,V1]     <A1,B1,C1,...>

• It is easier to think in columns/fields
 • e.g. “firstname” & “lastname”, not “line”
• Whether a set of columns are Keys or Values is
  arbitrary
• Keys become a means to piggyback the grouping and
  sorting machinery of MR, and so become an
  implementation detail
Group By

    GroupBy on “dept_id”:

      1001:  Jim, Mary, Susan
      1002:  Fred, Wilma, Ernie, Barny

•   Group By is where Value fields are grouped by Grouping fields
•   Above, the Map output key is “dept_id” and the value is “name”
Group By

    Mapper
      piggyback code:     [K1,V1]
      user code (Map):    [K1,V1] -> <A1,B1,C1,D1>
      piggyback code:     <A2,B2> -> K2, <C2,D2> -> V2, emitting [K2,V2]

    Reducer
      piggyback code:     [K2,{V2,V2,....}]
      user code (Reduce): [K2,V2] -> <A2,B2,{<C2,D2>,...}>
      piggyback code:     <A3,B3> -> K3, <C3,D3> -> V3, emitting [K3,V3]

•   So the K2 key becomes a composite Key of
    • key: [grouping], value: [values]
Unique
 Mapper
  [0, "when in the course of
        human events"]             Map     ["when",null]  ["in",null]  [...,null]

  ["when",null] ["when",null]
  ["when",null] ["when",null]
  ["when",null]                    Group   ["when",{nulls}]

 Reducer

  ["when",{nulls}]                 Reduce  ["when",null]

•   Or Distinct (as in SQL)
•   Globally finding all the unique values in a dataset
    • Usually finding unique values in a column
•   Often used to filter a second dataset using a join
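A minimal Hadoop Java sketch, assuming the column to unique has already been extracted as the map input value:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class Distinct {
    // Map: emit the value itself as the key, with a null placeholder value
    public static class DistinctMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
      }
    }

    // Reduce: the group key itself is the unique value; emit it once
    public static class DistinctReducer
        extends Reducer<Text, NullWritable, Text, NullWritable> {
      @Override
      protected void reduce(Text key, Iterable<NullWritable> values, Context context)
          throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
      }
    }
  }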
Secondary Sort
      (group)        (sorted value)    (remaining value)
    Date             Time              Url

    08/08/2008,      1:00:00,          http://www.example.com/foo
    08/08/2008,      1:01:00,          http://www.example.com/bar
    08/08/2008,      1:01:30,          http://www.example.com/baz

• Secondary Sorting is where
  • some Fields are grouped on, and
  • some of the remaining Fields are sorted within
    their grouping
Secondary Sort

    Mapper
      piggyback code:     [K1,V1]
      user code (Map):    [K1,V1] -> <A1,B1,C1,D1>
      piggyback code:     <A2,B2><C2> -> K2, <D2> -> V2, emitting [K2,V2]

    Reducer
      piggyback code:     [K2,{V2,V2,....}]
      user code (Reduce): [K2,V2] -> <A2,B2,{<C2,D2>,...}>
      piggyback code:     <A3,B3> -> K3, <C3,D3> -> V3, emitting [K3,V3]

•   So the K2 key becomes a composite Key of
    • key: [grouping, secondary], value: [remaining values]
•   The trick is to let the secondary fields piggyback the Reduce
    sort, yet not be compared during the unique key comparison
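A sketch of that piggybacking with the raw Hadoop Java API, grouping on a date and secondarily sorting on a time; every class name here is an illustrative assumption:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapreduce.Partitioner;

  // composite key: [grouping: date][secondary: time]
  public class DateTimeKey implements WritableComparable<DateTimeKey> {
    final Text date = new Text();
    final Text time = new Text();

    public void write(DataOutput out) throws IOException {
      date.write(out);
      time.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      date.readFields(in);
      time.readFields(in);
    }

    // shuffle sort order: date first, then time (the piggybacked sort)
    public int compareTo(DateTimeKey other) {
      int c = date.compareTo(other.date);
      return c != 0 ? c : time.compareTo(other.time);
    }
  }

  // partition on the grouping field only, so all times for a date co-locate
  class DatePartitioner extends Partitioner<DateTimeKey, Text> {
    public int getPartition(DateTimeKey key, Text value, int numPartitions) {
      return (key.date.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // group on the date alone; time is ignored in the unique key comparison
  class DateGroupingComparator extends WritableComparator {
    protected DateGroupingComparator() {
      super(DateTimeKey.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
      return ((DateTimeKey) a).date.compareTo(((DateTimeKey) b).date);
    }
  }

  // driver wiring:
  //   job.setPartitionerClass(DatePartitioner.class);
  //   job.setGroupingComparatorClass(DateGroupingComparator.class);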
Secondary Unique
 Mapper                                       Assume Secondary Sorting
                                                 magic happens here
  [0, "when in the course of
        human events"]        Map     [0,"when"]  [0,"in"]  [0,"the"]  [0,...]

  [0,"when"] [0,"in"]
  [0,"the"]  [0,"when"]
  [0,...]                     Group   [0,{"in","in","the","when","when",...}]

 Reducer

  [0,{"in","in","the","when","when",...}]   Reduce   ["in",null]  ["the",null]  ["when",null]

•   Secondary Unique is where the grouped values are uniqued
    • .... in a “scale free” way
•   Perform a Secondary Sort...
•   The Reducer removes duplicates by discarding every value that
    matches the previous value
    • since values are now ordered, there is no need to maintain a
      Set of values
Joining

    lhs data (dept_id -> name)             rhs data (dept_name)

      1001:  Jim, Mary, Susan          ->  Accounting
      1002:  Fred, Wilma, Ernie, Barny ->  Shipping

•   Where two or more input data sets are ‘joined’ by a
    common key
    • Like a SQL join
Join Definitions
•   Consider the input data [key, value]:
    •  LHS = [0,a] [1,b] [2,c]
    •  RHS = [0,A]       [2,C] [3,D]
•   Joins on the key:
    •  Inner
      •   [0,a,A] [2,c,C]
    •  Outer (Left Outer, Right Outer)
      •   [0,a,A] [1,b,null] [2,c,C] [3,null,D]
    •  Left (Left Inner, Right Outer)
      •   [0,a,A] [1,b,null] [2,c,C]
    •  Right (Left Outer, Right Inner)
      •   [0,a,A] [2,c,C] [3,null,D]
CoGrouping


• Before Joining, CoGrouping must happen
• Simply concurrent GroupBy operations on each
  input data set



GroupBy vs CoGroup

    GroupBy (lhs data only)                CoGroup (lhs and rhs data)

      1001:  Jim, Mary, Susan                1001:  {Jim, Mary, Susan}  {Accounting}
      1002:  Fred, Wilma, Ernie, Barny       1002:  {Fred, Wilma, Ernie, Barny}  {Shipping}

    (in a CoGroup, each key holds independent collections of
     unordered values, one per input data set)
CoGroup Joined

    1001:  Jim    Accounting
           Mary   Accounting
           Susan  Accounting
    1002:  Fred   Shipping
           Wilma  Shipping
           Ernie  Shipping
           Barny  Shipping

• Considering the previous data, a typical Inner Join
CoGrouping

    Mapper [n]:      [K1,V1]   ->  <A1,B1,C1,D1>
    Mapper [n+1]:    [K1',V1'] ->  <A1',B1',C1',D1'>
    Map:             <A2,B2> -> K2, [n]<C2,D2> -> V2, emitting [K2,V2]
                     (each value is tagged with its input set: [n], [n+1], ...)

    Reducer:         [K2,{V2,V2,....}]
                     [K2,V2] -> <A2,B2,{<C2,D2,C2',D2'>,...}>
    Reduce:          <A3,B3> -> K3, <C3,D3> -> V3, emitting [K3,V3]

•   Maps must run for each input set in the same Job (n, n+1, etc)
•   CoGrouping must happen against each common key
Joining
    Reducer:  [K2,{V2,V2,....}]

      [K2,V2] -> <A2,B2,{[n]<C2,D2>,[n+1]...}>

      split by tag:  <A2,B2,{<C2,D2>,...},{<C2',D2'>,...}>

      Join:  {<C2,D2>,...} x {<C2',D2'>,...}  ->  <C2,D2,C2',D2'>

      giving:  <A2,B2,{<C2,D2,C2',D2'>,...}>

      Reduce:  <A3,B3> -> K3, <C3,D3> -> V3, emitting [K3,V3]

•   The CoGroups must be joined

•   Finally the Reduce can be applied
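A reduce-side sketch of this in Hadoop Java; the "L"/"R" source tags each Map prepends to its values are an assumed convention:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // each mapper emits [join key, "L\t..."] or [join key, "R\t..."]
  public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> lhs = new ArrayList<String>();
      List<String> rhs = new ArrayList<String>();

      // co-group: separate the tagged values back into per-source collections
      for (Text value : values) {
        String v = value.toString();
        if (v.startsWith("L\t"))
          lhs.add(v.substring(2));
        else
          rhs.add(v.substring(2));
      }

      // inner join: the cross product of the two collections
      for (String l : lhs)
        for (String r : rhs)
          context.write(key, new Text(l + "\t" + r));
    }
  }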
Optimizations

 • Patterns for reducing IO

• Identity Mapper
• Map Side Join
• Combiners
• Partial Aggregates
• Similarity Joins
Identity Mapper

  [diagram: the Cascading flow plan of a real job — log import, paid and
   organic counts, GroupBys, CoGroups, and HBase sinks — with one Mapper
   highlighted that runs only an identity function]

•   Move Map operations to the previous Reduce

•   Replace the Map with an identity function

•   Assumes the Map operations reduce the data
Map Side Joins
• Bypasses the (immediate) need for a Reducer
• Symmetrical
   • Where LHS and RHS are of equivalent size
   • Requires data to be sorted on key
• Asymmetrical
 • One side is small enough to fit in memory
 • Typically a hashtable lookup
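A sketch of the asymmetrical case in Hadoop Java; the file name and tab-delimited layout are assumptions (the small side would usually arrive via the distributed cache):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // asymmetrical map side join: small RHS in memory, no reducer needed
  public class HashJoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> deptNames = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
      // load the small side into a hashtable; "depts.tsv" is illustrative
      BufferedReader in = new BufferedReader(new FileReader("depts.tsv"));
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t", 2); // dept_id TAB dept_name
        deptNames.put(fields[0], fields[1]);
      }
      in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2); // dept_id TAB name
      String deptName = deptNames.get(fields[0]);
      if (deptName != null) // inner join: drop LHS records with no match
        context.write(new Text(fields[1]), new Text(deptName));
    }
  }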
Combiners
 Mapper
  [0, "when in the course of
        human events"]           Map     ["when",1]  ["in",1]  ["the",1]  [...,1]

  Combiner
    ["when",1]
    ["when",1]        Group   ["when",{1,1}]

    ["when",{1,1}]    Reduce  ["when",2]
                                           (same implementation as the Reducer)
 Reducer
    ["when",2]
    ["when",1]        Group   ["when",{2,1,2}]
    ["when",2]

    ["when",{2,1,2}]  Reduce  ["when",5]

•   Where Reduce runs Map side, and again Reduce side
•   Only works if the Reduce function is commutative and associative
•   Reduces bandwidth by trading CPU for IO
    • Serialization/deserialization during local sorting before combining
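Wiring a combiner takes one line in the Hadoop Java driver when the Reduce is commutative and associative; this sketch reuses the word count classes from earlier (Hadoop 2 style wiring):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  public class WordCountDriver {
    public static Job createJob(Configuration conf) throws Exception {
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setCombinerClass(WordCount.SumReducer.class); // partial sums, map side
      job.setReducerClass(WordCount.SumReducer.class);  // final sums, reduce side
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      return job;
    }
  }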
Partial Aggregates
 Mapper
  [0, "when in the course of
        human events"]           Map     ["when",1]  ["in",1]  ["the",1]  [...,1]

  Partial
    ["when",1]
    ["when",1]        ->      ["when",2]
                                           (provides an opportunity to promote the
                                            functionality of the next Map to this Reduce)
 Reducer
    ["when",2]
    ["when",1]        Group   ["when",{2,1,2}]
    ["when",2]

    ["when",{2,1,2}]  Reduce  ["when",5]

•   Supports any aggregate type, while being composable with other
    aggregates
•   Reduces bandwidth by trading Memory for IO
    • Very important for a CPU constrained cluster
    • Use a bounded LRU to keep constant memory (requires tuning)
Partial Aggregates

    [a,b,c,a,a,b]  ->  partial unique  ->  [a,b,c,a,b]
    [a,b,c,a,a,b]  ->  partial unique  ->  [a,b,c,a,b]

    LRU, cache size of 2:

      incoming value      cache       discarded value
            a      ->     {a,_}   ->        _
            b      ->     {b,a}   ->        _
            c      ->     {c,b}   ->        a
            a      ->     {a,c}   ->        b
            a      ->     {a,c}
            b      ->     {b,a}   ->        c

•   It is OK that dupes emit from a Mapper and across
    Mappers (or previous Reducers!)
•   Final aggregation happens in the Reducer
•   The larger the cache, the fewer the dupes
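A map-side sketch in Java using LinkedHashMap's access-order mode as the bounded LRU; the cache size is illustrative and, as noted above, requires tuning:

  import java.io.IOException;
  import java.util.LinkedHashMap;
  import java.util.Map;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // partial word counts in constant memory: evicted entries flush early
  public class PartialSumMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int CACHE_SIZE = 10000; // requires tuning
    private Map<String, Integer> cache;

    @Override
    protected void setup(final Context context) {
      // access-order LinkedHashMap evicts the least recently used entry
      cache = new LinkedHashMap<String, Integer>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
          if (size() > CACHE_SIZE) {
            emit(context, eldest.getKey(), eldest.getValue()); // partial count
            return true; // discard; dupes across flushes are fine
          }
          return false;
        }
      };
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      for (String word : value.toString().split("\\s+")) {
        Integer count = cache.get(word);
        cache.put(word, count == null ? 1 : count + 1);
      }
    }

    @Override
    protected void cleanup(Context context) {
      for (Map.Entry<String, Integer> e : cache.entrySet())
        emit(context, e.getKey(), e.getValue()); // flush remaining partials
    }

    private void emit(Context context, String word, int count) {
      try {
        context.write(new Text(word), new IntWritable(count));
      } catch (IOException e) {
        throw new RuntimeException(e);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
  }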
Tradeoffs


• CPU for IO == fault tolerance
• Memory for IO == performance


Similarity Join
• Compare all values on the LHS to all values on the
  RHS to find duplicates (or similar values)
• Naive approaches
 • Cross Join (all data through one reducer)
 • In-common features (very common features will
    bottleneck)
Set-Similarity Joining


• “Efficient Parallel Set-Similarity Joins Using
  MapReduce” - R Vernica, M Carey, C Li

• Only compare candidate pairs
• Candidates share uncommon features
  [diagram: the set-similarity pipeline over example records 1..4]

    1: records
    2: count tokens
    3: order tokens by least frequent, discard common
    4: find uncommon features in common
    5: candidate pairs
    6: final compare

•   Records 1 and 3 share uncommon features
•   Thus they are candidates for a full comparison
Match Two Sets Using Prefix Filtering

  [diagram: the job pipeline, each stage a Map/Reduce job linked by files]

    Tokenize / Count Job
    Join Tokens/Counts Job
    Sort/Prefix Filter Job
    Self Join Job
    Unique Pairs Job
    Join LHS Job
    Join RHS / Match Job  ->  File
Duality


• Note the use of the previous patterns to route
  data to implement a more efficient algorithm




Use a Higher Abstraction
•   Command Line
    • Multitool - CLI for parallel sed, grep & joins

•   API
    • Cascading - Java Query API and Planner
    • Plume - “approximate clone of FlumeJava”

•   Interactive Shell
    •  Cascalog - Clojure+Cascading query language (API also)
    •  Pig - a text syntax
    •  Hive - syntax + infrastructure - SQL “like”
References

•   Set Similarity
    •  http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
    •  http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

•   MapReduce Text Processing
    • http://www.umiacs.umd.edu/~jimmylin/book.html

•   Plume/FlumeJava
    •  http://portal.acm.org/citation.cfm?id=1806596.1806638
    •  http://github.com/tdunning/Plume/wiki



I’m Hiring

• Enterprise Java server and web client
• Language design, compilers, and interpreters
• No Hadoop experience required
• More info
 • http://www.concurrentinc.com/careers/
Resources
•   Chris K Wensel
    • chris@wensel.net
    • @cwensel

•   Cascading & Cascalog
    • http://cascading.org
    • @cascading

•   Concurrent, Inc.
    • http://concurrentinc.com
    • @concurrent
Appendix



Simple Total Sorting

•   Where lines in a result file should be sorted


•   Must set number of reducers to 1
    •   Sorting in MR is local per Reduce, not global across
        Reducers



Why Sorting Isn’t “Total”

    [aaa,aab,aac]   Mapper    aaa
                    Mapper    aac     Reducer   [aaa,zzx]
                              aab
                    Mapper            Reducer   [aac,zzz]
                              zzx
                    Mapper    zzz     Reducer   [aab,zzy]
                              zzy
    [zzx,zzy,zzz]   Mapper

•   Keys emitted from Map are naturally sorted at a given Reducer
•   But they are Partitioned to Reducers in a random way
•   Thus, only one Reducer can be used for a total sort
Distributed Total Sort

• To work, the Shuffling phase must be modified
  with:
 • Custom Partitioner to partition on the
    distribution of ordered Keys
 • Custom Comparator for comparing Key types
  • Strings work by default
Distributed Total Sort -
            Details
       [Diagram: sampled keys arranged in a trie; the root branches on first
       letters (a..z), interior nodes on longer prefixes (ar..ax, za..zo, then
       ara..ari, zon..zoo), with leaves such as aran, aria, axis, and zone, so
       a key’s partition is found by walking its prefix.]
•   Sample all K2 values and build a balanced distribution across the number of reducers

    •   Sample all input keys and divide into partitions

    •   Write out boundaries of partitions

•   Supply Partitioner that looks up partition for current K2 value

    •   Read boundaries into a Trie (pronounced ‘try’) data structure (see the lookup sketch below)

•   Use appropriate Comparator for Key type
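
To make the Partitioner’s lookup concrete, here is a minimal sketch assuming String keys; a binary search over the sorted cut points stands in for Hadoop’s byte-level trie, since both answer the same “which range holds this key” question:

```java
import java.util.Arrays;

// hypothetical helper: maps a key to the reducer whose range contains it,
// given the numReduceTasks - 1 boundary keys written during sampling
public class BoundaryLookup {
  private final String[] cutPoints; // sorted; cutPoints[i] opens partition i + 1

  public BoundaryLookup(String[] sortedCutPoints) {
    this.cutPoints = sortedCutPoints;
  }

  public int getPartition(String key) {
    int pos = Arrays.binarySearch(cutPoints, key);
    // an exact match on cutPoints[i] belongs to partition i + 1; a miss
    // returns -(insertionPoint) - 1, and the insertion point is the partition
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }
}
```

A trie replaces the per-comparison string work of binary search with a walk proportional to the key’s prefix length, which pays off at shuffle volumes.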
                                                                 Copyright Concurrent, Inc. 2011. All rights reserved.
