Hadoop Design Patterns

Donald Miner
@donaldpminer
Donald.Miner@emc.com

© Copyright 2013 EMC Corporation. All rights reserved.


The book (MapReduce Design Patterns) was made available in December 2012.
Written by Donald Miner and Adam Shook,
Hadoop Architects at Greenplum.



What Are Design Patterns?
(in general)

• Reusable solution frameworks to problems
• Domain independent
• Not a cookbook, but not a guide
• Not a finished solution



Why Design Patterns?
(in general)

• Makes the intent of code easier to understand
• Provides a common language for solutions
• Lets you reuse code
• Gives solutions known performance profiles and limitations



Why MapReduce Design Patterns?
• Recurring patterns in data-related problem solving
• Groups are building patterns independently
• Lots of new users every day
• MapReduce is a new way of thinking
• Foundation for higher-level tools (Pig, Hive, …)
• Community is reaching the right level of maturity


Pattern Template
Each pattern follows a standard template:
• Intent
• Motivation
• Applicability
• Structure
• Consequences
• Resemblances
• Performance analysis
• Examples



Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output



Filtering Patterns
Extract interesting subsets
Keep only a subset of the data
• Filtering
– Removes records of data based on a condition

• Bloom filtering
– Removes records of data based on a Bloom filter membership test

• Top ten
– Returns the top-k records, given a ranking criterion

• Distinct
– Removes duplicates from a data set
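
As an illustration of the Bloom filtering pattern above, here is a minimal sketch in plain Python rather than Hadoop code. The `BloomFilter` class, the `bloom_filter_map` function, and the sample data are hypothetical names invented for this example; a real job would typically build the filter ahead of time and load it in each mapper's setup step.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def bloom_filter_map(records, key_of, bloom):
    """Map-only filter: keep records whose key passes the membership test."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record


# Hypothetical data: keep only comments written by "interesting" users.
hot_users = BloomFilter()
for user in ("alice", "bob"):
    hot_users.add(user)

comments = [("alice", "hi"), ("carol", "hello"), ("bob", "hey")]
kept = list(bloom_filter_map(comments, lambda record: record[0], hot_users))
```

Because a Bloom filter can return false positives, `kept` is guaranteed to contain every comment by the users that were added, but it may occasionally let an unrelated record through.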


Summarization Patterns
Top-down summaries
Give a top-level view of the data
• Numerical summarizations
– Perform numerical calculations on groups of data

• Inverted index
– Build a lookup table

• Counting with counters
– Count the occurrences of particular things
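
A numerical summarization can be sketched as a map step that emits a partial summary per record and a reduce step that merges the partial summaries for each group. The following is a plain-Python simulation, not Hadoop code; the function names and the (user, comment length) records are assumptions made for illustration.

```python
from collections import defaultdict


def map_comment(record):
    # Emit (group key, (min, max, count)) for a single record.
    user, length = record
    yield user, (length, length, 1)


def reduce_summary(user, values):
    # Merge partial (min, max, count) tuples for one group.
    lowest = min(v[0] for v in values)
    highest = max(v[1] for v in values)
    count = sum(v[2] for v in values)
    return user, (lowest, highest, count)


def run(records):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in map_comment(record):
            groups[key].append(value)
    return dict(reduce_summary(key, values) for key, values in groups.items())


# Hypothetical input: (user, comment length) pairs.
summary = run([("alice", 12), ("bob", 40), ("alice", 7)])
# summary == {"alice": (7, 12, 2), "bob": (40, 40, 1)}
```

Because merging (min, max, count) tuples is associative and commutative, the same merge logic could also run as a combiner to cut down the data sent to the reducers.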



Data organization patterns
Reorganize, restructure
Change the way the data is organized
• Structured to hierarchical
– Denormalize data into documents
• Partitioning
– Place data into partitions based on a hash key
• Binning
– Place each record into zero or more bins
• Total order sorting
– Sort the data set in ascending or descending order
• Shuffling
– Completely randomize the order of the data
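
The difference between partitioning and binning is easy to miss: partitioning sends every record to exactly one partition, while binning may place a record in no bin, one bin, or several. A minimal plain-Python sketch follows; the function names, the modulo partitioner, and the (year, text) records are hypothetical.

```python
from collections import defaultdict


def partition(records, key_of, num_partitions):
    """Every record lands in exactly one partition, chosen from its key."""
    parts = defaultdict(list)
    for record in records:
        parts[key_of(record) % num_partitions].append(record)
    return dict(parts)


def bin_records(records, bin_tests):
    """Each record is tested against every bin, so it may land in 0..n bins."""
    bins = {name: [] for name in bin_tests}
    for record in records:
        for name, test in bin_tests.items():
            if test(record):
                bins[name].append(record)
    return bins


# Hypothetical records: (year, text).
posts = [(2011, "hadoop pig"), (2012, "hive"), (2012, "hadoop"), (2013, "misc")]

by_year = partition(posts, key_of=lambda r: r[0], num_partitions=3)
by_tag = bin_records(posts, {
    "hadoop": lambda r: "hadoop" in r[1],
    "pig": lambda r: "pig" in r[1],
})
```

Here the 2013 post falls into no bin and the first 2011 post falls into two, while each post still appears in exactly one partition.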



Join patterns
Bringing data sets together
Take several data sets and bring them together into one
• Reduce-side join
– General purpose join

• Replicated join
– Replicates the smaller data set everywhere before the join

• Composite join
– Joins if the data sets are sorted and partitioned in the same way

• Cartesian product
– Match up every record to every other record
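
A reduce-side join can be sketched as follows: the map step tags each record with the data set it came from, the shuffle groups records by join key, and the reduce step pairs up the two sides. This plain-Python simulation performs an inner join; the function name, the "U"/"C" tags, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(users, comments):
    # "Map": tag each record with its source and group by the join key.
    groups = defaultdict(lambda: {"U": [], "C": []})
    for user_id, name in users:
        groups[user_id]["U"].append(name)
    for user_id, text in comments:
        groups[user_id]["C"].append(text)

    # "Reduce": for each key, pair every user record with every comment record.
    joined = []
    for user_id in sorted(groups):
        tagged = groups[user_id]
        for name, text in product(tagged["U"], tagged["C"]):
            joined.append((user_id, name, text))
    return joined


# Hypothetical data sets sharing a user id as the join key.
users = [(1, "alice"), (2, "bob")]
comments = [(1, "hi"), (1, "bye"), (3, "orphan")]
joined = reduce_side_join(users, comments)
# joined == [(1, "alice", "hi"), (1, "alice", "bye")]
```

Keys that appear on only one side (user 2 and comment author 3 here) produce no output, which is what makes this an inner join; an outer join would emit those records with an empty counterpart instead.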


Input and output patterns
Custom input and output
Perform custom behavior for input or output
• Generating data
– Generate data from nothing

• External source output
– Send data to an external source

• External source input
– Pull data from an external source

• Partition pruning
– Remove chunks of data because we know some parts are not useful


Example Pattern: “Top Ten”
(filtering)

Intent

Retrieve a relatively small number of top K records, according
to a ranking scheme in your data set, no matter how large
the data.

Motivation

Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here



Applicability

Rank-able records
Limited number of output records

Consequences

The top K records are returned.



Structure
class mapper:
   setup():
      initialize top ten sorted list
   map(key, record):
      insert record into top ten sorted list
      if length of array is greater-than 10:
         truncate list to a length of 10
   cleanup():
      for record in top sorted ten list:
         emit null, record
class reducer:
   setup():
      initialize top ten sorted list
   reduce(key, records):
      sort records
      truncate records to top 10
      for record in records:
         emit record
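
The structure above can be simulated in plain Python to see how the two phases fit together. This is a sketch under stated assumptions, not Hadoop code: the function names are hypothetical, records are (reputation, user) tuples, and a min-heap stands in for the "top ten sorted list".

```python
import heapq

K = 10


def top_k_mapper(split, k=K):
    """Keep only the k largest records seen in one input split."""
    top = []  # min-heap, so the smallest kept record is cheap to evict
    for record in split:
        heapq.heappush(top, record)
        if len(top) > k:
            heapq.heappop(top)  # drop the current smallest
    return top  # emitted from cleanup() under a single null key


def top_k_reducer(candidates, k=K):
    """Sort the candidates from all mappers and truncate to the top k."""
    return sorted(candidates, reverse=True)[:k]


def top_k_job(splits, k=K):
    candidates = []
    for split in splits:  # each split is handled by its own mapper
        candidates.extend(top_k_mapper(split, k))
    return top_k_reducer(candidates, k)


# Hypothetical input: two splits of (reputation, user) records.
splits = [
    [(r, f"user{r}") for r in range(0, 50)],
    [(r, f"user{r}") for r in range(50, 100)],
]
top_ten = top_k_job(splits)
# top_ten[0] == (99, "user99")
```

Each mapper discards everything outside its local top ten, so only a handful of records per split ever cross the network.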



Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;

Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;



Performance analysis

Pretty quick: map-heavy, low network usage
Pay attention to how many records the single reducer receives:
at most [number of input splits] x K
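
To get a feel for that bound, here is a small worked calculation in Python. The function name and the split sizes are hypothetical; the only assumption taken from the pattern is that each mapper emits at most K records.

```python
def reducer_input_records(split_sizes, k):
    """Upper bound on records reaching the reducer: each mapper emits at most k."""
    return sum(min(k, size) for size in split_sizes)


# Hypothetical job: 200 input splits of 1,000 records each, K = 10.
print(reducer_input_records([1000] * 200, 10))  # 2000
```

The reducer's load grows with the number of splits and with K, not with the total data size, which is why the pattern stays cheap only while K is small.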

Example

Top ten StackOverflow users by reputation



Pivotal Sessions at EMC World
Session | Presenter | Dates/Times
The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005

