Hadoop Design Patterns
2. Book was made available December 2012
Written by Donald Miner and Adam Shook,
Hadoop Architects at Greenplum.
© Copyright 2013 EMC Corporation. All rights reserved.
3. What Are Design Patterns?
(in general)
• Reusable solution frameworks to problems
• Domain independent
• Not a cookbook, but not a guide
• Not a finished solution
4. Why Design Patterns?
(in general)
Makes the intent of code easier to understand
Provides a common language for solutions
Be able to reuse code
Known performance profiles and limitations of solutions
5. Why MapReduce Design Patterns?
Recurring patterns in data-related problem solving
Groups are building patterns independently
Lots of new users every day
MapReduce is a new way of thinking
Foundation for higher-level tools (Pig, Hive, …)
Community is reaching the right level of maturity
6. Pattern Template
Each pattern follows a standard template
Intent
Motivation
Applicability
Structure
Consequences
Resemblances
Performance analysis
Examples
8. Filtering Patterns
Extract interesting subsets
Keep only a subset of the data
Filtering
– Removes records of data based on a condition
Bloom filtering
– Removes records of data based on a bloom filter membership test
Top ten
– Returns the top-k records, given a ranking criterion
Distinct
– Removes duplicates from a data set
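As a rough illustration of the Bloom filtering pattern, the sketch below builds a tiny Bloom filter in plain Python and uses it as a map-only membership test. The `BloomFilter` class, its bit and hash counts, and the sample records are assumptions made for illustration; this is not Hadoop's own Bloom filter implementation.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical; not Hadoop's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


def bloom_filter_map(records, bloom, key_of):
    """Map-only filter: keep records whose key passes the membership test."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record


# Assumed sample data: keep only comments written by "hot" users.
hot_users = BloomFilter()
for user in ("alice", "bob"):
    hot_users.add(user)

comments = [("alice", "nice post"), ("carol", "thanks"), ("bob", "+1")]
kept = list(bloom_filter_map(comments, hot_users, key_of=lambda r: r[0]))
```

A Bloom filter can report false positives but never false negatives, so the output may contain a few extra records but should never miss a record whose key was added.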
9. Summarization Patterns
Top-down summaries
Give a top-level view of the data
Numerical summarizations
– Perform numerical calculations on groups of data
Inverted index
– Build a lookup table
Counting with counters
– Count the occurrences of particular things
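The inverted index pattern can be sketched in a few lines of plain Python that simulate the map, shuffle, and reduce steps in one process. The function names and the two sample documents are assumptions for illustration only.

```python
from collections import defaultdict


def map_terms(doc_id, text):
    # Emit one (term, doc_id) pair per distinct word in the document.
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    # Collect the sorted list of documents that contain the term.
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle/sort phase
    for doc_id, text in docs.items():
        for term, value in map_terms(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_postings(t, ids) for t, ids in grouped.items())


# Assumed sample input: document id -> text.
docs = {"d1": "hadoop design patterns", "d2": "mapreduce design"}
index = build_inverted_index(docs)
```

Looking up `index["design"]` then gives the ids of every document that contains the word, which is the lookup table the pattern is meant to build.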
10. Data organization patterns
Reorganize, restructure
Change the way the data is organized
Structured to hierarchical
– Denormalize data into documents
Partitioning
– Place data into partitions based on a hash key
Binning
– Place each record into zero or more bins
Total order sorting
– Sort the data set in ascending or descending order
Shuffling
– Completely randomize the order of the data
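To contrast partitioning with binning, here is a minimal in-process sketch. It assumes a CRC32-based hash for partitioning and a hypothetical dictionary of named predicates for binning; the record layout is invented for the example.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Deterministic hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, key_of, num_partitions):
    # Partitioning: every record goes to exactly one partition.
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(key_of(record), num_partitions)].append(record)
    return partitions


def bin_record(record, bin_tests):
    # Binning: a record may match zero, one, or several bins.
    return [name for name, test in bin_tests.items() if test(record)]


# Assumed sample records: (tag, post id).
posts = [("hadoop", "q1"), ("pig", "q2"), ("hadoop", "q3")]
parts = partition_records(posts, key_of=lambda r: r[0], num_partitions=4)

bin_tests = {
    "hadoop": lambda r: r[0] == "hadoop",
    "questions": lambda r: r[1].startswith("q"),
}
bins_for_q1 = bin_record(("hadoop", "q1"), bin_tests)
```

The design difference is that partitioning assigns each record to exactly one place by key, while binning evaluates every bin test independently.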
11. Join patterns
Bringing data sets together
Take several data sets and bring them together into one
Reduce-side join
– General purpose join
Replicated join
– Replicates the smaller data set everywhere before the join
Composite join
– Joins if the data sets are sorted and partitioned in the same way
Cartesian product
– Match up every record to every other record
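The replicated join can be sketched as a map-only inner join, under the assumption that the smaller data set fits in memory on every mapper. The `replicated_join` helper and the sample users and comments are hypothetical.

```python
def replicated_join(large_records, small_table, key_of):
    """Map-side inner join: the small table is assumed to fit in memory."""
    for record in large_records:
        match = small_table.get(key_of(record))
        if match is not None:
            # Append the matching value from the small table to the record.
            yield record + (match,)


# Small data set: loaded in full by every mapper.
users = {1: "alice", 2: "bob"}
# Large data set: streamed through the mapper one record at a time.
comments = [(1, "nice post"), (3, "spam"), (2, "+1")]

joined = list(replicated_join(comments, users, key_of=lambda r: r[0]))
```

No shuffle or reduce phase is needed here, which is why this join is attractive when one side is small; a reduce-side join remains the general-purpose fallback when neither side fits in memory.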
12. Input and output patterns
Custom input and output
Perform custom behavior for input or output
Generating data
– Generate data from nothing
External source output
– Send data to an external source
External source input
– Pull data from an external source
Partition pruning
– Remove chunks of data because we know some parts are not useful
13. Example Pattern: “Top Ten”
(filtering)
Intent
Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here
14. Example Pattern: “Top Ten”
Applicability
Rank-able records
Limited number of output records
Consequences
The top K records are returned.
15. Example Pattern: “Top Ten”
Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record
class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
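The structure above can be simulated in plain Python as a rough sketch. It swaps the sorted list for a min-heap (`heapq`), runs the mappers and the single reducer in one process, and uses invented `(name, reputation)` records; none of this is Hadoop API code.

```python
import heapq

K = 10


def top_ten_mapper(records, rank_of, k=K):
    """Each mapper keeps only its local top k and emits them at cleanup."""
    top = []  # min-heap of (rank, index, record); index breaks ties
    for i, record in enumerate(records):
        heapq.heappush(top, (rank_of(record), i, record))
        if len(top) > k:
            heapq.heappop(top)  # drop the current lowest-ranked record
    # cleanup(): emit everything under a single null key
    return [(None, record) for _, _, record in top]


def top_ten_reducer(records, rank_of, k=K):
    """A single reducer sees every mapper's local top k and picks the global top k."""
    return heapq.nlargest(k, records, key=rank_of)


def run_top_ten(splits, rank_of, k=K):
    shuffled = []  # stands in for the shuffle to the one reducer
    for split in splits:
        shuffled.extend(record for _, record in top_ten_mapper(split, rank_of, k))
    return top_ten_reducer(shuffled, rank_of, k)


# Assumed sample input: two splits of (user name, reputation) records.
splits = [
    [("user%d" % i, i) for i in range(0, 50)],
    [("user%d" % i, i) for i in range(50, 100)],
]
result = run_top_ten(splits, rank_of=lambda r: r[1])
```

Because every mapper emits under the same null key, all local winners reach one reducer, which is what keeps the final ranking global.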
16. Example Pattern: “Top Ten”
Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
17. Example Pattern: “Top Ten”
Performance analysis
Pretty quick: map-heavy, low network usage
Pay attention to how many records the reducer is getting
[number of input splits] x K
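To make the bound concrete, the sketch below computes the reducer's worst-case input for an assumed 1 TB data set and an assumed 128 MB split size. Both figures are illustrative and not taken from the slides.

```python
import math


def reducer_input_records(input_bytes, split_bytes, k):
    # Each input split feeds one mapper, and each mapper emits at most k records.
    num_splits = math.ceil(input_bytes / split_bytes)
    return num_splits * k


one_tb = 1024 ** 4
split_size = 128 * 1024 ** 2  # assumed split size of 128 MB
records = reducer_input_records(one_tb, split_size, k=10)  # 8192 splits x 10
```

Under these assumptions the single reducer receives at most 81,920 records, which is small, but the figure grows linearly with K.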
Example
Top ten StackOverflow users by reputation
18. Pivotal Sessions at EMC World
Session | Presenter | Dates/Times
The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005