Hadoop Design
Patterns
Donald Miner
@donaldpminer
Donald.Miner@emc.com

© Copyright 2013 EMC Corporation. All rights reser...
Book was made available December 2012
Written by Donald Miner and Adam Shook,
Hadoop Architects at Greenplum.

For new users of Hadoop, MapReduce is unfamiliar territory. MapReduce design patterns document the knowledge and lessons learned by seasoned Hadoop developers so that new developers can draw on the experts’ experience in solving problems. This talk outlines a few of the most popular patterns and gives an overview of the rest.


After this session you will be able to:
Objective 1: Understand what kinds of problems are solvable by Hadoop and MapReduce.
Objective 2: Understand why Hadoop engineers need to know what MapReduce Design Patterns are and what they are useful for day-to-day.
Objective 3: Begin to understand how to summarize, reorganize, and search through your data with Hadoop and MapReduce.

Published in: Technology, Business
Hadoop Design Patterns

  1. Hadoop Design Patterns
     Donald Miner, @donaldpminer, Donald.Miner@emc.com
     © Copyright 2013 EMC Corporation. All rights reserved.
  2. Book was made available December 2012. Written by Donald Miner and Adam Shook, Hadoop Architects at Greenplum.
  3. What Are Design Patterns? (in general)
     • Reusable solution frameworks to problems
     • Domain independent
     • Not a cookbook, but not a guide
     • Not a finished solution
  4. Why Design Patterns? (in general)
     • Makes the intent of code easier to understand
     • Provides a common language for solutions
     • Be able to reuse code
     • Known performance profiles and limitations of solutions
  5. Why MapReduce Design Patterns?
     • Recurring patterns in data-related problem solving
     • Groups are building patterns independently
     • Lots of new users every day
     • MapReduce is a new way of thinking
     • Foundation for higher-level tools (Pig, Hive, …)
     • Community is reaching the right level of maturity
  6. Pattern Template
     Each pattern follows a standard template:
     • Intent
     • Consequences
     • Motivation
     • Resemblances
     • Applicability
     • Performance analysis
     • Structure
     • Examples
  7. Pattern Categories
     • Summarization
     • Filtering
     • Data Organization
     • Joins
     • Metapatterns
     • Input and output
  8. Filtering Patterns: Extract interesting subsets
     Keep only a subset of the data
     • Filtering – Removes records of data based on a condition
     • Bloom filtering – Removes records of data based on a bloom filter membership test
     • Top ten – Returns the top-k records, given a ranking criteria
     • Distinct – Remove duplicates from a data set
  9. Summarization Patterns: Top-down summaries
     Give a top-level view of the data
     • Numerical summarizations – Perform numerical calculations on groups of data
     • Inverted index – Build a lookup table
     • Counting with counters – Count the occurrences of particular things
  10. Data organization patterns: Reorganize, restructure
     Change the way the data is organized
     • Structured to hierarchical – Denormalize data into documents
     • Partitioning – Place data into partitions based on a hash key
     • Binning – Place each record into zero or more bins
     • Total order sorting – Sort the data set in ascending or descending order
     • Shuffling – Completely randomize the order of the data
  11. Join patterns: Bringing data sets together
     Take several data sets and bring them together into one
     • Reduce-side join – General purpose join
     • Replicated join – Replicates the smaller data set everywhere before the join
     • Composite join – Joins if the data sets are sorted and partitioned in the same way
     • Cartesian product – Match up every record to every other record
  12. Input and output patterns: Custom input and output
     Perform custom behavior for input or output
     • Generating data – Generate data from nothing
     • External source output – Send data to an external source
     • External source input – Pull data from an external source
     • Partition pruning – Remove chunks of data because we know some parts are not useful
  13. Example Pattern: “Top Ten” (filtering)
     Intent
     Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
     Motivation
     • Finding outliers
     • Top ten lists are fun
     • Building dashboards
     • Sorting/Limit isn’t going to work here
  14. Example Pattern: “Top Ten”
     Applicability
     • Rank-able records
     • Limited number of output records
     Consequences
     • The top K records are returned.
  15. Example Pattern: “Top Ten”
     Structure
       class mapper:
         setup():
           initialize top ten sorted list
         map(key, record):
           insert record into top ten sorted list
           if length of array is greater-than 10:
             truncate list to a length of 10
         cleanup():
           for record in top sorted ten list:
             emit null, record

       class reducer:
         setup():
           initialize top ten sorted list
         reduce(key, records):
           sort records
           truncate records to top 10
           for record in records:
             emit record
  16. Example Pattern: “Top Ten”
     Resemblances
     SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
     Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
  17. Example Pattern: “Top Ten”
     Performance analysis
     • Pretty quick: map-heavy, low network usage
     • Pay attention to how many records the reducer is getting: [number of input splits] x K
     Example
     • Top ten StackOverflow users by reputation
  18. Pivotal Sessions at EMC World
     Session | Presenter | Dates/Times
     • The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
     • Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
     • Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
     • Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
     • Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
     • Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
     • Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005
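
The “Top Ten” structure and its performance note ([number of input splits] x K records reaching the reducer) can be sketched outside Hadoop as a small, single-process Python simulation. This is only an illustration under stated assumptions, not the book's implementation: the record layout `(user, reputation)`, the function names `map_split`, `reduce_top_k`, and `top_k`, and the synthetic data are all hypothetical, and `heapq.nlargest` stands in for the "sorted list truncated to 10" in the pseudocode.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record layout: (user name, reputation score).
Record = Tuple[str, int]

K = 10  # number of top records to keep


def map_split(split: Iterable[Record], k: int = K) -> List[Record]:
    """Mapper side: scan one input split and keep only its local top k."""
    return heapq.nlargest(k, split, key=lambda rec: rec[1])


def reduce_top_k(mapper_outputs: Iterable[List[Record]], k: int = K) -> List[Record]:
    """Single reducer: merge all local top-k lists into the global top k."""
    candidates = [rec for output in mapper_outputs for rec in output]
    return heapq.nlargest(k, candidates, key=lambda rec: rec[1])


def top_k(splits: List[List[Record]], k: int = K) -> Tuple[List[Record], int]:
    """Run the simulated job; also report how many records reach the reducer."""
    mapper_outputs = [map_split(split, k) for split in splits]
    reducer_input_size = sum(len(output) for output in mapper_outputs)
    return reduce_top_k(mapper_outputs, k), reducer_input_size


# Synthetic data: 4 input splits of 50 records each, with distinct scores.
splits = [
    [(f"user{s}_{i}", s * 1000 + i) for i in range(50)]
    for s in range(4)
]
top, reducer_input_size = top_k(splits)

for user, reputation in top:
    print(user, reputation)
print("records sent to reducer:", reducer_input_size)
```

With 4 splits and K = 10, the reducer sees at most 40 records regardless of how many records each split holds, which is why the pattern is map-heavy and light on network usage.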