Hadoop Design Patterns
 


For users of Hadoop, MapReduce is new territory. MapReduce design patterns document the knowledge and lessons learned by seasoned Hadoop developers so that new developers can leverage the experts’ experience in solving problems. This talk outlines a few of the most popular patterns and gives an overview of the rest.


After this session you will be able to:
Objective 1: Understand what kinds of problems are solvable by Hadoop and MapReduce.
Objective 2: Understand why Hadoop engineers need to know what MapReduce design patterns are and what they are useful for day-to-day.
Objective 3: Begin to understand how to summarize, reorganize, and search through your data with Hadoop and MapReduce.


Hadoop Design Patterns: Presentation Transcript

  • Hadoop Design Patterns Donald Miner @donaldpminer Donald.Miner@emc.com © Copyright 2013 EMC Corporation. All rights reserved. 1
  • The book was made available in December 2012. Written by Donald Miner and Adam Shook, Hadoop Architects at Greenplum.
  • What Are Design Patterns? (in general)
    • Reusable solution frameworks to problems
    • Domain independent
    • Not a cookbook, but not a guide
    • Not a finished solution
  • Why Design Patterns? (in general)
    • Make the intent of code easier to understand
    • Provide a common language for solutions
    • Make code reusable
    • Come with known performance profiles and limitations
  • Why MapReduce Design Patterns?
    • Recurring patterns in data-related problem solving
    • Groups are building patterns independently
    • Lots of new users every day
    • MapReduce is a new way of thinking
    • Foundation for higher-level tools (Pig, Hive, …)
    • Community is reaching the right level of maturity
  • Pattern Template
    Each pattern follows a standard template:
    • Intent
    • Motivation
    • Applicability
    • Structure
    • Consequences
    • Resemblances
    • Performance analysis
    • Examples
  • Pattern Categories
    • Summarization
    • Filtering
    • Data organization
    • Joins
    • Metapatterns
    • Input and output
  • Filtering Patterns: Extract interesting subsets
    Keep only a subset of the data.
    • Filtering – Removes records of data based on a condition
    • Bloom filtering – Removes records of data based on a Bloom filter membership test
    • Top ten – Returns the top K records, given a ranking criterion
    • Distinct – Removes duplicates from a data set
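
As a rough illustration of two of the filtering patterns above, the sketch below simulates plain filtering (map-only) and distinct (map plus reduce) in ordinary Python. It does not use the Hadoop API; the function names, the in-memory "shuffle", and the sample log levels are all assumptions made for this example.

```python
from collections import defaultdict

def filter_mapper(record, predicate):
    """Map-only filtering: emit the record only if it passes the predicate."""
    if predicate(record):
        yield record

def distinct_mapper(record):
    """Distinct: emit the record itself as the key, with a null value."""
    yield record, None

def distinct_reducer(key, values):
    """Every duplicate shares one key, so emitting the key once removes duplicates."""
    yield key

def run_distinct(records):
    # Simulated shuffle (an assumption of this sketch): group mapper output by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in distinct_mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in distinct_reducer(key, groups[key])]

# Hypothetical input: log levels with duplicates.
records = ["error", "info", "error", "warn", "info"]
kept = [out for rec in records for out in filter_mapper(rec, lambda r: r != "info")]
unique = run_distinct(records)
```

Filtering needs no reducer because each record is judged on its own, whereas distinct relies on the shuffle to bring identical records to the same reducer call.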
  • Summarization Patterns: Top-down summaries
    Give a top-level view of the data.
    • Numerical summarizations – Perform numerical calculations on groups of data
    • Inverted index – Build a lookup table
    • Counting with counters – Count the occurrences of particular things
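
A numerical summarization can be sketched as follows, assuming the goal is a minimum, maximum, and count per group. This is a plain-Python simulation rather than Hadoop code; the record layout `(user, value)` and all function names are hypothetical.

```python
from collections import defaultdict

def summarize_mapper(user, value):
    # Emit a (min, max, count) triple so partial results can be merged in any order.
    yield user, (value, value, 1)

def summarize_reducer(user, partials):
    # Merge the partial triples for one group into a single summary.
    lowest = min(p[0] for p in partials)
    highest = max(p[1] for p in partials)
    count = sum(p[2] for p in partials)
    yield user, (lowest, highest, count)

def run_summary(records):
    # Simulated shuffle (an assumption of this sketch): group values by key.
    groups = defaultdict(list)
    for user, value in records:
        for key, triple in summarize_mapper(user, value):
            groups[key].append(triple)
    return dict(out for key in groups for out in summarize_reducer(key, groups[key]))

# Hypothetical input: (user, value) pairs.
records = [("alice", 3), ("bob", 7), ("alice", 9), ("bob", 1), ("alice", 5)]
summary = run_summary(records)
```

Because min, max, and sum are associative, the same merge logic could in principle also serve as a combiner to cut the data sent to reducers.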
  • Data Organization Patterns: Reorganize, restructure
    Change the way the data is organized.
    • Structured to hierarchical – Denormalize data into documents
    • Partitioning – Place data into partitions based on a hash key
    • Binning – Place each record into zero or more bins
    • Total order sorting – Sort the data set in ascending or descending order
    • Shuffling – Completely randomize the order of the data
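
The difference between partitioning and binning can be sketched in a few lines of Python. This is an illustration only: the use of CRC32 as the hash, the `tags` field, and the function names are assumptions, not something the slides specify.

```python
import zlib

def partition(key, num_partitions):
    # Partitioning: a deterministic hash sends each key to exactly one partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def bins_for(record, bin_tags):
    # Binning: a record may land in zero, one, or several bins.
    return [tag for tag in bin_tags if tag in record["tags"]]

# Hypothetical records carrying a list of tags.
tagged = {"id": 1, "tags": ["hadoop", "pig"]}
untagged = {"id": 2, "tags": []}
bin_tags = ["hadoop", "pig", "hive"]

p = partition("user42", 4)
tagged_bins = bins_for(tagged, bin_tags)
untagged_bins = bins_for(untagged, bin_tags)
```

Partitioning always assigns exactly one destination per record, while binning is a membership test per bin, which is why a record can appear in several bins or in none.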
  • Join Patterns: Bringing data sets together
    Take several data sets and bring them together into one.
    • Reduce-side join – General-purpose join
    • Replicated join – Replicates the smaller data set everywhere before the join
    • Composite join – Joins if the data sets are sorted and partitioned in the same way
    • Cartesian product – Match up every record to every other record
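
To make the contrast between a reduce-side join and a replicated join concrete, here is a minimal plain-Python sketch of an inner join done both ways. It is not Hadoop code; the `users` and `comments` data sets, the "U"/"C" source tags, and the function names are assumptions for illustration.

```python
from collections import defaultdict

def reduce_side_join(users, comments):
    # Map phase: tag each record with its source and key it by the join key.
    groups = defaultdict(lambda: {"U": [], "C": []})
    for user_id, name in users:
        groups[user_id]["U"].append(name)
    for user_id, text in comments:
        groups[user_id]["C"].append(text)
    # Reduce phase: pair up records from both sources that share a key.
    joined = []
    for user_id in sorted(groups):
        for name in groups[user_id]["U"]:
            for text in groups[user_id]["C"]:
                joined.append((user_id, name, text))
    return joined

def replicated_join(users, comments):
    # The small data set is loaded into memory by every mapper; no shuffle is needed.
    lookup = dict(users)
    return [(uid, lookup[uid], text) for uid, text in comments if uid in lookup]

# Hypothetical inputs: a small users table and a larger comments table.
users = [(1, "ann"), (2, "bob")]
comments = [(1, "hi"), (2, "yo"), (1, "bye"), (3, "orphan")]
by_reduce = reduce_side_join(users, comments)
by_replication = replicated_join(users, comments)
```

Both approaches yield the same joined rows here; the replicated variant avoids the shuffle but only works when one side fits in memory.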
  • Input and Output Patterns: Custom input and output
    Perform custom behavior for input or output.
    • Generating data – Generate data from nothing
    • External source output – Send data to an external source
    • External source input – Pull data from an external source
    • Partition pruning – Remove chunks of data because we know some parts are not useful
  • Example Pattern: “Top Ten” (filtering)
    Intent: Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
    Motivation:
    • Finding outliers
    • Top ten lists are fun
    • Building dashboards
    • Sorting/Limit isn’t going to work here
  • Example Pattern: “Top Ten”
    Applicability:
    • Rank-able records
    • Limited number of output records
    Consequences:
    • The top K records are returned.
  • Example Pattern: “Top Ten”
    Structure:
        class mapper:
            setup():
                initialize top ten sorted list
            map(key, record):
                insert record into top ten sorted list
                if length of list is greater than 10:
                    truncate list to a length of 10
            cleanup():
                for record in top ten sorted list:
                    emit null, record

        class reducer:
            setup():
                initialize top ten sorted list
            reduce(key, records):
                sort records
                truncate records to top 10
                for record in records:
                    emit record
  • Example Pattern: “Top Ten”
    Resemblances:
    SQL:
        SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
    Pig:
        B = ORDER A BY col4 DESC;
        C = LIMIT B 10;
  • Example Pattern: “Top Ten”
    Performance analysis:
    • Pretty quick: map-heavy, low network usage
    • Pay attention to how many records the reducer is getting: [number of input splits] x K
    Example:
    • Top ten StackOverflow users by reputation
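
The Top Ten structure and its performance note can be sketched in plain Python as below. This is a simulation, not the Hadoop API: each "mapper" processes one input split and emits only its local top K, and a single "reducer" picks the global top K from those candidates. The `(user, reputation)` records, the four splits, and the function names are made-up assumptions, and `heapq.nlargest` stands in for the sorted list in the pseudocode.

```python
import heapq

K = 10

def top_k_mapper(split, k=K):
    # Each mapper keeps only its local top k and emits them at cleanup time.
    return heapq.nlargest(k, split, key=lambda rec: rec[1])

def top_k_reducer(candidates, k=K):
    # A single reducer sees every mapper's candidates and keeps the global top k.
    return heapq.nlargest(k, candidates, key=lambda rec: rec[1])

# Hypothetical data: (user, reputation) pairs spread over 4 input splits of 50 records.
splits = [[(f"user{s}_{i}", s * 1000 + i) for i in range(50)] for s in range(4)]

candidates = [rec for split in splits for rec in top_k_mapper(split)]
top_ten = top_k_reducer(candidates)
```

With 4 splits and K = 10, the reducer receives 40 candidate records rather than all 200, which is the "[number of input splits] x K" figure from the performance analysis.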
  • Pivotal Sessions at EMC World (Session; Presenter; Dates/Times)
    • The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications; Josh Klahr; Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
    • Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action; Noelle Sio; Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
    • Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench; Clinton Ooi, Bhavin Modi; Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
    • Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights; SK Krishnamurthy; Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
    • Pivotal: Big & Fast data – merging real-time data and deep analytics; Michael Crutcher; Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
    • Pivotal: Virtualize Big Data to Make The Elephant Dance; June Yang, Dan Baskette; Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
    • Hadoop Design Patterns; Don Miner; Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005