MapReduce Design Patterns
This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.

Published in: Technology, Business
  • Quick overview of the book. This talk is not to sell you on the book; it's to sell you on why MRDPs (MapReduce design patterns) are important for the community. Intermediate to advanced MapReduce resource. Early beginners and experts alike can find some use in it. Some knowledge of Hadoop is encouraged; Tom White’s Hadoop: The Definitive Guide is a good start.
  • Story about explaining joins. Teaching Hadoop classes and explaining how to solve problems was challenging. Most of the stuff in this book is not novel; it’s been collected through different sources.
  • Spend time talking about each and what purpose they have. Remember to mention that examples are in Hadoop.
  • Just briefly outline
  • Transcript of "MapReduce Design Patterns"

    1. MapReduce Design Patterns
       Donald Miner, Greenplum Hadoop Solutions Architect, @donaldpminer
       © Copyright 2012 EMC Corporation. All rights reserved.
    2. Book was made available December 2012
    3. Inspiration for my book
    4. What are design patterns? (in general)
         • Reusable solutions to problems
         • Domain independent
         • Not a cookbook, but not a guide
         • Not a finished solution
    5. Why design patterns? (in general)
         • Makes the intent of code easier to understand
         • Provides a common language for solutions
         • Be able to reuse code
         • Known performance profiles and limitations of solutions
    6. Why MapReduce design patterns?
         • Recurring patterns in data-related problem solving
         • Groups are building patterns independently
         • Lots of new users every day
         • MapReduce is a new way of thinking
         • Foundation for higher-level tools (Pig, Hive, …)
         • Community is reaching the right level of maturity
    7. Pattern Template
         • Intent
         • Motivation
         • Applicability
         • Structure
         • Consequences
         • Resemblances
         • Performance analysis
         • Examples
    8. Pattern Categories
         • Summarization
         • Filtering
         • Data Organization
         • Joins
         • Metapatterns
         • Input and output
    9. Filtering patterns: extract interesting subsets ("I only want some of my data!")
         • Filtering
         • Bloom filtering
         • Top ten
         • Distinct
       Summarization patterns: top-down summaries ("I only want a top-level view of my data!")
         • Numerical summarizations
         • Inverted index
         • Counting with counters
    10. Data organization patterns: reorganize, restructure ("I want to change the way my data is organized!")
         • Structured to hierarchical
         • Partitioning
         • Binning
         • Total order sorting
         • Shuffling
       Join patterns: bringing data sets together ("I want to mash my different data sources together!")
         • Reduce-side join
         • Replicated join
         • Composite join
         • Cartesian product
    11. Metapatterns: patterns of patterns ("I want to solve a complex problem with multiple patterns!")
         • Job chaining
         • Chain folding
         • Job merging
       Input and output patterns: custom input and output ("I want to get data or put data in an unusual place!")
         • Generating data
         • External source output
         • External source input
         • Partition pruning
    12. Pattern: “Top Ten” (filtering)
       Intent
         Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
       Motivation
         • Finding outliers
         • Top ten lists are fun
         • Building dashboards
         • Sorting/Limit isn’t going to work here
    13. Pattern: “Top Ten”
       Applicability
         • Rank-able records
         • Limited number of output records
       Consequences
         • The top K records are returned.
    14. Pattern: “Top Ten”
       Structure
         class mapper:
           setup():
             initialize top ten sorted list
           map(key, record):
             insert record into top ten sorted list
             if length of list is greater than 10:
               truncate list to a length of 10
           cleanup():
             for record in top ten sorted list:
               emit null, record
         class reducer:
           setup():
             initialize top ten sorted list
           reduce(key, records):
             sort records
             truncate records to top 10
             for record in records:
               emit null, record
    15. Pattern: “Top Ten”
       Resemblances
         SQL:
           SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
         Pig:
           B = ORDER A BY col4 DESC;
           C = LIMIT B 10;
    16. Pattern: “Top Ten”
       Performance analysis
         • Pretty quick: map-heavy, low network usage
         • Pay attention to how many records the reducer is getting: [number of input splits] x K
       Example
         • Top ten StackOverflow users by reputation
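       The "[number of input splits] x K" figure can be illustrated without Hadoop. The following is a minimal plain-Java sketch, not the book's code: the class and method names (`TopKSketch`, `mapPhase`, `reducePhase`) are made up for illustration, and each inner list stands in for one input split. Every "mapper" keeps only its local top K, so the single "reducer" sees at most splits x K candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical names throughout; this simulates the pattern in memory.
class TopKSketch {
    // Keep the K largest values using a min-heap bounded at K entries.
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // "Map" side: each split contributes only its local top K.
    static List<Integer> mapPhase(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topK(split, k));
        }
        return candidates; // size is at most splits.size() * k
    }

    // "Reduce" side: a single reducer picks the global top K from the candidates.
    static List<Integer> reducePhase(List<Integer> candidates, int k) {
        return topK(candidates, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 42, 7, 19),
                Arrays.asList(88, 3, 61),
                Arrays.asList(12, 90, 90, 1));
        List<Integer> candidates = mapPhase(splits, 3);
        System.out.println(candidates.size());          // 9 candidates reach the reducer
        System.out.println(reducePhase(candidates, 3)); // [90, 90, 88]
    }
}
```

       With many splits or a large K, the candidate list is what the single reducer has to hold and sort, which is why the slide says to watch that number.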
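    17. Top Ten Mapper

       import java.io.IOException;
       import java.util.Map;
       import java.util.TreeMap;
       import org.apache.hadoop.io.NullWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;

       public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
           // Reputation -> record, sorted ascending so the lowest entry is easy to evict
           private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

           @Override
           public void map(Object key, Text value, Context context)
                   throws IOException, InterruptedException {
               // Parse one XML row into attribute/value pairs
               Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
               String reputation = parsed.get("Reputation");
               // Copy the Text, because Hadoop reuses the value object between calls
               repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
               // Keep only the ten highest reputations seen by this mapper
               if (repToRecordMap.size() > 10) {
                   repToRecordMap.remove(repToRecordMap.firstKey());
               }
           }

           @Override
           protected void cleanup(Context context) throws IOException, InterruptedException {
               // Emit this mapper's local top ten under a single null key
               for (Text t : repToRecordMap.values()) {
                   context.write(NullWritable.get(), t);
               }
           }
       }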
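    18. Top Ten Reducer

       import java.io.IOException;
       import java.util.Map;
       import java.util.TreeMap;
       import org.apache.hadoop.io.NullWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Reducer;

       public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
           // Reputation -> record, sorted ascending so the lowest entry is easy to evict
           private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

           @Override
           public void reduce(NullWritable key, Iterable<Text> values, Context context)
                   throws IOException, InterruptedException {
               // All mappers emit under the same null key, so this runs once over every candidate
               for (Text value : values) {
                   Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
                   repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
                   if (repToRecordMap.size() > 10) {
                       repToRecordMap.remove(repToRecordMap.firstKey());
                   }
               }
               // Write the global top ten, highest reputation first
               for (Text t : repToRecordMap.descendingMap().values()) {
                   context.write(NullWritable.get(), t);
               }
           }
       }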
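       One property of the TreeMap-based code is worth knowing: the map is keyed by reputation, so two records with the same reputation share a key and the later `put` replaces the earlier record. The plain-Java sketch below contrasts that behavior with a bounded min-heap, which keeps ties. It is one possible alternative under stated assumptions, not the book's approach; `TieSafeTopK`, `User`, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical names; illustrates tie handling only.
class TieSafeTopK {
    static final class User {
        final String id;
        final int reputation;

        User(String id, int reputation) {
            this.id = id;
            this.reputation = reputation;
        }
    }

    // Same structure as the slide code: keyed by reputation, so equal keys overwrite.
    static List<User> treeMapTopK(List<User> users, int k) {
        TreeMap<Integer, User> top = new TreeMap<>();
        for (User u : users) {
            top.put(u.reputation, u);
            if (top.size() > k) {
                top.remove(top.firstKey());
            }
        }
        return new ArrayList<>(top.descendingMap().values());
    }

    // Alternative: a min-heap bounded at K keeps records with equal reputation.
    static List<User> heapTopK(List<User> users, int k) {
        Comparator<User> byReputation = Comparator.comparingInt((User u) -> u.reputation);
        PriorityQueue<User> heap = new PriorityQueue<>(byReputation);
        for (User u : users) {
            heap.add(u);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest reputation
            }
        }
        List<User> result = new ArrayList<>(heap);
        result.sort(byReputation.reversed());
        return result;
    }

    static List<Integer> reputations(List<User> users) {
        List<Integer> out = new ArrayList<>();
        for (User u : users) {
            out.add(u.reputation);
        }
        return out;
    }

    public static void main(String[] args) {
        List<User> users = new ArrayList<>();
        users.add(new User("a", 100));
        users.add(new User("b", 100));
        users.add(new User("c", 50));
        System.out.println(reputations(treeMapTopK(users, 2))); // [100, 50]: user "a" was overwritten
        System.out.println(reputations(heapTopK(users, 2)));    // [100, 100]
    }
}
```

       Whether ties matter depends on the data; for a dashboard of top users, silently losing one of two equally ranked records may or may not be acceptable.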
    19. Pattern: “Bloom Filtering” (filtering)
       Intent
         Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.
       Motivation
         • Similar to normal Boolean filtering, but we are filtering on set membership
         • Set membership is evaluated with a Bloom filter
    20. Pattern: “Bloom Filtering”
       Applicability
         • A feature can be extracted and tested for set membership
         • Predetermined set is available
         • Some false positives are acceptable
       Consequences
         • Records that pass the Bloom filter membership test are returned
       Known Uses
         • Keep all records in a watch list (and a few records that aren’t)
         • Pre-filtering records before an expensive membership test
    21. Pattern: “Bloom Filtering”
       Structure
         class mapper:
           setup():
             load bloom filter into memory
           map(key, record):
             if record in bloom filter:
               emit (record, null)
       Resemblances
         • UDFs?
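       The structure above can be sketched in plain Java. This is a toy Bloom filter for illustration, not the book's code and not a production implementation: the class `BloomFilterSketch`, its sizing parameters, the double-hashing scheme built on `String.hashCode()`, and the "user:comment" record layout are all assumptions. A real job would more likely load a prebuilt filter from a library implementation in `setup()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical toy implementation; sizes and hash scheme are illustrative assumptions.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.reverse(h1) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(item, i));
        }
    }

    // false means "definitely not in the set"; true means "probably in the set".
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Map-only filtering: extract a feature from each record and keep the record if it passes.
    static List<String> filter(List<String> records, BloomFilterSketch watchList) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            String user = record.split(":", 2)[0]; // assumed "user:comment" layout
            if (watchList.mightContain(user)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        List<String> records = Arrays.asList("alice:hello", "carol:hi", "bob:hey");
        // Watch-listed users always pass; other users pass only on a false positive.
        System.out.println(filter(records, watchList));
    }
}
```

       The membership test touches a fixed number of bits regardless of how many items were added, and it can return a false positive but never a false negative, which matches the "some false positives are acceptable" condition in the pattern's applicability.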
    22. Pattern: “Bloom Filtering”
       Performance analysis
         • Map-only
         • Slight overhead in moving Bloom filter into memory
         • Bloom filter membership tests are constant time
       Example
         • Filter StackOverflow comments that do not contain a keyword
         • Distributed HBase query using a Bloom filter
    23. Candidate new patterns
         • Link graph processing patterns (new category)
             - Shortest path, diameter, graph stats, connected components, etc.
             - Too domain specific?
             - Has its own distinct patterns
         • Projection (filtering)
             - Remove “columns” of data
         • Transformation (data organization?)
             - Take a data set but transform it into something else
    24. Future and call to action
         • Contributing your own patterns
         • Trends in the nature of data
             - Images, audio, video, biomedical, social …
         • Libraries, abstractions, and tools
         • Ecosystem patterns: YARN, HBase, ZooKeeper, …