MapReduce Design Patterns

This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.

  • Quick overview of the book. This talk is not to sell you on the book; it's to sell you on why MRDPs are important for the community. It is an intermediate to advanced MapReduce resource. Early beginners and experts alike can find some use in it. Some knowledge of Hadoop is encouraged. Tom White’s Hadoop: The Definitive Guide is a good start.
  • Story about explaining joins. Teaching Hadoop classes and explaining how to solve problems was challenging. Most of the stuff in this book is not novel; it has been collected from different sources.
  • Spend time talking about each and what purpose they have. Remember to mention that examples are in Hadoop.
  • Just briefly outline

MapReduce Design Patterns Presentation Transcript

  • 1 © Copyright 2012 EMC Corporation. All rights reserved.
    MapReduce Design Patterns
    Donald Miner, Greenplum Hadoop Solutions Architect
    @donaldpminer
  • 2 Book was made available December 2012
  • 3 Inspiration for my book
  • 4 What are design patterns? (in general)
    Reusable solutions to problems
    Domain independent
    Not a cookbook, but not a guide
    Not a finished solution
  • 5 Why design patterns? (in general)
    Makes the intent of code easier to understand
    Provides a common language for solutions
    Be able to reuse code
    Known performance profiles and limitations of solutions
  • 6 Why MapReduce design patterns?
    Recurring patterns in data-related problem solving
    Groups are building patterns independently
    Lots of new users every day
    MapReduce is a new way of thinking
    Foundation for higher-level tools (Pig, Hive, …)
    Community is reaching the right level of maturity
  • 7 Pattern Template
    Intent
    Motivation
    Applicability
    Structure
    Consequences
    Resemblances
    Performance analysis
    Examples
  • 8 Pattern Categories
    Summarization
    Filtering
    Data Organization
    Joins
    Metapatterns
    Input and output
  • 9 Filtering patterns: extract interesting subsets ("I only want some of my data!")
    Filtering
    Bloom filtering
    Top ten
    Distinct
    Summarization patterns: top-down summaries ("I only want a top-level view of my data!")
    Numerical summarizations
    Inverted index
    Counting with counters
  • 10 Data organization patterns: reorganize, restructure ("I want to change the way my data is organized!")
    Structured to hierarchical
    Partitioning
    Binning
    Total order sorting
    Shuffling
    Join patterns: bringing data sets together ("I want to mash my different data sources together!")
    Reduce-side join
    Replicated join
    Composite join
    Cartesian product
  • 11 Metapatterns: patterns of patterns ("I want to solve a complex problem with multiple patterns!")
    Job chaining
    Chain folding
    Job merging
    Input and output patterns: custom input and output ("I want to get data or put data in an unusual place!")
    Generating data
    External source output
    External source input
    Partition pruning
  • 12 Pattern: “Top Ten” (filtering)
    Intent
      Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
    Motivation
      Finding outliers
      Top ten lists are fun
      Building dashboards
      Sorting/Limit isn’t going to work here
  • 13 Pattern: “Top Ten”
    Applicability
      Rank-able records
      Limited number of output records
    Consequences
      The top K records are returned.
  • 14 Pattern: “Top Ten” Structure
      class mapper:
          setup():
              initialize top ten sorted list
          map(key, record):
              insert record into top ten sorted list
              if length of list is greater than 10:
                  truncate list to a length of 10
          cleanup():
              for record in top ten sorted list:
                  emit null, record
      class reducer:
          setup():
              initialize top ten sorted list
          reduce(key, records):
              sort records
              truncate records to top 10
              for record in records:
                  emit record
  • 15 Pattern: “Top Ten” Resemblances
    SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
    Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
  • 16 Pattern: “Top Ten”
    Performance analysis
      Pretty quick: map-heavy, low network usage
      Pay attention to how many records the reducer is getting: [number of input splits] x K
    Example
      Top ten StackOverflow users by reputation
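  • 17 Top Ten Mapper
      import java.io.IOException;
      import java.util.Map;
      import java.util.TreeMap;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
          // Reputation -> record, kept sorted ascending by reputation.
          // Note: records with equal reputation overwrite each other in this map.
          private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

          @Override
          public void map(Object key, Text value, Context context) {
              Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
              String reputation = parsed.get("Reputation");
              if (reputation == null) {
                  return; // guard added here: skip records without a reputation
              }
              // Copy the value, because Hadoop reuses the Text object between calls.
              repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
              // Keep only the ten highest reputations by evicting the lowest.
              if (repToRecordMap.size() > 10) {
                  repToRecordMap.remove(repToRecordMap.firstKey());
              }
          }

          @Override
          protected void cleanup(Context context) throws IOException, InterruptedException {
              // Emit this mapper's local top ten once all input has been seen.
              for (Text t : repToRecordMap.values()) {
                  context.write(NullWritable.get(), t);
              }
          }
      }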
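  • 18 Top Ten Reducer
      import java.io.IOException;
      import java.util.Map;
      import java.util.TreeMap;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
          // Reputation -> record, kept sorted ascending by reputation.
          private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

          @Override
          public void reduce(NullWritable key, Iterable<Text> values, Context context)
                  throws IOException, InterruptedException {
              // Every mapper emits under the same null key, so all candidates arrive here.
              for (Text value : values) {
                  Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
                  repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
                  if (repToRecordMap.size() > 10) {
                      repToRecordMap.remove(repToRecordMap.firstKey());
                  }
              }
              // Write the final top ten from highest to lowest reputation.
              for (Text t : repToRecordMap.descendingMap().values()) {
                  context.write(NullWritable.get(), t);
              }
          }
      }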
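The performance note for this pattern says the reducer receives [number of input splits] x K records. The following is a minimal plain-Java sketch of that two-phase flow, not the book's code: it runs without Hadoop, treats each inner list as one input split, and uses integers in place of user records. The class and method names (`TopKSketch`, `topK`, `topKAcrossSplits`) are made up for illustration. It also swaps the `TreeMap` for a bounded min-heap, which is one possible way to keep records that share the same score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopKSketch {

    // Returns the k largest values, highest first, using a bounded min-heap.
    // Unlike a map keyed by score, the heap keeps duplicate scores.
    static List<Integer> topK(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }

    // Each split plays the role of one mapper; a single merge step plays the reducer.
    static List<Integer> topKAcrossSplits(List<List<Integer>> splits, int k) {
        List<Integer> reducerInput = new ArrayList<>();
        for (List<Integer> split : splits) {
            reducerInput.addAll(topK(split, k)); // local top k per split
        }
        // At this point reducerInput holds at most splits.size() * k values.
        return topK(reducerInput, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 1, 9, 3),
                Arrays.asList(7, 7, 2),
                Arrays.asList(8, 4, 6, 0));
        System.out.println(topKAcrossSplits(splits, 3)); // prints [9, 8, 7]
    }
}
```

With three splits and K = 3, the merge step sees at most nine candidates regardless of how many values each split holds, which is why the pattern stays cheap on the network.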
  • 19 Pattern: “Bloom Filtering” (filtering)
    Intent
      Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate.
    Motivation
      Similar to normal Boolean filtering, but we are filtering on set membership
      Set membership is evaluated with a Bloom filter
  • 20 Pattern: “Bloom Filtering”
    Applicability
      A feature can be extracted and tested for set membership
      Predetermined set is available
      Some false positives are acceptable
    Consequences
      Records that pass the Bloom filter membership test are returned
    Known Uses
      Keep all records in a watch list (and a few records that aren’t)
      Pre-filtering records before an expensive membership test
  • 21 Pattern: “Bloom Filtering”
    Structure
      class mapper:
          setup():
              load bloom filter into memory
          map(key, record):
              if record in bloom filter:
                  emit (record, null)
    Resemblances
      UDFs?
  • 22 Pattern: “Bloom Filtering”
    Performance analysis
      Map-only
      Slight overhead in moving Bloom filter into memory
      Bloom filter membership tests are constant time
    Example
      Filter out StackOverflow comments that do not contain a keyword
      Distributed HBase query using a Bloom filter
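The Bloom filtering structure above can be sketched in plain Java. This is an illustrative sketch under stated assumptions, not the book's implementation: it runs without Hadoop, the class `BloomFilterSketch` and its methods are hypothetical names, and the two hash functions (`String.hashCode` and a simple FNV-1a) are chosen only for brevity. The `keepMembers` method stands in for the map-only job: the filter is built once (the "setup" step), and each record is kept only if it passes the membership test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Simple FNV-1a hash over the characters; used as the second base hash.
    private static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x01000193;
        }
        return hash;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String value, int i) {
        return Math.floorMod(value.hashCode() + i * fnv1a(value), numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // False means definitely absent; true means present or a false positive.
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for the map-only job: keep records that pass the membership test.
    static List<String> keepMembers(List<String> records, BloomFilterSketch filter) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (filter.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        System.out.println(keepMembers(Arrays.asList("alice", "carol", "bob"), watchList));
    }
}
```

Every member of the watch list is always kept, because a Bloom filter has no false negatives. A non-member such as "carol" is usually dropped but may occasionally pass, which is the inaccuracy the pattern's intent accepts. The membership test touches a fixed number of bits, which matches the constant-time note in the performance analysis.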
  • 23 Candidate new patterns
    Link Graph processing patterns (new category)
      – Shortest path, diameter, graph stats, connected components, etc.
      – Too domain specific?
      – Has its own distinct patterns
    Projection (filtering)
      – Remove “columns” of data
    Transformation (data organization?)
      – Take a data set but transform it into something else
  • 24 Future and call to action
    Contributing your own patterns
    Trends in the nature of data
      – Images, audio, video, biomedical, social …
    Libraries, abstractions, and tools
    Ecosystem patterns: YARN, HBase, ZooKeeper, …