20130201 MapReduce Design Patterns



MapReduce Design Patterns explained:
* Summarization
** Numerical Summarizations
** Inverted Index Summarizations
** Counting with Counters
* Filtering
** Filtering
** Bloom Filtering
** Top Ten
** Distinct
* Data Organization
** Structured to Hierarchical
** Partitioning
** Binning
** Total Order Sorting
** Shuffling
* Join
** Reduce Side Join
** Replicated Join
** Composite Join
** Cartesian Product
* Metapatterns
** Job Chaining
** Chain Folding
** Job Merging
* Input and Output
** Generating Data
** External Source Output
** External Source Input
** Partition Pruning

Published in: Technology
  • Great job on summarizing my book :) It was neat to see someone else go through it so thoroughly and put it back into slide form. Thanks for putting the time into this!
  • Thank you for the informative summary.

    Best wishes.



  1. MapReduce Design Patterns. Will Shen, 2013/02/01
  2. Outline
      Part I: MapReduce Basics
      • Map and Reduce
      • A WordCount example
      • Open-source framework: Hadoop
      Part II: MapReduce Design Patterns
      • Summarization Patterns
      • Filtering Patterns
      • Data Organization Patterns
      • Join Patterns
      • Meta Patterns
      • Input and Output Patterns
      Reference: Donald Miner and Adam Shook, "MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems", 230 pages, O'Reilly Media; 1st edition (December 22, 2012)
  3. Part I: MapReduce Basics
      Motivation: Large Scale Data Processing
      • Process lots of data (>1 TB)
      • Want to use hundreds of CPUs
      MapReduce: Google (2005), US patent (2010)
      • Automatic parallelization and distribution
      • Fault tolerance
      • I/O scheduling
      • Status and monitoring
      Source: Google, "MapReduce: Simplified Data Processing on Large Clusters", 2005/04/06
  4. What is Map and Reduce
      Borrows from Functional Programming
      • Functional operations do not modify data structures; they create new ones
      • Stateless functional operations have no side effects, so the order of operations does not matter
        fun foo(li: int list) = sum(li) + mul(li) + length(li)
      map: (k1, v1) → [(k2, v2)]
      reduce: (k2, [v2]) → [(k3, v3)]
  5. What is MapReduce
      map: (k1, v1) → [(k2, v2)]
      reduce: (k2, [v2]) → [(k3, v3)]
  6. Parallel Execution
      Bottleneck: the reduce phase cannot start until the map phase completes
  7. Big Picture of MapReduce
      • Input Reader: divides the input into appropriately sized splits (16 to 128 MB)
      • Map: partitioning of the data (compute part of a problem across several servers)
      • Shuffle: gathers together the values returned by the map function
      • Reduce: processing of the partitions (aggregate the partial results from all servers into a single result set)
      • Output Writer: writes the output of the Reducer
  8. Example: counting words in documents
      map (String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, "1");
      reduce (String output_key, Iterator intermediate_values):
        // output_key: a word
        // output_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(output_key, AsString(result));
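The word-count pseudocode can be tried out without a cluster. The following is a minimal single-process sketch in Python, not Hadoop code: `run_job` is a hypothetical stand-in for the framework that performs the map, shuffle (group by key), and reduce steps in memory, and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """map: (k1, v1) -> [(k2, v2)]; emits (word, 1) for every word."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """reduce: (k2, [v2]) -> [(k3, v3)]; sums the counts for one word."""
    return [(word, sum(counts))]

def run_job(inputs, mapper, reducer):
    """Hypothetical in-memory stand-in for the framework."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)        # shuffle: group values by key
    output = []
    for k2 in sorted(groups):            # reducers receive keys in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output

docs = [("d1", "to be or not to be"), ("d2", "to do")]
word_counts = run_job(docs, map_fn, reduce_fn)
```

The mapper and reducer never share state, so in a real job the framework is free to run many copies of each in parallel.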
  9. Open-source framework: Apache Hadoop
      Hadoop: http://hadoop.apache.org/
      Hadoop is not only a Map/Reduce implementation!
      • HDFS: distributed file system
      • Pig: high-level query language (SQL-like)
      • HBase: distributed column store
      • Hive: Hadoop-based data warehouse
      • ZooKeeper, Chukwa, Pipes/Streaming, …
  10. How Hadoop runs a MapReduce Job
      • The client submits the MapReduce job.
      • The JobTracker coordinates the job run.
      • TaskTrackers run the tasks that the job has been split into.
      • HDFS is used for sharing job files between the other entities.
  11. WordCount Java Code in Hadoop
  12. General Considerations
      • Map execution order is not deterministic
      • Map processing time cannot be predicted
      • Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned)
      • Not suitable for continuous input streams
      • There will be a spike in network utilization after the Map phase / before the Reduce phase
      • Number and size of key/value pairs: object creation and serialisation overhead (Amdahl's law!)
      • Aggregate partial results when possible: use Combiners
  13. Using MapReduce to Solve Problems
      Map
      • Word Count: texts → (word, 1)
      • Inverted Index: documents → (word, doc_id)
      • Max Temperature: formatted data → (year, temperature)
      • Mean Rain Precipitation: daily data → (<year-month, lat, long>, precipitation)
      Reduce → applies a count, list, max, and average, respectively, to the set of values for each key.
      Reusable solutions?
  14. What is a "Design Pattern"
      Design Pattern → a general reusable solution to a commonly occurring problem within a given context in software design. (Source: GoF)
  15. Part II: MapReduce Design Patterns
      1. Summarization: get a top-level view by summarizing and grouping data
      2. Filtering: view data subsets such as records generated from one user
      3. Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
      4. Join: analyze different datasets together to discover interesting relationships
      5. Metapattern: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
      6. Input and Output: customize the way you use Hadoop to load or store data
      23 patterns in total. Each is a template for solving a common and general data manipulation problem with MapReduce.
  16. The 23 Patterns of MapReduce
      • Summarization: Numerical Summarizations, Inverted Index Summarizations, Counting with Counters
      • Filtering: Filtering, Bloom Filtering, Top Ten, Distinct
      • Data Organization: Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling
      • Join: Reduce Side Join, Replicated Join, Composite Join, Cartesian Product
      • Metapatterns: Job Chaining, Chain Folding, Job Merging
      • Input and Output: Generating Data, External Source Output, External Source Input, Partition Pruning
  17. The End
      Thanks for your attention. Any questions?
  18. Pattern Template in this Book
      • Name: a well-chosen name for the pattern
      • Intent: a quick problem description
      • Motivation: why you would want to solve this problem or where it would appear
      • Applicability: a set of criteria that must be true to be able to apply this pattern to a problem
      • Structure: the layout of the MapReduce job itself
      • Consequences: the end goal of the output this pattern produces
      • Resemblances: analogies showing how this problem would be solved with other languages, like SQL and Pig
      • Known Uses: some common use cases
      • Performance Analysis: explains the performance profile of the analytic produced by the pattern
  19. 2.1 Summarization Patterns
      Your data is large and vast, with more data coming into the system every day
      • e.g., web user logs
      • You want to produce a top-level, summarized view of the data
      • You can glean insights not available from looking at a localized set of records alone.
      Patterns
      • Numerical Summarizations
      • Inverted Index Summarizations
      • Counting with Counters
  20. Numerical Summarizations 1/4
      Intent: Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
      Motivation
      • Many data sets these days are too large for a human to get any real meaning out of them by reading through them manually, e.g., terabytes of website log files.
      • Typical aggregates: minimum, maximum, average, median, and standard deviation
      Applicability
      • You are dealing with numerical data or counting.
      • The data can be grouped by specific fields.
  21. Numerical Summarizations 2/4
      Structure
      • Mapper: outputs keys that consist of each field to group by, and values consisting of any pertinent numerical items.
      • Reducer: receives the set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and applies the aggregation function λ to them. The value of λ is output with the given input key.
  22. Numerical Summarizations 3/4
      Consequences
      • A set of part files containing a single record per reducer input group. Each record consists of the key and all aggregate values.
      Known uses
      • Word count, record count
      • Min, max, count of a particular event
      • Average, median, standard deviation
      Resemblances
      • SQL: SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*) FROM table GROUP BY groupcol2;
      • Pig: b = GROUP a BY groupcol2; c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);
  23. Numerical Summarizations 4/4
      Performance analysis
      • Aggregations perform well when the combiner is properly used.
      • Data skew of reduce groups: if there are many more intermediate key/value pairs with one specific key than with other keys, one reducer is going to have a lot more work to do than the others.
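The min/max/count aggregation works well with a combiner because partial results can be merged in any order. Below is a minimal Python sketch of that idea, assuming made-up `(user, value)` records; `merge` plays the role of both combiner and reducer, and `summarize` is a hypothetical in-memory stand-in for the framework.

```python
def map_fn(record):
    """Emit (group-by key, (min, max, count)) for one numeric observation."""
    group_key, value = record
    return group_key, (value, value, 1)

def merge(a, b):
    """Associative and commutative, so it can serve as combiner and reducer."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])

def summarize(records):
    """Hypothetical in-memory stand-in for map, shuffle, and reduce."""
    groups = {}
    for record in records:
        key, partial = map_fn(record)
        groups[key] = merge(groups[key], partial) if key in groups else partial
    return groups

records = [("alice", 3), ("bob", 7), ("alice", 9), ("alice", 1)]
summary = summarize(records)   # key -> (min, max, count)
```

A median cannot be merged this way from partial results, which is one reason not every aggregate benefits equally from a combiner.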
  24. Inverted Index Summarizations 1/4
      Intent: Generate an index from a data set to allow for faster searches, by storing a mapping from content to its locations.
  25. Inverted Index Summarizations 2/4
      Motivation
      • To index large data sets on keywords, so that searches can trace terms back to records that contain specific values.
      • To improve the search performance of a search engine.
      Applicability
      • You require quick query responses.
      • The results of such a query can be preprocessed and ingested into a database.
  26. Inverted Index Summarizations 3/4
      Structure
  27. Inverted Index Summarizations 4/4
      Consequences
      • "field value" -> [unique IDs of records]
      Performance analysis
      • Parsing the content in the mapper is the most computationally expensive part.
      • The cardinality of the index keys: more keys allow more reducers and therefore more parallelism.
      • The number of content identifiers per key: for a very common key such as "the", a few reducers will take much longer than the others, which may require a custom partitioner.
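As an illustration of the "field value -> [unique IDs of records]" output, here is a small Python sketch, assuming whitespace-tokenized documents keyed by a made-up integer ID. `build_index` is a hypothetical in-memory stand-in for the shuffle and reduce steps.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (term, doc_id) once per distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def build_index(docs):
    """Group document IDs by term, as the reducers would."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term, emitted_id in map_fn(doc_id, text):
            index[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick fox", 2: "the lazy dog", 3: "quick quick dog"}
index = build_index(docs)
```

Even in this tiny example the term "the" appears in most documents, which hints at why very common keys create reducer hot spots.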
  28. Counting with Counters 1/3
      Intent
      • An efficient means to retrieve count summarizations of large data sets.
      Motivation
      • A count or summation can tell you a lot about your data as a whole.
      • Simply use the framework's counters: no reduce phase and no summation are needed.
      Applicability
      • You want to gather counts or summations over large data sets.
      • The number of counters you are going to create is small.
  29. Counting with Counters 2/3
      Structure
      • Mapper: processes one input record at a time and increments counters based on certain criteria.
      • Counter: (a) incremented by one if counting a single instance; (b) incremented by some number if executing a summation.
  30. Counting with Counters 3/3
      Consequences
      • The final output is a set of counters grabbed from the job framework (there is no actual output).
      Known uses
      • Count the number of records (over a given time period)
      • Count a small number of unique instances
      • Sum fields of data together
      Performance analysis
      • Using counters is very fast, as data is simply read in through the mapper and no output is written.
      • Performance depends largely on the number of map tasks being executed and how much time it takes to process each record.
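The counter mechanism can be imitated in plain Python with `collections.Counter`. This is only a sketch of the idea, not the Hadoop counter API: the counter names, the `state` field, and `run_map_tasks` are assumptions made for illustration.

```python
from collections import Counter

def map_fn(record, counters):
    """Map-only: increment named counters and emit no output."""
    counters["RECORDS"] += 1
    counters["STATE_" + record["state"]] += 1

def run_map_tasks(splits):
    """Hypothetical framework loop: one task per split, totals merged at the end."""
    job_counters = Counter()
    for split in splits:
        task_counters = Counter()
        for record in split:
            map_fn(record, task_counters)
        job_counters.update(task_counters)   # adds the per-task counts
    return job_counters

splits = [[{"state": "CA"}, {"state": "NY"}], [{"state": "CA"}]]
job_counters = run_map_tasks(splits)
```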
  31. 2.2 Filtering Patterns
      • To understand a smaller piece of data
      • Find a subset of data: a top-ten listing, the results of a de-duplication
      • Sampling
      Filtering Patterns:
      • Filtering
      • Bloom Filtering
      • Top Ten
      • Distinct
  32. Filtering 1/4
      Intent
      • Filter out records that are not of interest.
      Motivation
      • Your data set is large and you want to take a subset of this data to focus in on it and perhaps do follow-on analysis.
      Applicability
      • The data can be parsed into "records" that can be categorized through some well-specified criterion determining whether they are to be kept.
  33. Filtering 2/4
      Structure
      • No "Reducer"
      map(key, record):
        if we want to keep record then
          emit key, value
  34. Filtering 3/4
      Consequences
      • A subset of the records that pass the selection criteria.
      • If the format was kept the same, any job that ran over the larger data set should be able to run over this filtered data set as well.
      Known uses
      • Closer view of data
      • Tracking a thread of events
      • Distributed grep
      • Data cleansing
      • Simple random sampling
      • Removing low-scoring data (if you can score your data)
  35. Filtering 4/4
      Resemblances
      • SQL: SELECT * FROM table WHERE value < 3
      • Pig: b = FILTER a BY value < 3;
      Performance analysis
      • No reducers.
      • Data never has to be transmitted between the map and reduce phases.
      • Most of the map tasks pull data off their locally attached disks and then write back out to that node.
      • Both the sort phase and the reduce phase are cut out.
  36. Bloom Filtering 1/5
      Intent
      • Filter such that we keep records that are members of some predefined set of values (hot values).
      Motivation
      • To filter records based on a set membership operation against the hot values.
      • The set membership is evaluated with a Bloom filter.
      • Example (figure): a Bloom filter with m = 18 bits and k = 3 hash functions; w is not in the set {x, y, z}.
  37. Bloom Filtering 2/5
      Applicability
      • Data can be separated into records, as in filtering.
      • A feature can be extracted from each record that could be in a set of hot values.
      • There is a predetermined set of items for the hot values.
      • Some false positives are acceptable (i.e., some records will get through when they should not have).
  38. Bloom Filtering 3/5
      Structure: training + actual filtering
  39. Bloom Filtering 4/5
      Consequences
      • A subset of the records that passed the Bloom filter membership test.
      • False-positive records exist in the output.
      Known uses
      • Removing most of the non-watched values
      • Prefiltering a data set for an expensive set membership check
  40. Bloom Filtering 5/5
      Performance analysis
      • Loading the Bloom filter is not that expensive since the file is relatively small.
      • Checking a value against the Bloom filter is also a relatively cheap operation: O(1) hashing.
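The two steps (training the filter on the hot values, then filtering records in the mapper) can be sketched in Python as follows. This is a toy implementation under stated assumptions, not Hadoop's Bloom filter class: the hash positions are derived from SHA-256 with a seed prefix, the sizes m = 18 and k = 3 mirror the slide's example, and the `user` field is made up.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with m bits and k hash functions."""

    def __init__(self, m=18, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Assumption: k hash functions simulated by seeding SHA-256.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

# Training step: load the hot values into the filter.
hot = BloomFilter()
for value in ("x", "y", "z"):
    hot.add(value)

def map_fn(record):
    """Filtering step: keep the record only if its feature may be a hot value."""
    return record if hot.might_contain(record["user"]) else None
```

A Bloom filter never produces false negatives, so every record whose feature really is a hot value passes; with only 18 bits, however, false positives are likely, which is why real filters are sized from the expected number of members and the acceptable error rate.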
  41. Top Ten 1/4
      Intent
      • Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
      Motivation
      • Finding the records that are typically the most interesting
      • Finding the best records for a specific criterion
      Applicability
      • You are able to compare one record to another to determine which is "larger".
      • The number of output records should be significantly fewer than the number of input records; otherwise a total ordering of the data set makes more sense.
  42. Top Ten 2/4
      Structure
      • Mapper: finds its local top K
      • (Only one) Reducer: K*M records → the final top K
  43. Top Ten 3/4
      Consequences
      • The top K records are returned.
      Known uses
      • Outlier analysis
      • Selecting interesting data (the most valuable data)
      • Catchy dashboards
      Resemblances
      • SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10
      • Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
  44. Top Ten 4/4
      Performance analysis: one single Reducer
      • How many records (K*M) is the reducer getting?
      • The sort can become an expensive operation when it has too many records and has to do most of the sorting on local disk instead of in memory.
      • The reducer host will receive a lot of data over the network → a network resource hot spot.
      • Naturally, scanning through all the data in the reducer will take a long time if there are many records to look through.
      • Any sort of memory growth in the reducer has the possibility of blowing through the Java virtual machine's memory.
      • Writes to the output file are not parallelized.
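The structure (each mapper keeps a local top K, a single reducer merges at most K*M records) can be sketched in Python with `heapq.nlargest`. The record layout, the value K = 3, and the two input splits are assumptions chosen to keep the example small.

```python
import heapq

K = 3

def map_task(split):
    """Each map task emits only its local top K records."""
    return heapq.nlargest(K, split, key=lambda r: r["score"])

def reduce_task(local_tops):
    """The single reducer sees at most K*M records and picks the final top K."""
    merged = [record for top in local_tops for record in top]
    return heapq.nlargest(K, merged, key=lambda r: r["score"])

splits = [
    [{"id": i, "score": s} for i, s in enumerate([5, 1, 9, 4])],
    [{"id": 10 + i, "score": s} for i, s in enumerate([8, 2, 7, 6])],
]
top = reduce_task([map_task(split) for split in splits])
```

Because each mapper forwards only K records, the reducer's input grows with the number of map tasks rather than with the size of the data set.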
  45. Distinct 1/4
      Intent
      • To find a unique set of values from similar records.
      Motivation
      • Reducing a data set to a unique set of values has several uses.
      Applicability
      • You have duplicate values in your data set; it is silly to use this pattern otherwise.
  46. Distinct 2/4
      Structure
      • It exploits MapReduce's ability to group keys together to remove duplicates.
      • The mapper transforms the data; the reducer does not do much.
      • Duplicate records are often located close to one another in a data set, so a combiner will deduplicate them in the map phase.
      • The reducer groups the nulls together by key, so we'll have one null per key → simply output the key.
      map(key, record):
        emit(record, null)
      reduce(key, records):
        emit(key);
  47. Distinct 3/4
      Consequences
      • The output records are guaranteed to be unique, but no order is preserved, due to the random partitioning of the records.
      Known uses
      • Deduplicating data
      • Getting distinct values
      • Protecting from an inner join explosion
      Resemblances
      • SQL: SELECT DISTINCT * FROM table;
      • Pig: b = DISTINCT a;
  48. Distinct 4/4
      Performance analysis
      • The main consideration is the number of reducers you think you will need.
      • Basically, if duplicates are very rare within an input split, pretty much all of the data is going to be sent to the reduce phase.
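The map/combine/reduce roles of the Distinct pattern can be sketched in Python as below. `distinct` is a hypothetical in-memory stand-in for the framework, and the records are assumed to be hashable values such as strings.

```python
def map_fn(record):
    """Emit the whole record as the key, with a null value."""
    return record, None

def combine(pairs):
    """Combiner: drop duplicates that occur within a single map task."""
    return [(key, None) for key in dict(pairs)]

def reduce_fn(key, values):
    """All nulls for a key arrive together; output the key once."""
    return key

def distinct(splits):
    """Hypothetical stand-in for the framework: one map task per split."""
    groups = {}
    for split in splits:
        for key, value in combine(map_fn(record) for record in split):
            groups.setdefault(key, []).append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

unique = distinct([["a", "b", "a"], ["b", "c"]])
```

In this run the combiner removes the second "a" before the shuffle, while the duplicate "b" that spans two splits is only removed by the reducer.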
  49. 2.3 Data Organization Patterns
      The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted. This is especially true in distributed systems.
      Patterns:
      • Structured to Hierarchical
      • Partitioning
      • Binning
      • Total Order Sorting
      • Shuffling
  50. Structured to Hierarchical 1/3
      Intent
      • Transform your row-based data to a hierarchical format (JSON or XML).
      Motivation
      • Migrating data from an RDBMS to Hadoop (in place of table joins)
      • Reformatting your data into a more conducive structure
      Applicability
      • You have data sources that are linked by some set of foreign keys.
      • Your data is structured and row-based.
      (Figure: Posts, where each Post contains its Comments.)
  51. Structured to Hierarchical 2/3
      Structure
      • Mapper: loads the data and parses the records into one cohesive format.
      • Combiner: isn't going to help.
      • Reducer: builds the hierarchical data structure from the list of data items.
  52. Structured to Hierarchical 3/3
      Consequences
      • The output will be in a hierarchical form, grouped by the key that you specified.
      Known uses
      • Pre-joining data
      • Preparing data for HBase or MongoDB
      Performance analysis
      • How much data is being sent to the reducers from the mappers.
      • The memory footprint of the object that the reducer builds.
      • What about a post that has a million comments?
  53. Partitioning 1/3
      Intent
      • Move the records into categories; the order of the records does not matter.
      • Take similar records in a data set and partition them into distinct, smaller data sets.
      Motivation
      • If you want to look at a particular set of data, the data items are normally spread out across the entire data set → this requires an entire scan of all of the data.
      Applicability
      • You know how many partitions you are going to have ahead of time, e.g., by day of the week → 7 partitions.
  54. Partitioning 2/3
      Structure: a partitioner determines which partition a record goes to
  55. Partitioning 3/3
      Known uses
      • Partition pruning by continuous value (e.g., date)
      • Partition pruning by category: country, phone area code, language
      • Sharding (to different disks)
      Performance analysis
      • The resulting partitions will likely not have a similar number of records. One partition may hold 50% of the data.
      • If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.
  56. Binning 1/3
      Intent
      • For each record in the data set, file each one into one or more categories.
      Motivation
      • Binning is very similar to partitioning and often can be used to solve the same problem.
      • Binning splits data up in the map phase instead of in the partitioner.
      • Each mapper will now have one file per possible output bin.
      • 1,000 bins × 1,000 mappers = 1,000,000 files
  57. Binning 2/3
      Structure
      • Mapper: if the record meets the criteria, it is sent to that bin.
      • No combiner, partitioner, or reducer is used in this pattern.
  58. Binning 3/3
      Consequences
      • Each mapper outputs one small file per bin.
      Resemblances
      • Pig: SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
      Performance analysis
      • Map-only job → performance depends on how efficiently records are processed.
      • No sort, shuffle, or reduce is performed.
      • Most of the processing is done on data that is local.
  59. Total Order Sorting 1/3
      Intent
      • Sort your data in parallel on a sort key.
      Motivation
      • Each reducer sorts its own data by key, but the result is not globally sorted across all data.
      • Sorting in parallel is not easy.
      Applicability
      • Your sort key has to be comparable so the data can be ordered.
  60. Total Order Sorting 2/3
      Structure
      • Analyze phase: determines the ranges
        • Idea: partitions that evenly split the random sample should also split the larger data set evenly.
        • The mapper does a random sampling, based on the number of records in the total data set and the percentage of records you'll need to analyze.
        • Only one reducer: collects the sort keys together into a sorted list → the list of keys is sliced into the data range boundaries.
      • Order phase: actually sorts the data
        • # of reducers == # of partitions
        • A custom partitioner loads the partition file → data ranges
  61. Total Order Sorting 3/3
      Consequences
      • The output files will contain sorted data.
      Resemblances
      • SQL: SELECT * FROM data ORDER BY col1;
      • Pig: c = ORDER b BY col1;
      Performance analysis
      • Expensive!
      • The data is loaded and parsed twice:
        • Step 1. Build the partition ranges.
        • Step 2. Actually sort the data.
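The two phases can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not Hadoop's partitioner classes: `analyze` samples keys and slices the sorted sample into range boundaries, and `order` plays the role of the custom partitioner plus one sorting reducer per range. The data, sample rate, and partition count are made up.

```python
import bisect
import random

def analyze(records, num_partitions, sample_rate, rng):
    """Analyze phase: sample keys, sort them, slice into range boundaries."""
    sample = sorted(r for r in records if rng.random() < sample_rate)
    if not sample:
        return []                          # degenerate case: a single partition
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def order(records, boundaries):
    """Order phase: route each record to its key range, then sort each range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        partitions[bisect.bisect_right(boundaries, r)].append(r)
    return [sorted(p) for p in partitions]  # one sorted output per reducer

rng = random.Random(0)
data = [rng.randrange(100_000) for _ in range(1_000)]
boundaries = analyze(data, num_partitions=4, sample_rate=0.1, rng=rng)
parts = order(data, boundaries)
```

Concatenating the sorted partitions in order yields a globally sorted data set, and the data is indeed read twice: once by `analyze` and once by `order`.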
  62. Shuffling 1/3
      Intent
      • To completely randomize a set of records.
      Motivation
      • Shuffling for 綺夢 ("beautiful dream")
      • Shuffling for anonymizing the data
      • Shuffling for repeatable random sampling
  63. Shuffling 2/3
      Structure
      • Mapper: emits [random key, record]
      • Reducer: sorts the random keys → randomizing the data
      Consequences
      • Each reducer outputs a file containing random records.
      Resemblances
      • SQL: SELECT * FROM data ORDER BY RAND()
      • Pig: c = GROUP b BY RANDOM(); d = FOREACH c GENERATE FLATTEN(b);
  64. Shuffling 3/3
      Performance analysis
      • Nice performance properties.
      • Data distribution across reducers is completely balanced.
      • With more reducers, the data will be more spread out.
      • The size of the files is also very predictable: each is the size of the data set divided by the number of reducers. This makes it easy to get a specific desired file size as output.
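A minimal Python sketch of the shuffling structure follows, assuming an in-memory list of records. `shuffle_records` is a hypothetical stand-in: the mapper step attaches a random key, records are partitioned by key range, and each "reducer" sorts by the random key. A fixed seed is used so the result is repeatable.

```python
import random

def shuffle_records(records, num_reducers, seed):
    """Attach random keys, partition by key range, sort each partition by key."""
    rng = random.Random(seed)
    keyed = [(rng.random(), record) for record in records]   # mapper output
    outputs = [[] for _ in range(num_reducers)]
    for key, record in keyed:
        # key is in [0, 1), so the index is always in range
        outputs[int(key * num_reducers)].append((key, record))
    return [[record for _, record in sorted(part, key=lambda kr: kr[0])]
            for part in outputs]

records = list(range(20))
shuffled = shuffle_records(records, num_reducers=4, seed=42)
```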
  65. 2.4 Join Patterns
      Refresher on RDBMS joins
      • Inner Join
      • Outer Join
      • Cartesian Product
      • Anti Join = full outer join - inner join
      (An SQL query walks into a bar, sees two tables and asks them: "May I join you?")
      Patterns
      • Reduce Side Join
      • Replicated Join
      • Composite Join
      • Cartesian Product
  66. Reduce Side Join 1/3
      Intent
      • Join multiple large data sets together by some foreign key.
      Motivation
      • Simple to implement in reducers
      • Supports all the different join operations
      • No limitation on the size of your data sets
      Applicability
      • Multiple large data sets are being joined by a foreign key.
      • You want the flexibility of being able to execute any join operation.
      • A large amount of network bandwidth is available.
  67. Reduce Side Join 2/3
      Structure
      • Mapper: prepares [(foreign key, record)]
      • Reducer: performs the join operation
  68. Reduce Side Join 3/3
      Consequences
      • # of part files == # of reduce tasks.
      • Each part file contains its portion of the joined records.
      Resemblances
      • SQL: SELECT users.ID, users.Location, comments.upVotes FROM users [INNER|LEFT|RIGHT] JOIN comments ON users.ID=comments.UserID
      Performance analysis
      • The cluster's network bandwidth is the limiting factor!
      • Use relatively more reducers than for your typical analytic.
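The mapper/reducer split of a reduce-side join can be sketched in Python as below. The field names follow the users/comments SQL example only loosely and are assumptions, as are the tags "U" and "C" and the in-memory `join` driver; only inner and left joins are shown.

```python
from collections import defaultdict

def map_user(user):
    """Tag each record with its source and key it by the foreign key."""
    return user["id"], ("U", user)

def map_comment(comment):
    return comment["user_id"], ("C", comment)

def reduce_join(tagged, how="inner"):
    """All records sharing a foreign key arrive together; pair them up."""
    users = [rec for tag, rec in tagged if tag == "U"]
    comments = [rec for tag, rec in tagged if tag == "C"]
    if how == "left" and not comments:
        comments = [None]                  # keep users that have no comments
    return [(u, c) for u in users for c in comments]

def join(users, comments, how="inner"):
    """Hypothetical in-memory driver: shuffle by key, then reduce each group."""
    groups = defaultdict(list)
    mapped = [map_user(u) for u in users] + [map_comment(c) for c in comments]
    for key, tagged in mapped:
        groups[key].append(tagged)
    return [pair for key in sorted(groups)
            for pair in reduce_join(groups[key], how)]

users = [{"id": 1, "location": "Taipei"}, {"id": 2, "location": "Boston"}]
comments = [{"user_id": 1, "up_votes": 5}, {"user_id": 1, "up_votes": 2}]
inner = join(users, comments)
left = join(users, comments, how="left")
```

Every record of both data sets passes through the shuffle, which is where the network cost of this pattern comes from.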
  69. Replicated Join 1/3
      Intent
      • Eliminate the need to shuffle any data to the reduce phase.
      Motivation
      • All the data sets except the very large one are read into memory during the setup phase of each map task, which is limited by the JVM heap.
      Applicability
      • All of the data sets, except for the large one, fit into the main memory of each map task.
  70. Replicated Join 2/3
      Structure
      • Map-only pattern
      • Read all files from the distributed cache and store them in in-memory lookup tables.
  71. Replicated Join 3/3
      Consequences
      • # of part files == # of map tasks.
      • The part files contain the full set of joined records.
      Performance analysis
      • A replicated join can be the fastest type of join executed because no reducer is required.
      • The limit is the amount of data that can be stored safely inside the JVM.
  72. Composite Join 1/4
      Intent
      • A join performed on the map side with many very large formatted inputs.
      • Completely eliminates the need to shuffle and sort all the data to the reduce phase.
      • Requires the data to be already organized or prepared in a very specific way.
      Motivation
      • Particularly useful if you want to join very large data sets together.
      • The data sets must first be sorted by foreign key, partitioned by foreign key, and read in a very particular manner.
  73. Composite Join 2/4
      Applicability
      • An inner or full outer join is desired.
      • All the data sets are sufficiently large.
      • All data sets can be read with the foreign key as the input key to the mapper.
      • All data sets have the same number of partitions.
      • Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set.
      • The data sets do not change often (if they have to be prepared).
  74. Composite Join 3/4
      Structure
      • Map-only
      • The mapper is trivial.
      • Two values are retrieved from the input tuple and output to the file system.
  75. Composite Join 4/4
      Consequences
      • # of output part files == # of map tasks.
      Performance analysis
      • Can be executed relatively quickly over large data sets.
      • Data preparation = sorting cost.
      • The cost of producing these prepared data sets is averaged out over all of the runs.
  76. Cartesian Product 1/3
      Intent
      • Pair up and compare every single record with every other record in a data set.
      Motivation
      • Simply pairs every record of a data set with every record of all the other data sets.
      • To analyze relationships between one or more data sets.
      Applicability
      • You want to analyze relationships between all pairs of individual records.
      • You've exhausted all other means to solve this problem.
      • You have no time constraints on execution.
  77. Cartesian Product 2/3
      Structure
      • Map-only
      • RecordReader job
  78. Cartesian Product 3/3
      Consequences
      • The final data set is made up of tuples whose size equals the number of input data sets.
      • Every possible tuple combination from the input records is represented in the final output.
      Resemblances
      • SQL: SELECT * FROM tableA, tableB;
      Performance analysis
      • A massive explosion in data size: O(n^2).
      • If a single input split contains a thousand records → the right input split needs to be read a thousand times before the task can finish.
      • If a single task fails for an odd reason, the whole thing needs to be restarted.
  79. 2.5 Metapatterns (skipped)
      Patterns about using patterns
      • Job Chaining: piecing together several patterns to solve complex, multistage problems
      • Chain Folding
      • Job Merging: an optimization for performing several analytics in the same MapReduce job
  80. 2.6 Input and Output Patterns
      Customizing input and output in Hadoop for loading data from disk
      • Configuring how contiguous chunks of input are generated from blocks in HDFS
      • Configuring how records appear in the map phase
      • RecordReader and InputFormat classes
      • RecordWriter and OutputFormat classes
      Patterns
      • Generating Data
      • External Source Output
      • External Source Input
      • Partition Pruning
  81. Generating Data 1/3
      Intent
      • You want to generate a lot of data from scratch.
      Motivation
      • The job doesn't load data → it generates the data and stores it back in the distributed file system.
  82. Generating Data 2/3
      Structure
      • Map-only
  83. Generating Data 3/3
      Consequences
      • Each mapper outputs a file containing random data.
      Performance analysis
      • How many worker map tasks are needed to generate the data.
      • In general, the more map tasks you have, the faster you can generate data.
  84. External Source Output 1/3
      Intent
      • To write MapReduce output to a nonnative location (outside of Hadoop and HDFS).
      Motivation
      • To output data from the MapReduce framework directly to an external source.
      • This is extremely useful for direct loading into a system instead of staging the data to be delivered to the external source.
  85. External Source Output 2/3
      Structure
  86. External Source Output 3/3
      Consequences
      • The output data has been sent to the external source and that external source has loaded it successfully.
      Performance analysis
      • The receiver of the data must be able to handle the parallel connections.
      • Having a thousand tasks writing to a single SQL database is not going to work well.
  87. External Source Input 1/3
      Intent
      • You want to load data in parallel from a source that is not part of your MapReduce framework.
      Motivation
      • The typical model for using MapReduce to analyze data is to store it in HDFS first.
      • With this pattern, you can hook up the MapReduce framework to an external source, such as a database or a web service, and pull the data directly into the mappers.
  88. External Source Input 2/3
      Structure
  89. External Source Input 3/3
      Consequences
      • Data is loaded from the external source into the MapReduce job.
      • The map phase doesn't care where that data came from.
      Performance analysis
      • Bottleneck: the source or the network.
      • The source may not scale well with multiple connections (e.g., a single-threaded SQL database).
      • If the source is not in the cluster's network, the connections may be reaching out on a single connection on a slower public network.
  90. Partition Pruning 1/3
      Intent
      • You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.
      Motivation
      • Loading all of the files is a large waste of processing time.
      • By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist.
  91. Partition Pruning 2/3
      Structure
  92. Partition Pruning 3/3
      Consequences
      • Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic.
      Performance analysis
      • Utilizing this pattern can provide massive gains by reducing the number of tasks that would be created without generating any output anyway.
      • Outside of the I/O, the performance depends on the other pattern being applied in the map and reduce phases of the job.
  93. The End (Finally…)
      Thanks for your attention.
      • MapReduce has proven to be a useful abstraction
      • Greatly simplifies large-scale computations
      • Hadoop is widely used
      • Focus on problems, let MapReduce deal with the messy details.
      Any questions?