Map Reduce Online
Map Reduce Online presentation at the Hadoop Bay Area User Group, March 24 at Yahoo! Sunnyvale campus
Notes

  • Target: batch-oriented computations
  • Define the combiner
  • Vertical line indicates the alert
  • Data transfer is scheduled by the reducer; reduce partitions are serviced in arbitrary order; poor spatial locality!
  • Operators block; poor CPU and I/O overlap; reduce task idle periods
  • Cluster capacity: 80 maps, 60 reduces
  • First plateau due to 80 maps finishing at the same time; subsequent map waves get spread out by the scheduler duty cycle.
  • Transcript of "Map Reduce Online"

    1. MapReduce Online
       Tyson Condie, UC Berkeley
       Joint work with Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley), Khaled Elmeleegy and Russell Sears (Yahoo! Research)
    2. MapReduce Programming Model
       Think data-centric: apply a two-step transformation to data sets
       Map step: Map(k1, v1) -> list(k2, v2)
         Apply the map function to input records
         Divide output records into groups
       Reduce step: Reduce(k2, list(v2)) -> list(v3)
         Consolidate groups from the map step
         Apply the reduce function to each group
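For concreteness, the classic word-count job is a minimal instance of this model. The sketch below uses the stock Hadoop `org.apache.hadoop.mapreduce` API (plain Hadoop, nothing HOP-specific): the mapper emits (word, 1) pairs and the reducer sums each group.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // one output record per word occurrence
    }
  }
}

// Reduce(k2, list(v2)) -> list(v3): sum the counts for each word group.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```

The same reducer class is typically reused as a map-side combiner, which matters for the pipelining discussion later in the talk.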
    3. MapReduce System Model
       Shared-nothing architecture, tuned for massive data parallelism
         Many maps operate on portions of the input
         Many reduces, each assigned specific groups
       Batch-oriented computations over huge data sets
         Runtimes range from minutes to hours
         Execute on 10s to 1000s of machines
         Failures are common (fault tolerance crucial)
       Fault tolerance via operator restart, since...
         Operators complete before producing any output
         Atomic data exchange between operators
       Simple, elegant, easy to remember
    4. Life Beyond Batch
       MapReduce is often used for analytics on streams of data that arrive continuously
         Click streams, network traffic, web crawl data, ...
       Batch approach: buffer, load, process
         High latency, not scalable
       Online approach: run MR jobs continuously
         Analyze data as it arrives
    5. Online Query Processing
       Two domains of interest (at massive scale):
         Online aggregation: interactive data analysis (watch the answer evolve)
         Stream processing: continuous (real-time) analysis of data streams
       Blocking operators are a poor fit
         Final answers only; no infinite streams
       Operators need to pipeline
         BUT we must retain fault tolerance
         AND Keep It Simple, Stupid!
    6. A Brave New MapReduce World
       Pipelined MapReduce
         Maps can operate on infinite data (stream processing)
         Reduces can export early answers (online aggregation)
       Hadoop Online Prototype (HOP)
         Hadoop with pipelining support
         Preserves Hadoop interfaces and APIs
         Pipelining fault tolerance model
    7. Outline
       Hadoop MR Background
       Hadoop Online Prototype (HOP)
       Online Aggregation
       Stream Processing
       Performance (blocking vs. pipelining)
       Future Work
    8. Hadoop Architecture
       Hadoop MapReduce
         Single master node (JobTracker), many worker nodes (TaskTrackers)
         Client submits a job to the JobTracker (see the driver sketch below)
         JobTracker splits each job into tasks (map/reduce)
         Assigns tasks to TaskTrackers on demand
       Hadoop Distributed File System (HDFS)
         Single name node, many data nodes
         Data is stored as fixed-size (e.g., 64MB) blocks
         HDFS typically holds map input and reduce output
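To make the "client submits a job to the JobTracker" step concrete, here is a minimal driver sketch in the standard Hadoop style, reusing the word-count classes from the earlier sketch; the input/output paths are just whatever the user passes on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: the client-side code that packages a job and submits it to the JobTracker.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
    job.setCombinerClass(WordCountReducer.class);  // map-side pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // map input lives in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output written to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```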
    9. Hadoop Job Execution
       [Diagram: the client submits the job; the JobTracker schedules map and reduce tasks onto TaskTrackers]
    10. Hadoop Job Execution
        [Diagram: each map task reads its input block (Block 1, Block 2) from HDFS]
    11. Hadoop Job Execution
        [Diagram: finished maps report their output locations to the JobTracker; map output is stored on the local FS]
        Map output: sorted by group id and key
        group id = hash(key) mod # reducers
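The "group id = hash(key) mod # reducers" rule is exactly what Hadoop's default HashPartitioner does; a small sketch of the same logic:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Group id assignment: hash(key) mod (# reducers), mirroring Hadoop's default HashPartitioner.
public class ModuloPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so negative hash codes still map to a valid partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```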
    12. Hadoop Job Execution
        [Diagram: reduce tasks pull map output from each map machine's local FS via HTTP GET]
    13. Hadoop Job Execution
        [Diagram: reduce tasks write the final answer to HDFS]
        Input: sorted runs of records assigned the same group id
        Process: merge-sort the runs; for each final group, call reduce
    14. Hadoop Online Prototype (HOP)
        Pipelining between operators
          Data pushed from producers to consumers
          Producers schedule data transfer concurrently with operator computation
          Consumers notify producers early (ASAP)
        HOP API
          No changes required to existing clients (Pig, Hive, Jaql still work)
          Configuration for pipeline/block modes (illustrated below)
          JobTracker accepts a series of jobs
    15. HOP Job Pipelined Execution
        [Diagram: the JobTracker schedules tasks and shares map locations; reduces send pipeline requests to maps, which push output as it is produced]
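The slides say that pipeline vs. block mode is a per-job configuration choice but do not show the property name, so the key used below is a made-up stand-in; only the pattern (toggling a boolean in the job configuration, with no client code changes) is meant to be illustrative.

```java
import org.apache.hadoop.mapred.JobConf;

public class PipelineModeConfig {
  public static JobConf pipelinedJobConf() {
    JobConf conf = new JobConf();
    // Hypothetical property name: the actual HOP key is not given on the slide,
    // so "mapred.job.pipeline" is a stand-in used purely for illustration.
    conf.setBoolean("mapred.job.pipeline", true);  // pipeline mode (vs. traditional block mode)
    return conf;
  }
}
```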
    16. Pipelining Data Unit
        Initial design: pipeline eagerly (each record)
          Prevents map-side pre-aggregation (a.k.a. the combiner)
          Moves all the sorting work to the reduce step
          Map computation can block on network I/O
        Revised design: pipeline small sorted runs (spills)
          Task thread: apply the (map/reduce) function, buffer output
          Spill thread: sort & combine the buffer, spill to a file
          TaskTracker: sends spill files to consumers
        Simple adaptive algorithm
          Halt the pipeline when (1) spill files back up OR (2) the combiner is effective
          Resume the pipeline by first merging & combining accumulated spill files into a single file
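A rough sketch of that adaptive halt/resume policy, with invented class names and thresholds (this is not HOP's actual code, just the control flow described on the slide):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the adaptive spill-pipelining policy; names and
// threshold values are invented for exposition, not taken from HOP.
public class SpillPipelineSketch {
  private final List<File> pendingSpills = new ArrayList<>();
  private static final int BACKLOG_THRESHOLD = 4;    // "spill files back up"
  private static final double COMBINE_RATIO = 0.5;   // combiner shrinks data enough to be worthwhile

  // Called each time the spill thread finishes writing a sorted run.
  public void onSpill(File spill, long rawBytes, long combinedBytes) {
    pendingSpills.add(spill);
    boolean backlog = pendingSpills.size() > BACKLOG_THRESHOLD;
    boolean combinerEffective = combinedBytes < COMBINE_RATIO * rawBytes;
    if (backlog || combinerEffective) {
      return;  // halt pipelining: keep accumulating spills so they can be merged and re-combined
    }
    sendToConsumers(pendingSpills);  // TaskTracker pushes the sorted runs to reducers
    pendingSpills.clear();
  }

  // Resume pipelining: merge and combine the accumulated spills into one file, then ship it.
  public void resume() {
    if (pendingSpills.isEmpty()) {
      return;
    }
    File merged = mergeAndCombine(pendingSpills);
    pendingSpills.clear();
    sendToConsumers(List.of(merged));
  }

  private File mergeAndCombine(List<File> spills) { /* merge-sort + combiner; omitted */ return spills.get(0); }
  private void sendToConsumers(List<File> spills) { /* HTTP transfer handled by the TaskTracker; omitted */ }
}
```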
    17. Pipelined Fault Tolerance (PFT)
        Simple PFT design:
          Reduce treats in-progress map output as tentative
          If the map dies, throw away its output
          If the map succeeds, accept its output
        Revised PFT design:
          Spill files have deterministic boundaries and are assigned a sequence number
          Correctness: reduce tasks ensure spill files are idempotent
          Optimization: map tasks avoid sending redundant spill files
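One way to picture the reduce-side bookkeeping: each (map task, spill sequence number) pair is accepted at most once, so a restarted map that re-sends earlier spills has no effect. A minimal sketch with invented names:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: reduce-side bookkeeping that makes spill delivery idempotent.
// Because spill boundaries are deterministic, spill i from a restarted map is
// identical to spill i from the original attempt and can be safely skipped.
public class SpillReceiverSketch {
  private final Map<String, Set<Integer>> acceptedSpills = new HashMap<>();

  /** Returns true if this spill should be consumed, false if it is a duplicate. */
  public synchronized boolean accept(String mapTaskId, int spillSequenceNumber) {
    Set<Integer> seen = acceptedSpills.computeIfAbsent(mapTaskId, id -> new HashSet<>());
    return seen.add(spillSequenceNumber);  // add() returns false if this spill was already accepted
  }
}
```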
    18. Benefits of Pipelining
        Online aggregation
          An early view of the result from a running computation
          Interactive data analysis (you say when to stop)
        Stream processing
          Tasks operate on infinite data streams
          Real-time data analysis
        Performance? Pipelining can...
          Improve CPU and I/O overlap
          Steady network traffic (fewer load spikes)
          Improve cluster utilization (reducers presort earlier)
    19. Outline
        Hadoop MR Background
        Hadoop Online Prototype (HOP)
        Online Aggregation
          Implementation
          Example Approximation Query
        Stream Processing
        Performance (blocking vs. pipelining)
        Future Work
    20. Implementation
        [Diagram: maps read input blocks from HDFS; reduces periodically write snapshot answers back to HDFS]
        Execute the reduce task on intermediate data
        Intermediate results published to HDFS
    21. Example Approximation Query
        The data:
          Wikipedia traffic statistics (1TB)
          Webpage clicks/hour
          5066 compressed files (each file = 1 hour of click logs)
        The query:
          group by language and hour
          count clicks
        The approximation:
          Final answer ~ intermediate click count * scale-up factor
          Two candidate scale-up factors:
            Job progress: 1.0 / fraction of input processed
            Sample fraction: total # of hours / # of hours sampled (counting partially processed hours fractionally, since each file = 1 hour of click logs)
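The arithmetic behind the two scale-up estimators, with made-up numbers purely for illustration (they are not taken from the actual Wikipedia run):

```java
// Illustrative arithmetic for the two scale-up estimators described above.
public class ScaleUpEstimate {
  public static void main(String[] args) {
    long intermediateClicks = 1_200_000;  // clicks counted so far for one (language, hour) group

    // Estimator 1: scale by job progress (assumes the input is uniformly sampled).
    double fractionOfInputProcessed = 0.02;  // 2% of the 1TB input read so far
    double byProgress = intermediateClicks / fractionOfInputProcessed;

    // Estimator 2: scale by the sample fraction of the hours themselves
    // (each file = 1 hour of click logs; partially read files count fractionally).
    double totalHours = 5066;
    double hoursSampled = 97.5;
    double bySampleFraction = intermediateClicks * (totalHours / hoursSampled);

    System.out.printf("estimate by job progress:    %.0f clicks%n", byProgress);
    System.out.printf("estimate by sample fraction: %.0f clicks%n", bySampleFraction);
  }
}
```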
    22. Bar graph shows results for a single hour (1600)
        Taken less than 2 minutes into a ~2 hour job!
    23. Approximation error: |estimate - actual| / actual
        Job progress assumes hours are uniformly sampled
        Sample fraction is closer to the true sample distribution of each hour
    24. Outline
        Hadoop MR Background
        Hadoop Online Prototype (HOP)
        Online Aggregation
        Stream Processing
          Implementation
          Use case: real-time monitoring system
        Performance (blocking vs. pipelining)
        Future Work
    25. Implementation
        Map and reduce tasks run continuously
          Challenge: what if the number of tasks exceeds slot capacity?
          Current approach: wait for the required slot capacity
        Map tasks stream spill files
          Input taken from an arbitrary source (MR job, socket, etc.)
          Garbage collection handled by the system
        Window management done at the reducer
        Reduce function arguments
          Input data: the set of current input records
          OutputCollector: output records for the current window
          InputCollector: records for subsequent windows
          Return value says when to call next
            e.g., in X milliseconds, after receiving Y records, etc.
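The reduce-side window contract can be pictured roughly as follows; the interface and class names are a reconstruction from the bullets above, not HOP's actual API.

```java
import java.util.List;

// Reconstruction of the windowed-reduce contract sketched above; these names are illustrative.
interface WindowedReducer<K, V, OUT> {
  /**
   * Called once per window.
   * @param records   the current window's input records
   * @param out       collector for this window's output records
   * @param carryOver collector for records that belong to subsequent windows
   * @return          milliseconds until the next invocation
   */
  long reduce(K key, List<V> records, Collector<OUT> out, Collector<V> carryOver);
}

interface Collector<T> {
  void collect(T record);
}

// Example: a 5-second load average over numeric samples for one host.
class LoadAverageReducer implements WindowedReducer<String, Double, String> {
  @Override
  public long reduce(String host, List<Double> samples, Collector<String> out, Collector<Double> carryOver) {
    double avg = samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    out.collect(host + "\t" + avg);
    return 5_000;  // ask to be called again in 5 seconds (the next window)
  }
}
```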
    26. Real-time Monitoring System
        Use MapReduce to monitor MapReduce
          Continuous jobs that monitor cluster health
          Same scheduler for user jobs and monitoring jobs (economy of mechanism)
        Agents monitor machines
          Record statistics of interest (/proc, log files, etc.; see the sketch below)
          Implemented as a continuous map task
        Aggregators group agent-local statistics
          High-level (rack, datacenter) analysis and correlations
          Reduce windows: 1, 5, and 15 second load averages
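As a sketch of what such an agent might do (illustrative only, not HOP's monitoring code), the loop below samples a swap counter from /proc/vmstat once a second and raises an alert when the per-second delta crosses a threshold, which is the scenario reported on the next slide.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch of a monitoring agent: read swap-out counters from
// /proc/vmstat and raise an alert once a threshold is crossed.
public class VmstatSwapMonitor {
  private static final long SWAP_OUT_THRESHOLD = 1000;  // pages swapped out per sample (illustrative)

  public static void main(String[] args) throws IOException, InterruptedException {
    long previous = readCounter("pswpout");
    while (true) {
      Thread.sleep(1000);  // sample roughly once per second
      long current = readCounter("pswpout");
      long delta = current - previous;
      previous = current;
      if (delta > SWAP_OUT_THRESHOLD) {
        // In the real system this record would be emitted to the aggregator reduce task.
        System.err.println("ALERT: heavy swapping, " + delta + " pages swapped out in the last second");
      }
    }
  }

  private static long readCounter(String name) throws IOException {
    // /proc/vmstat lines look like "pswpout 12345".
    return Files.readAllLines(Paths.get("/proc/vmstat")).stream()
        .filter(line -> line.startsWith(name + " "))
        .mapToLong(line -> Long.parseLong(line.split("\\s+")[1]))
        .findFirst()
        .orElse(0L);
  }
}
```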
    27. Monitor /proc/vmstat for swapping
        Alert triggered after some threshold
        Alert reported around a second after passing the threshold
          Faster than the (~5 second) TaskTracker reporting interval
        Feedback loop to the JobTracker for better scheduling
    Outline
        Hadoop MR Background
        Hadoop Online Prototype (HOP)
        Online Aggregation
        Stream Processing
        Performance (blocking vs. pipelining)
          Not a comprehensive study, but obvious cases exist
        Future Work
    28. Performance
        Open problem! A lot of performance-related work still remains
        Focus on the obvious cases first
        Why block?
          Effective combiner
          Reduce step is a bottleneck
        Why pipeline?
          Improve cluster utilization
          Smooth out network traffic
    29. Blocking vs. Pipelining
        [Timeline annotations: 2 maps sort and send output to the reducer; 3rd map finishes, sorts, and sends; final map finishes, sorts, and sends]
        Simple wordcount on two (small) EC2 nodes
          Map machine: 2 map slots
          Reduce machine: 2 reduce slots
        Input: 2GB of data, 512MB block size
          So the job contains 4 maps and (a hard-coded) 2 reduces
    30. Blocking vs. Pipelining
        [Timeline annotations: in the blocking run, the reduce task sits idle for 6.5 minutes while map outputs (1st through 4th) arrive in waves; in the pipelined run there are no significant idle periods during the shuffle phase, and the job completes when the reduce finishes its final merge-sort]
        Job completion time: ~15 minutes (blocking) vs. ~9 minutes (pipelining)
        (Same setup: simple wordcount on two small EC2 nodes, 2 map slots, 2 reduce slots, 2GB input, 512MB block size, 4 maps and a hard-coded 2 reduces)
    31. Recall in blocking...
        Operators block
          Poor CPU and I/O overlap
          Reduce task idle periods
        Only the final answer is fetched
          So more data is fetched, resulting in...
          Network traffic spikes, especially when a group of maps finish
    32. CPU Utilization (Amazon CloudWatch)
        Mapper CPU: map tasks loading 2GB of data
        Reducer CPU: with pipelining, reduce tasks start working (presorting) early; with blocking, the reduce task has a 6.5 minute idle period
    33. Recall in blocking...
        Operators block
          Poor CPU and I/O overlap
          Reduce task idle periods
        Only the final answer is fetched
          So more data is fetched at once, resulting in...
          Network traffic spikes, especially when a group of maps finish
    34. Network Traffic Spikes (map machine network out, Amazon CloudWatch)
        [Annotations: first 2 maps finish and send output; 3rd map finishes and sends output; last map finishes and sends output]
    35. A more realistic setup
        20 extra-large nodes: 4 cores, 15GB RAM
        Slot capacity: 80 maps (4 per node), 60 reduces (3 per node)
        Evaluate using a wordcount job
          Input: 100GB of randomly generated words
          Block size: 512MB (240 maps) and 32MB (3120 maps)
          Hard-coded 60 reduce tasks
    36. Adaptive vs. Fixed Block Size
        512MB block size, 240 maps
        Job completion time: ~49 minutes (blocking) vs. ~36 minutes (pipelining)
        Maps scheduled in 3 waves (1st 80, 2nd 80, last 80)
        Large block size creates reduce idle periods (poor CPU and I/O overlap)
        Pipelining minimizes idle periods
    37. Adaptive vs. Fixed Block Size
        32MB block size, 3120 maps
        Job completion time: ~42 minutes (blocking) vs. ~35 minutes (pipelining)
        Maps scheduled in 39 waves; maps finish fast!
        Small block size improves CPU and I/O overlap
          BUT increases scheduling overhead, so not a scalable solution
        The adaptive policy finds the right degree of pipelined parallelism
          Based on runtime dynamics (load, network capacity, etc.)
    38. Future Work
        Blocking vs. Pipelining
          Thorough performance analysis at scale
          Hadoop optimizer
        Online Aggregation
          Statistically robust estimation
          Random sampling of the input
          Better UI for approximate results
        Stream Processing
          Develop a full-fledged stream processing framework
          Stream support for high-level query languages
    39. Thank you! Questions?
        MapReduce Online paper: in NSDI 2010
        Source code: http://code.google.com/p/hop/
        Contact: tcondie@cs.berkeley.edu
