MapReduce Online

Map Reduce Online presentation at the Hadoop Bay Area User Group, March 24 at Yahoo! Sunnyvale campus


Slide notes
  • TARGET: batch-oriented computations
  • DEFINE COMBINER!!!
  • Vertical line indicates alert
  • Data transfer scheduled by reducer. Reduce partitions serviced in arbitrary order. Poor spatial locality!
  • Operators block. Poor CPU and I/O overlap. Reduce task idle periods.
  • Cluster capacity: 80 maps, 60 reduces
  • First plateau due to 80 maps finishing at the same time. Subsequent map waves get spread out by the scheduler duty cycle.
  • Transcript

    • 1. MapReduce Online
      Tyson Condie, UC Berkeley
      Joint work with Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley), Khaled Elmeleegy and Russell Sears (Yahoo! Research)
    • 2. MapReduce Programming Model
      Think data-centric: apply a two-step transformation to data sets
      Map step: Map(k1, v1) -> list(k2, v2)
        Apply the map function to input records
        Divide output records into groups
      Reduce step: Reduce(k2, list(v2)) -> list(v3)
        Consolidate groups from the map step
        Apply the reduce function to each group
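
      A minimal word-count sketch of the two signatures above, written in plain Java rather than the Hadoop API; class and method names are illustrative only.

      // Minimal word-count sketch of the two-step model on this slide.
      // Plain Java, not the Hadoop API; names and types are illustrative.
      import java.util.*;

      public class WordCountSketch {
          // Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line
          static List<Map.Entry<String, Integer>> map(Long offset, String line) {
              List<Map.Entry<String, Integer>> out = new ArrayList<>();
              for (String word : line.toLowerCase().split("\\s+")) {
                  if (!word.isEmpty()) out.add(Map.entry(word, 1));
              }
              return out;
          }

          // Reduce(k2, list(v2)) -> list(v3): sum the counts for one word (group)
          static List<Integer> reduce(String word, List<Integer> counts) {
              int sum = 0;
              for (int c : counts) sum += c;
              return List.of(sum);
          }

          public static void main(String[] args) {
              // Group map output by key, then apply reduce to each group
              Map<String, List<Integer>> groups = new TreeMap<>();
              for (var kv : map(0L, "to be or not to be")) {
                  groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
              }
              groups.forEach((w, cs) -> System.out.println(w + " -> " + reduce(w, cs)));
          }
      }
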
    • 3. MapReduce System Model
      Shared-nothing architecture, tuned for massive data parallelism
        Many maps operate on portions of the input
        Many reduces, each assigned specific groups
      Batch-oriented computations over huge data sets
        Runtimes range from minutes to hours
        Execute on 10s to 1000s of machines
        Failures common (fault tolerance crucial)
      Fault tolerance via operator restart since …
        Operators complete before producing any output
        Atomic data exchange between operators
      Simple, elegant, easy to remember
    • 4. Life Beyond Batch
      MapReduce often used for analytics on streams of data that arrive continuously
        Click streams, network traffic, web crawl data, …
      Batch approach: buffer, load, process
        High latency
        Not scalable
      Online approach: run MR jobs continuously
        Analyze data as it arrives
    • 5. Online Query Processing
      Two domains of interest (at massive scale):
        Online aggregation: interactive data analysis (watch the answer evolve)
        Stream processing: continuous (real-time) analysis of data streams
      Blocking operators are a poor fit
        Final answers only
        No infinite streams
      Operators need to pipeline
        BUT we must retain fault tolerance
        AND Keep It Simple, Stupid!
    • 6. A Brave New MapReduce World
      Pipelined MapReduce
        Maps can operate on infinite data (stream processing)
        Reduces can export early answers (online aggregation)
      Hadoop Online Prototype (HOP)
        Hadoop with pipelining support
        Preserves Hadoop interfaces and APIs
        Pipelining fault tolerance model
    • 7. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
      Performance (blocking vs. pipelining)
      Future Work
    • 8. Hadoop Architecture
      Hadoop MapReduce
        Single master node (JobTracker), many worker nodes (TaskTrackers)
        Client submits a job to the JobTracker
        JobTracker splits each job into tasks (map/reduce)
        Assigns tasks to TaskTrackers on demand
      Hadoop Distributed File System (HDFS)
        Single name node, many data nodes
        Data is stored as fixed-size (e.g., 64MB) blocks
        HDFS typically holds map input and reduce output
    • 9. Hadoop Job Execution
      [Diagram: the client submits a job; the JobTracker schedules map and reduce tasks]
    • 10. Hadoop Job Execution
      [Diagram: map tasks read their input blocks (Block 1, Block 2) from HDFS]
    • 11. Hadoop Job Execution
      [Diagram: finished map locations are reported; map output is written to each map's local FS]
      Map output: sorted by group id and key
        group id = hash(key) mod # reducers
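
      The group-id rule above is essentially Hadoop's default hash partitioning; a standalone sketch (class name illustrative), where the bitmask keeps the result non-negative even when hashCode() is negative.

      // Sketch of the "group id = hash(key) mod # reducers" rule on this slide.
      public class GroupIdSketch {
          static int groupId(Object key, int numReducers) {
              return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
          }

          public static void main(String[] args) {
              System.out.println(groupId("hadoop", 60));  // same key always maps to the same reducer
          }
      }
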
    • 12. Hadoop Job Execution
      [Diagram: reduce tasks fetch map output from each map's local FS via HTTP GET]
    • 13. Hadoop Job Execution
      [Diagram: reduce tasks write the final answer to HDFS]
      Input: sorted runs of records assigned the same group id
      Process: merge-sort the runs; for each final group, call reduce
    • 14. Hadoop Online Prototype (HOP)
      Pipelining between operators
        Data pushed from producers to consumers
        Producers schedule data transfer concurrently with operator computation
        Consumers notify producers early (ASAP)
      HOP API
        No changes required to existing clients (Pig, Hive, Jaql still work)
        Configuration for pipeline/block modes (see the sketch after slide 15)
        JobTracker accepts a series of jobs
    • 15. HOP Job Pipelined Execution
      [Diagram: the JobTracker schedules tasks and shares map locations; reduce tasks send pipeline requests to map tasks]
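
      A hedged sketch of the pipeline/block mode switch mentioned on slide 14. The property name "mapred.map.pipeline" and the paths below are assumptions for illustration, not documented HOP configuration; mapper/reducer setup is omitted.

      import java.io.IOException;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class PipelineModeExample {
          public static void main(String[] args) throws IOException {
              JobConf job = new JobConf(PipelineModeExample.class);
              // Mapper/reducer classes omitted; paths are placeholders.
              FileInputFormat.setInputPaths(job, new Path("/wikilogs/in"));
              FileOutputFormat.setOutputPath(job, new Path("/wikilogs/out"));

              // Assumed property name, not a documented HOP key:
              // true = pipeline spills to reducers, false = classic blocking shuffle.
              job.setBoolean("mapred.map.pipeline", true);

              JobClient.runJob(job);
          }
      }
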
    • 16. Pipelining Data Unit
      Initial design: pipeline eagerly (each record)
        Prevents map-side preaggregation (a.k.a. the combiner)
        Moves all the sorting work to the reduce step
        Map computation can block on network I/O
      Revised design: pipeline small sorted runs (spills)
        Task thread: apply the (map/reduce) function, buffer output
        Spill thread: sort & combine the buffer, spill to a file
        TaskTracker: sends spill files to consumers
      Simple adaptive algorithm
        Halt the pipeline when (1) spill files back up OR (2) the combiner is effective
        Resume the pipeline by first merging & combining accumulated spill files into a single file
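
      One way the adaptive halt/resume policy above could look in code; the backlog threshold, the combiner-effectiveness ratio, and all names are assumptions, not HOP's implementation.

      // Sketch of the adaptive spill-pipelining policy from this slide.
      // Thresholds and the effectiveness test are illustrative assumptions.
      import java.util.ArrayDeque;
      import java.util.Deque;

      class SpillPipelinePolicy {
          private final Deque<String> pendingSpills = new ArrayDeque<>(); // spill file paths
          private static final int MAX_PENDING_SPILLS = 4;      // assumed backlog threshold
          private static final double COMBINE_RATIO = 0.5;      // assumed "effective combiner" cutoff

          /** Halt pipelining when spill files back up or the combiner is paying off. */
          boolean shouldHalt(double combinerOutputOverInput) {
              return pendingSpills.size() > MAX_PENDING_SPILLS
                      || combinerOutputOverInput < COMBINE_RATIO;
          }

          /** Resume by first merging and combining accumulated spills into one file. */
          String resume() {
              String merged = mergeAndCombine(pendingSpills);
              pendingSpills.clear();
              return merged; // the TaskTracker ships this single file to consumers
          }

          void addSpill(String path) { pendingSpills.add(path); }

          private String mergeAndCombine(Deque<String> spills) {
              // Placeholder: a real implementation would merge-sort the runs and
              // re-apply the combiner before writing a single output file.
              return "merged-" + spills.size() + ".spill";
          }
      }
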
    • 17. Pipelined Fault Tolerance (PFT)
      Simple PFT design:
        Reduce treats in-progress map output as tentative
        If the map dies, throw away its output
        If the map succeeds, accept its output
      Revised PFT design:
        Spill files have deterministic boundaries and are assigned a sequence number
        Correctness: reduce tasks ensure spill files are idempotent
        Optimization: map tasks avoid sending redundant spill files
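
      A sketch of the reduce-side bookkeeping the revised design implies: tracking the highest spill sequence number applied per map task makes replayed spills idempotent and tells a restarted map what it can skip resending. Names, and the assumption that spills arrive in order, are illustrative.

      // Sketch of idempotent spill handling on the reduce side, keyed by the
      // per-map-task spill sequence numbers described on this slide.
      import java.util.HashMap;
      import java.util.Map;

      class SpillTracker {
          // Highest spill sequence number already applied, per map task id.
          private final Map<String, Integer> applied = new HashMap<>();

          /** Returns true if this spill should be consumed; false if it is a duplicate. */
          boolean accept(String mapTaskId, int spillSeq) {
              int last = applied.getOrDefault(mapTaskId, -1);
              if (spillSeq <= last) {
                  return false;            // already seen: ignoring it keeps spill delivery idempotent
              }
              applied.put(mapTaskId, spillSeq);
              return true;
          }

          /** Lets a (restarted) map task ask what it can skip resending. */
          int lastApplied(String mapTaskId) {
              return applied.getOrDefault(mapTaskId, -1);
          }
      }
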
    • 18. Benefits of Pipelining
      Online aggregation
        An early view of the result from a running computation
        Interactive data analysis (you say when to stop)
      Stream processing
        Tasks operate on infinite data streams
        Real-time data analysis
      Performance? Pipelining can…
        Improve CPU and I/O overlap
        Steady network traffic (fewer load spikes)
        Improve cluster utilization (reducers presort earlier)
    • 19. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
        Implementation
        Example Approximation Query
      Stream Processing
      Performance (blocking vs. pipelining)
      Future Work
    • 20. Implementation
      [Diagram: map tasks read input blocks from HDFS; reduce tasks write snapshot answers back to HDFS]
      Execute the reduce task on intermediate data
      Intermediate results published to HDFS
    • 21. Example Approximation Query
      The data:
        Wikipedia traffic statistics (1TB): webpage clicks/hour
        5066 compressed files (each file = 1 hour of click logs)
      The query:
        group by language and hour; count clicks
      The approximation:
        Final answer ~ intermediate click count * scale-up factor
        Job progress: 1.0 / fraction of input processed
        Sample fraction: total # of hours / # of hours sampled, including the fraction of the current hour
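
      A worked sketch of the scale-up arithmetic above, with made-up numbers: the running count is multiplied by a scale-up factor derived either from job progress or from the sample fraction.

      // Sketch of the estimate on this slide: scale the running count by how
      // much of the input has been seen. Numbers below are made up.
      public class ClickEstimate {
          public static void main(String[] args) {
              long intermediateClicks = 1_200_000;   // clicks counted so far for one (language, hour) group

              // Scale-up based on job progress: 1.0 / fraction of input processed
              double fractionOfInputProcessed = 0.02;            // e.g., roughly 100 of 5066 files read
              double byProgress = intermediateClicks / fractionOfInputProcessed;

              // Scale-up based on sample fraction: total # of hours / # of hours sampled
              double totalHours = 5066, hoursSampled = 100;
              double bySample = intermediateClicks * (totalHours / hoursSampled);

              System.out.printf("estimate (job progress):    %.0f%n", byProgress);
              System.out.printf("estimate (sample fraction): %.0f%n", bySample);
          }
      }
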
    • 22. [Figure] Bar graph shows results for a single hour (1600)
      Taken less than 2 minutes into a ~2 hour job!
    • 23. [Figure] Approximation error: |estimate – actual| / actual
      Job progress assumes hours are uniformly sampled
      Sample fraction tracks the sampled distribution of each hour more closely
    • 24. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
        Implementation
        Use case: real-time monitoring system
      Performance (blocking vs. pipelining)
      Future Work
    • 25. Implementation
      Map and reduce tasks run continuously
        Challenge: what if the number of tasks exceeds slot capacity?
        Current approach: wait for the required slot capacity
      Map tasks stream spill files
        Input taken from an arbitrary source (MR job, socket, etc.)
        Garbage collection handled by the system
      Window management done at the reducer
      Reduce function arguments
        Input data: the set of current input records
        OutputCollector: output records for the current window
        InputCollector: records for subsequent windows
        Return value says when to call next (e.g., in X milliseconds, after receiving Y records, etc.)
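
      A hedged sketch of an interface matching the reduce-side arguments listed above; this is not HOP's actual API, only one way the description could translate into code.

      // Hedged sketch of a windowed reduce interface matching this slide's
      // description. NOT HOP's actual API; all names are illustrative.
      import java.util.List;

      interface WindowedReducer<K, V, OUT> {

          /** Tells the framework when to invoke reduce() again. */
          final class NextCall {
              final long afterMillis;    // e.g., call again in X milliseconds
              final long afterRecords;   // ...or after receiving Y records
              NextCall(long afterMillis, long afterRecords) {
                  this.afterMillis = afterMillis;
                  this.afterRecords = afterRecords;
              }
          }

          NextCall reduce(
                  K key,
                  List<V> currentWindowRecords,          // input: the set of current input records
                  Collector<OUT> outputCollector,        // output records for the current window
                  Collector<V> inputCollector);          // records carried into subsequent windows

          interface Collector<T> {
              void collect(T record);
          }
      }
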
    • 26. Real-time Monitoring System
      Use MapReduce to monitor MapReduce
        Continuous jobs that monitor cluster health
        Same scheduler for user jobs and monitoring jobs (economy of mechanism)
      Agents monitor machines
        Record statistics of interest (/proc, log files, etc.)
        Implemented as a continuous map task
      Aggregators group agent-local statistics
        High-level (rack, datacenter) analysis and correlations
        Reduce windows: 1, 5, and 15 second load averages
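
      A sketch of such an agent as a continuous map loop sampling /proc/vmstat and emitting swap-related counters for the aggregators; the record format, sampling period, and emit() target are assumptions.

      // Sketch of a monitoring agent as a continuous map loop: sample
      // /proc/vmstat and emit (host, counter) records.
      import java.net.InetAddress;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.List;

      public class VmstatAgent {
          public static void main(String[] args) throws Exception {
              String host = InetAddress.getLocalHost().getHostName();
              while (true) {                                   // continuous map task: never "completes"
                  List<String> lines = Files.readAllLines(Paths.get("/proc/vmstat"));
                  long now = System.currentTimeMillis();
                  for (String line : lines) {
                      String[] parts = line.split("\\s+");     // e.g., "pswpin 12345"
                      if (parts[0].startsWith("pswp")) {       // swap-in / swap-out counters
                          emit(host, now, parts[0], Long.parseLong(parts[1]));
                      }
                  }
                  Thread.sleep(1000);                          // assumed 1 s sampling period
              }
          }

          // Stand-in for pipelining the record to the aggregating reducer.
          static void emit(String host, long ts, String counter, long value) {
              System.out.printf("%s\t%d\t%s\t%d%n", host, ts, counter, value);
          }
      }
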
    • 27. Monitor /proc/vmstat for swapping
      Alert triggered after some threshold
      Alert reported around a second after passing the threshold
        Faster than the (~5 second) TaskTracker reporting interval
      Feedback loop to the JobTracker for better scheduling
    • Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
      Performance (blocking vs. pipelining)
        Not a comprehensive study, but obvious cases exist
      Future Work
    • 28. Performance
      Open problem! A lot of performance-related work still remains
      Focus on the obvious cases first
      Why block?
        Effective combiner
        Reduce step is a bottleneck
      Why pipeline?
        Improve cluster utilization
        Smooth out network traffic
    • 29. Blocking vs. Pipelining
      [Figure annotations: 2 maps sort and send output to the reducer; 3rd map finishes, sorts, and sends; final map finishes, sorts, and sends]
      Simple wordcount on two (small) EC2 nodes
        Map machine: 2 map slots
        Reduce machine: 2 reduce slots
        Input: 2GB of data, 512MB block size
        So the job contains 4 maps and (a hard-coded) 2 reduces
    • 30. Blocking vs. Pipelining
      [Figure annotations: reduce task 6.5 minute idle period in the blocking run (~15 minutes total); no significant idle periods during the shuffle phase in the pipelined run (~9 minutes total); map outputs received one at a time; job completes when the reduce task finishes its final merge-sort]
      Simple wordcount on two (small) EC2 nodes
        Map machine: 2 map slots
        Reduce machine: 2 reduce slots
        Input: 2GB of data, 512MB block size
        So the job contains 4 maps and (a hard-coded) 2 reduces
    • 31. Recall in blocking…
      Operators block
        Poor CPU and I/O overlap
        Reduce task idle periods
      Only the final answer is fetched
        So more data is fetched, resulting in…
        Network traffic spikes, especially when a group of maps finish
    • 32. CPU Utilization (Amazon CloudWatch)
      [Figure annotations: mapper CPU busy while map tasks load 2GB of data; pipelining reduce tasks start working (presorting) early; the blocking run shows the 6.5 minute reduce idle period]
    • 33. Recall in blocking…
      Operators block
        Poor CPU and I/O overlap
        Reduce task idle periods
      Only the final answer is fetched
        So more data is fetched at once, resulting in…
        Network traffic spikes, especially when a group of maps finish
    • 34. Network Traffic Spikes (map machine network out, Amazon CloudWatch)
      [Figure annotations: first 2 maps finish and send output; 3rd map finishes and sends output; last map finishes and sends output]
    • 35. A more realistic setup
      20 extra-large nodes: 4 cores, 15GB RAM
      Slot capacity: 80 maps (4 per node), 60 reduces (3 per node)
      Evaluate using a wordcount job
        Input: 100GB of randomly generated words
        Block size: 512MB (240 maps) and 32MB (3120 maps)
        Hard-coded 60 reduce tasks
    • 36. Adaptive vs. Fixed Block Size
      [Figure: job completion times, ~49 minutes vs. ~36 minutes]
      512MB block size, 240 maps
        Maps scheduled in 3 waves (1st 80, 2nd 80, last 80)
      Large block size creates reduce idle periods
        Poor CPU and I/O overlap
      Pipelining minimizes idle periods
    • 37. Adaptive vs. Fixed Block Size
      [Figure: job completion times, ~42 minutes vs. ~35 minutes]
      32MB block size, 3120 maps
        Maps scheduled in 39 waves; maps finish fast!
      Small block size improves CPU and I/O overlap
        BUT increases scheduling overhead, so not a scalable solution
      The adaptive policy finds the right degree of pipelined parallelism
        Based on runtime dynamics (load, network capacity, etc.)
    • 38. Future Work
      Blocking vs. Pipelining
        Thorough performance analysis at scale
        Hadoop optimizer
      Online Aggregation
        Statistically robust estimation
        Random sampling of the input
        Better UI for approximate results
      Stream Processing
        Develop a full-fledged stream processing framework
        Stream support for high-level query languages
    • 39. Thank you! Questions?
      MapReduce Online paper: in NSDI 2010
      Source code: http://code.google.com/p/hop/
      Contact: tcondie@cs.berkeley.edu
