Map Reduce Online

MapReduce Online Tyson Condie UC Berkeley Joint work with Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley) KhaledElmeleegy and Russell Sears (Yahoo! Research)

MapReduce Programming Model Think data-centric Apply a two step transformation to data sets Map step: Map(k1, v1) -> list(k2, v2) Apply map function to input records Divide output records into groups Reduce step: Reduce(k2, list(v2)) -> list(v3) Consolidate groups from the map step Apply reduce function to each group

MapReduce System Model Shared-nothing architecture Tuned for massive data parallelism Many maps operate on portions of the input Many reduces, each assigned specific groups ,[object Object],Runtimes range in minutes tohours Execute on 10s to 1000s of machines Failures common (fault tolerance crucial) Fault tolerance via operator restart since … Operators complete before producing any output Atomic data exchange between operators Simple, elegant, easy to remember

Life Beyond Batch MapReduce often used for analytics on streams of data that arrive continuously Click streams, network traffic, web crawl data, … Batch approach: buffer, load, process High latency Not scalable Online approach: run MR jobs continuously Analyze data as it arrives

Online Query Processing Two domains of interest (at massive scale): Online aggregation Interactive data analysis (watch answer evolve) Stream processing Continuous (real-time) analysis of data streams Blocking operators are a poor fit Final answers only No infinite streams Operators need to pipeline BUT we must retain fault tolerance AND Keep It Simple Stupid!

A Brave New MapReduce World Pipelined MapReduce Maps can operate on infinite data (Stream processing) Reduces can export early answers (Online aggregation) Hadoop Online Prototype (HOP) Hadoop with pipelining support Preserves Hadoop interfaces and APIs Pipelining fault tolerance model

Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Stream Processing Performance (blocking vs. pipelining) Future Work

Hadoop Architecture HadoopMapReduce Single master node (JobTracker), many worker nodes (TaskTrackers) Client submits a job to the JobTracker JobTracker splits each job into tasks (map/reduce) Assigns tasks to TaskTrackers on demand Hadoop Distributed File System (HDFS) Single name node, many data nodes Data is stored as fixed-size (e.g., 64MB) blocks HDFS typically holds map input and reduce output

Hadoop Job Execution Submit job map reduce schedule map reduce

Hadoop Job Execution Read Input File map reduce HDFS Block 1 Block 2 map reduce

Hadoop Job Execution Finished Map Locations Local FS map reduce Local FS map reduce Map output: sorted by group id and key group id = hash(key) mod # reducers

Hadoop Job Execution Local FS map reduce HTTP GET Local FS map reduce

Hadoop Job Execution reduce Write Final Answer HDFS reduce Input: sorted runs of records assigned the same group id Process: merge-sort runs, for each final group call reduce

Hadoop Online Prototype (HOP) Pipelining between operators Data pushed from producers to consumers Producers schedule data transfer concurrently with operator computation Consumers notify producers early (ASAP) HOP API ,[object Object],Pig, Hive, Jaqlstill work ,[object Object]

JobTracker accepts a series of jobs,[object Object]

Pipelining Data Unit Initial design: pipeline eagerly (each record) Prevents map side preaggregation (a.k.a., combiner) Moves all the sorting work to the reduce step Map computation can block on network I/O Revised design: pipeline small sorted runs (spills) Task thread: apply (map/reduce) function, buffer output Spill thread: sort & combine buffer, spill to a file TaskTracker: sends spill files to consumers Simple adaptive algorithm Halt pipeline when 1. spill files backup OR 2. effective combiner Resume pipeline by first merging & combining accumulated spill files into a single file

Pipelined Fault Tolerance (PFT) Simple PFT design: Reduce treats in-progress map output as tentative If map dies then throw away output If map succeeds then accept output Revised PFT design: Spill files have deterministic boundaries and are assigned a sequence number Correctness: Reduce tasks ensure spill files are idempotent Optimization: Map tasks avoid sending redundant spill files

Benefits of Pipelining Online aggregation An early view of the result from a running computation Interactive data analysis (you say when to stop) Stream processing Tasks operate on infinite data streams Real-time data analysis Performance? Pipelining can… Improve CPU and I/O overlap Steady network traffic (fewer load spikes) Improve cluster utilization (reducers presort earlier)

Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Implementation Example Approximation Query Stream Processing Performance (blocking vs. pipelining) Future Work

Implementation Read Input File map reduce Block 1 HDFS HDFS Block 2 map reduce Write Snapshot Answer ,[object Object]

Intermediate results published to HDFS,[object Object]

Bar graph shows results for a single hour (1600) Taken less than 2 minutes into a ~2 hour job!

Approximation error: |estimate – actual| / actual Job progress assumes hours are uniformly sampled Sample fraction closer to the sample distribution of each hour

Outline Hadoop MR Background Hadoop Online Prototype (HOP) Online Aggregation Stream Processing Implementation Use case: real-time monitoring system Performance (blocking vs. pipelining) Future Work

Implementation Map and reduce tasks run continuously Challenge: what if number of tasks exceed slot capacity? Current approach: wait for required slot capacity Map tasks stream spill files Input taken from arbitrary source (MR job, socket, etc.) Garbage collection handled by system Window management done at reducer Reduce function arguments Input data: the set of current input records OutputCollector: output records for the current window InputCollector: records for subsequent windows Return value says when to call next e.g., in X milliseconds, after receiving Y records, etc.

Real-time Monitoring System Use MapReduce to monitor MapReduce Continuous jobs that monitor cluster health Same scheduler for user jobs and monitoring jobs Economy of Mechanism Agents monitor machines Record statistics of interest (/proc, log files, etc.) Implemented as a continuous map task Aggregators group agent-local statistics High level (rack, datacenter) analysis and correlations Reduce windows: 1, 5, and 15 second load averages

Monitor /proc/vmstat for swapping Alert triggered after some threshold Alert reported around a second after passing threshold Faster than the (~5 second) TaskTracker reporting interval ,[object Object],[object Object]

Performance Open problem! A lot of performance related work still remains Focus on obvious cases first Why block? Effective combiner Reduce step is a bottleneck Why pipeline? Improve cluster utilization Smooth out network traffic

Blocking vs. Pipelining Final map finishes, sorts and sends to reduce 3rd map finishes, sorts, and sends to reduce 2 maps sort and send output to reducer Simple wordcount on two (small) EC2 nodes Map machine: 2 map slots Reduce machine: 2 reduce slots Input 2GB data, 512MB block size So job contains 4 maps and (a hard-coded) 2 reduces

Blocking vs. Pipelining Job completion when reduce finishes Reduce task performing final merge-sort No significant idle periods during the shuffle phase 4th map output received 3rd map output received 2nd map output received 1st map output received Reduce task 6.5 minute idle period ~ 15 minutes ~ 9 minutes Simple wordcount on two (small) EC2 nodes Map machine: 2 map slots Reduce machine: 2 reduce slots Input 2GB data, 512MB block size So job contains 4 maps and (a hard-coded) 2 reduces

Recall in blocking… Operators block Poor CPU and I/O overlap Reduce task idle periods Only the final answer is fetched So more data is fetched resulting in… Network traffic spikes Especially when a group of maps finish

CPU Utilization Map tasks loading 2GB of data Mapper CPU Pipelining reduce tasks start working (presorting) early Reduce task 6.5 minute idle period Reducer CPU Amazon Cloudwatch

Recall in blocking… Operators block Poor CPU and I/O overlap Reduce task idle periods Only the final answer is fetched So more data is fetched at once resulting in… Network traffic spikes Especially when a group of maps finish

Network Traffic Spikes(map machine network out) First 2 maps finish and send output Last map finishes and sends output 3rd map finishes and sends output Amazon Cloudwatch

A more realistic setup ,[object Object],Slot capacity: 80 maps (4 per node) 60 reduces (3 per node) Evaluate using wordcount job Input: 100GB randomly generated words Block size: 512MB (240 maps) and 32MB (3120 maps) Hard coded 60 reduce tasks

Adaptive vs. Fixed Block Size Job completion time ~ 49 minutes ~ 36 minutes 512MB block size, 240 maps Maps scheduled in 3 waves (1st 80, 2nd 80, last 80) Large block size creates reduce idle periods Poor CPU and I/O overlap Pipelining minimizes idle periods

Adaptive vs. Fixed Block Size Job completion time ~ 42 minutes ~ 35 minutes 32MB block size, 3120 maps Maps scheduled in 39 waves, maps finish fast! Small block size improves CPU and I/O overlap BUT increases scheduling overhead so not a scalable solution Adaptive policy finds the right degree of pipelined parallelism Based on runtime dynamics (load, network capacity, etc.)

Future Work Blocking vs. Pipelining Thorough performance analysis at scale Hadoop optimizer Online Aggregation Statically-robust estimation Random sampling of the input Better UI for approximate results Stream Processing Develop full-fledged stream processing framework Stream support for high-level query languages

Thank you! Questions? MapReduce Online paper: In NSDI 2010 Source code: http://code.google.com/p/hop/ Contact: tcondie@cs.berkeley.edu

Map Reduce Online

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Map Reduce Online

Similar to Map Reduce Online (20)

More from Hadoop User Group

More from Hadoop User Group (20)

Map Reduce Online

Editor's Notes