MapReduce Online
Map Reduce Online presentation at the Hadoop Bay Area User Group, March 24 at Yahoo! Sunnyvale campus

Notes
  • TARGET: batch-oriented computations
  • DEFINE COMBINER!!!
  • Vertical line indicates alert
  • Data transfer scheduled by the reducer. Reduce partitions serviced in arbitrary order. Poor spatial locality!
  • Operators block. Poor CPU and I/O overlap. Reduce task idle periods.
  • Cluster capacity: 80 maps, 60 reduces
  • First plateau due to 80 maps finishing at the same time. Subsequent map waves get spread out by the scheduler duty cycle.
  • Transcript

    • 1. MapReduce Online
      Tyson Condie
      UC Berkeley
      Joint work with
      Neil Conway, Peter Alvaro, and Joseph M. Hellerstein (UC Berkeley)
      Khaled Elmeleegy and Russell Sears (Yahoo! Research)
    • 2. MapReduce Programming Model
      Think data-centric
      Apply a two step transformation to data sets
      Map step: Map(k1, v1) -> list(k2, v2)
      Apply map function to input records
      Divide output records into groups
      Reduce step: Reduce(k2, list(v2)) -> list(v3)
      Consolidate groups from the map step
      Apply reduce function to each group
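      To ground the model above, here is the classic word count in the stock Hadoop Java API (plain Hadoop, not HOP-specific code): map emits (word, 1) pairs and reduce sums each group. Registering the reducer as a combiner (the "DEFINE COMBINER!!!" note above) would pre-aggregate map output.

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line.
        class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context ctx)
              throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
              word.set(tok.nextToken());
              ctx.write(word, ONE);
            }
          }
        }

        // Reduce(k2, list(v2)) -> list(v3): sum the counts for each word group.
        class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
          }
        }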
    • 3. MapReduce System Model
      Shared-nothing architecture
      Tuned for massive data parallelism
      Many maps operate on portions of the input
      Many reduces, each assigned specific groups
      • Batch-oriented computations over huge data sets
      Runtimes range from minutes to hours
      Execute on 10s to 1000s of machines
      Failures common (fault tolerance crucial)
      Fault tolerance via operator restart since …
      Operators complete before producing any output
      Atomic data exchange between operators
      Simple, elegant, easy to remember
    • 4. Life Beyond Batch
      MapReduce often used for analytics on streams of data that arrive continuously
      Click streams, network traffic, web crawl data, …
      Batch approach: buffer, load, process
      High latency
      Not scalable
      Online approach: run MR jobs continuously
      Analyze data as it arrives
    • 5. Online Query Processing
      Two domains of interest (at massive scale):
      Online aggregation
      Interactive data analysis (watch answer evolve)
      Stream processing
      Continuous (real-time) analysis of data streams
      Blocking operators are a poor fit
      Final answers only
      No infinite streams
      Operators need to pipeline
      BUT we must retain fault tolerance
      AND Keep It Simple Stupid!
    • 6. A Brave New MapReduce World
      Pipelined MapReduce
      Maps can operate on infinite data (Stream processing)
      Reduces can export early answers (Online aggregation)
      Hadoop Online Prototype (HOP)
      Hadoop with pipelining support
      Preserves Hadoop interfaces and APIs
      Pipelining fault tolerance model
    • 7. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
      Performance (blocking vs. pipelining)
      Future Work
    • 8. Hadoop Architecture
      Hadoop MapReduce
      Single master node (JobTracker), many worker nodes (TaskTrackers)
      Client submits a job to the JobTracker
      JobTracker splits each job into tasks (map/reduce)
      Assigns tasks to TaskTrackers on demand
      Hadoop Distributed File System (HDFS)
      Single name node, many data nodes
      Data is stored as fixed-size (e.g., 64MB) blocks
      HDFS typically holds map input and reduce output
    • 9. Hadoop Job Execution
      Submit job
      map
      reduce
      schedule
      map
      reduce
    • 10. Hadoop Job Execution
      Read
      Input File
      map
      reduce
      HDFS
      Block 1
      Block 2
      map
      reduce
    • 11. Hadoop Job Execution
      Finished
      Map Locations
      Local FS
      map
      reduce
      Local FS
      map
      reduce
      Map output: sorted by group id and key
      group id = hash(key) mod # reducers
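      The group id rule above is what Hadoop's default HashPartitioner computes; the equivalent code is shown below (the sign bit is masked off so the modulus is always non-negative):

        import org.apache.hadoop.mapreduce.Partitioner;

        // Mirrors Hadoop's default HashPartitioner:
        // group id = hash(key) mod # reducers.
        public class GroupIdPartitioner<K, V> extends Partitioner<K, V> {
          @Override
          public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the result is a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
          }
        }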
    • 12. Hadoop Job Execution
      Local FS
      map
      reduce
      HTTP GET
      Local FS
      map
      reduce
    • 13. Hadoop Job Execution
      reduce
      Write Final Answer
      HDFS
      reduce
      Input: sorted runs of records assigned the same group id
      Process: merge-sort runs, for each final group call reduce
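      The merge-sort-then-reduce step above can be sketched with a priority queue over the sorted runs. This is a simplified stand-alone illustration with String keys, not Hadoop's actual merge code:

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.Iterator;
        import java.util.List;
        import java.util.PriorityQueue;

        class RunMerger {
          // One cursor per sorted run.
          private static final class Cursor {
            final Iterator<String> it;
            String head;
            Cursor(Iterator<String> it) { this.it = it; head = it.next(); }
            void advance() { head = it.hasNext() ? it.next() : null; }
          }

          /** Merge sorted runs; call "reduce" (here: println) once per key group. */
          static void mergeAndReduce(List<List<String>> runs) {
            PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
            for (List<String> run : runs)
              if (!run.isEmpty()) heap.add(new Cursor(run.iterator()));

            while (!heap.isEmpty()) {
              String key = heap.peek().head;
              List<String> group = new ArrayList<>();
              // Drain every record equal to the current smallest key.
              while (!heap.isEmpty() && heap.peek().head.equals(key)) {
                Cursor c = heap.poll();
                group.add(c.head);
                c.advance();
                if (c.head != null) heap.add(c);
              }
              System.out.println(key + " -> " + group.size() + " records"); // reduce()
            }
          }
        }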
    • 14. Hadoop Online Prototype (HOP)
      Pipelining between operators
      Data pushed from producers to consumers
      Producers schedule data transfer concurrently with operator computation
      Consumers notify producers early (ASAP)
      HOP API
      • No changes required to existing clients
      Pig, Hive, Jaql still work
      • Configuration for pipeline/block modes
    • 15. HOP Job Pipelined Execution
      JobTracker accepts a series of jobs
      Schedule
      Schedule + Map Locations
      map
      reduce
      Pipeline request
      map
      reduce
    • 16. Pipelining Data Unit
      Initial design: pipeline eagerly (each record)
      Prevents map side preaggregation (a.k.a., combiner)
      Moves all the sorting work to the reduce step
      Map computation can block on network I/O
      Revised design: pipeline small sorted runs (spills)
      Task thread: apply (map/reduce) function, buffer output
      Spill thread: sort & combine buffer, spill to a file
      TaskTracker: sends spill files to consumers
      Simple adaptive algorithm
      Halt pipeline when
      1. spill files back up, OR 2. the combiner is effective
      Resume pipeline by first merging & combining accumulated spill files into a single file
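      A hedged sketch of this adaptive policy, with invented names and an assumed backlog threshold (not HOP's actual source): ship each sorted spill while the consumer keeps up; once spills back up, merge and combine the backlog into one run before resuming.

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.List;

        class AdaptiveSpillPipeline {
          private static final int BACKLOG_THRESHOLD = 3; // assumed tuning knob
          private final Deque<String> pendingSpills = new ArrayDeque<>();

          /** Spill thread: called after each sorted, combined run is written. */
          synchronized void onSpill(String spillFile) {
            if (pendingSpills.size() >= BACKLOG_THRESHOLD) {
              // Spills are backing up: halt per-spill shipping and fold the
              // backlog (plus this spill) into one merged, combined run.
              List<String> backlog = new ArrayList<>(pendingSpills);
              backlog.add(spillFile);
              pendingSpills.clear();
              pendingSpills.add(mergeAndCombine(backlog));
            } else {
              pendingSpills.add(spillFile); // pipeline continues
            }
          }

          /** TaskTracker thread: next run to ship to reduce tasks, or null. */
          synchronized String nextSpillToSend() {
            return pendingSpills.poll();
          }

          private String mergeAndCombine(List<String> runs) {
            // Placeholder: a real implementation would merge-sort the runs,
            // apply the combiner, and return the merged file's path.
            return "merged(" + String.join(",", runs) + ")";
          }
        }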
    • 17. Pipelined Fault Tolerance (PFT)
      Simple PFT design:
      Reduce treats in-progress map output as tentative
      If map dies then throw away output
      If map succeeds then accept output
      Revised PFT design:
      Spill files have deterministic boundaries and are assigned a sequence number
      Correctness: Reduce tasks ensure spill files are idempotent
      Optimization: Map tasks avoid sending redundant spill files
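      The revised design hinges on spill files having a stable identity. A minimal sketch of the reduce-side bookkeeping, assuming a (map task id, spill sequence number) pair as that identity; names are illustrative, not HOP's actual code:

        import java.util.HashSet;
        import java.util.Set;

        class SpillReceiver {
          private final Set<String> applied = new HashSet<>();

          /** Returns true if this spill should be merged into the reduce input. */
          boolean accept(String mapTaskId, long spillSeq) {
            // Deterministic spill boundaries make (task, seq) a stable
            // identity, so a re-sent spill after a failure is ignored:
            // applying each spill at most once makes delivery idempotent.
            return applied.add(mapTaskId + ":" + spillSeq);
          }
        }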
    • 18. Benefits of Pipelining
      Online aggregation
      An early view of the result from a running computation
      Interactive data analysis (you say when to stop)
      Stream processing
      Tasks operate on infinite data streams
      Real-time data analysis
      Performance? Pipelining can…
      Improve CPU and I/O overlap
      Steady network traffic (fewer load spikes)
      Improve cluster utilization (reducers presort earlier)
    • 19. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Implementation
      Example Approximation Query
      Stream Processing
      Performance (blocking vs. pipelining)
      Future Work
    • 20. Implementation
      Read
      Input File
      map
      reduce
      Block 1
      HDFS
      HDFS
      Block 2
      map
      reduce
      Write Snapshot
      Answer
      • Execute reduce task on intermediate data
      • Intermediate results published to HDFS
    • 21. Example Approximation Query
      The data:
      Wikipedia traffic statistics (1TB)
      Webpage clicks/hour
      5066 compressed files (each file = 1 hour click logs)
      The query:
      group by language and hour
      count clicks
      The approximation:
      Final answer ≈ intermediate click count × scale-up factor
      Scale-up via job progress: 1.0 / fraction of input processed
      Scale-up via sample fraction: total # of hours / # of hours sampled
      (each file = 1 hour of click logs; a partially read file counts as a fraction of an hour)
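      To make the scale-up arithmetic concrete, a tiny worked example in Java; the click count and the number of sampled hours are assumed figures for illustration, not numbers from the talk:

        public class ScaleUpExample {
          public static void main(String[] args) {
            // Assumed running values for one (language, hour) group.
            double intermediateClicks = 2.1e6;   // clicks counted so far
            double hoursSampled = 127.0;         // hours read so far (assumed)
            double totalHours = 5066.0;          // slide 21: 5066 hourly files

            // Sample-fraction estimator: scale-up = total hours / hours sampled.
            double scaleUp = totalHours / hoursSampled;       // ~39.9
            double estimate = intermediateClicks * scaleUp;   // ~8.4e7 clicks
            System.out.printf("estimated clicks = %.3g%n", estimate);
          }
        }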
    • 22. Bar graph shows results for a single hour (1600)
      Taken less than 2 minutes into a ~2 hour job!
    • 23. Approximation error: |estimate – actual| / actual
      Job progress assumes hours are uniformly sampled
      Sample fraction tracks the actual sampling distribution of each hour more closely
    • 24. Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
      Implementation
      Use case: real-time monitoring system
      Performance (blocking vs. pipelining)
      Future Work
    • 25. Implementation
      Map and reduce tasks run continuously
      Challenge: what if the number of tasks exceeds slot capacity?
      Current approach: wait for required slot capacity
      Map tasks stream spill files
      Input taken from arbitrary source (MR job, socket, etc.)
      Garbage collection handled by system
      Window management done at reducer
      Reduce function arguments
      Input data: the set of current input records
      OutputCollector: output records for the current window
      InputCollector: records for subsequent windows
      Return value says when to call next
      e.g., in X milliseconds, after receiving Y records, etc.
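      The windowed-reduce bullets above suggest an interface shape like the sketch below. This is inferred from the slide only; HOP's actual interfaces may differ, and every identifier here is illustrative.

        import java.util.List;

        // Emits results for the current window.
        interface OutputCollector<K, V> { void collect(K key, V value); }
        // Carries records forward to subsequent windows.
        interface InputCollector<K, V>  { void collect(K key, V value); }

        /** "Call reduce again after this many ms, or after this many records." */
        final class NextCall {
          final long afterMillis;
          final long afterRecords;
          NextCall(long afterMillis, long afterRecords) {
            this.afterMillis = afterMillis;
            this.afterRecords = afterRecords;
          }
        }

        interface WindowedReducer<K, V, K2, V2> {
          // Window management happens at the reducer: emit results for the
          // current window via out, carry records forward via in, and return
          // a hint for when the framework should call reduce next.
          NextCall reduce(K key, List<V> currentRecords,
                          OutputCollector<K2, V2> out, InputCollector<K, V> in);
        }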
    • 26. Real-time Monitoring System
      Use MapReduce to monitor MapReduce
      Continuous jobs that monitor cluster health
      Same scheduler for user jobs and monitoring jobs
      Economy of Mechanism
      Agents monitor machines
      Record statistics of interest (/proc, log files, etc.)
      Implemented as a continuous map task
      Aggregators group agent-local statistics
      High level (rack, datacenter) analysis and correlations
      Reduce windows: 1, 5, and 15 second load averages
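      The agents above are continuous map tasks that sample machine statistics. As a standalone illustration (not HOP code), the loop below tails one /proc/vmstat counter and emits the (host, statistic) records such an agent would feed into the pipelined shuffle:

        import java.net.InetAddress;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class VmstatAgent {
          public static void main(String[] args) throws Exception {
            String host = InetAddress.getLocalHost().getHostName();
            while (true) {
              // pswpout = pages swapped out; rising values signal swapping.
              for (String line : Files.readAllLines(Paths.get("/proc/vmstat"))) {
                if (line.startsWith("pswpout")) {
                  // In HOP this record would flow through the pipelined shuffle
                  // to aggregators computing 1/5/15-second windowed averages.
                  System.out.println(host + "\t"
                      + System.currentTimeMillis() + "\t" + line);
                }
              }
              Thread.sleep(1000); // sample once per second
            }
          }
        }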
    • 27. Monitor /proc/vmstat for swapping
      Alert triggered after some threshold
      Alert reported around a second after passing threshold
      Faster than the (~5 second) TaskTracker reporting interval
      • Feedback loop to the JobTracker for better scheduling
    • Outline
      Hadoop MR Background
      Hadoop Online Prototype (HOP)
      Online Aggregation
      Stream Processing
      Performance (blocking vs. pipelining)
      Not a comprehensive study
      But obvious cases exist
      Future Work
    • 28. Performance
      Open problem!
      A lot of performance related work still remains
      Focus on obvious cases first
      Why block?
      Effective combiner
      Reduce step is a bottleneck
      Why pipeline?
      Improve cluster utilization
      Smooth out network traffic
    • 29. Blocking vs. Pipelining
      Final map finishes, sorts and sends to reduce
      3rd map finishes, sorts, and sends to reduce
      2 maps sort and send output to reducer
      Simple wordcount on two (small) EC2 nodes
      Map machine: 2 map slots
      Reduce machine: 2 reduce slots
      Input 2GB data, 512MB block size
      So the job contains 4 maps and 2 (hard-coded) reduces
    • 30. Blocking vs. Pipelining
      Job completion when reduce finishes
      Reduce task performing final merge-sort
      No significant idle periods during the shuffle phase
      4th map output received
      3rd map output received
      2nd map output received
      1st map output received
      Reduce task: 6.5 minute idle period under blocking
      Job completion time: ~15 minutes (blocking) vs. ~9 minutes (pipelining)
      Simple wordcount on two (small) EC2 nodes
      Map machine: 2 map slots
      Reduce machine: 2 reduce slots
      Input 2GB data, 512MB block size
      So the job contains 4 maps and 2 (hard-coded) reduces
    • 31. Recall in blocking…
      Operators block
      Poor CPU and I/O overlap
      Reduce task idle periods
      Only the final answer is fetched
      So more data is fetched resulting in…
      Network traffic spikes
      Especially when a group of maps finish
    • 32. CPU Utilization
      Map tasks loading 2GB of data
      Mapper CPU
      Pipelining reduce tasks start working (presorting) early
      Reduce task 6.5 minute idle period
      Reducer CPU
      Amazon Cloudwatch
    • 33. Recall in blocking…
      Operators block
      Poor CPU and I/O overlap
      Reduce task idle periods
      Only the final answer is fetched
      So more data is fetched at once resulting in…
      Network traffic spikes
      Especially when a group of maps finish
    • 34. Network Traffic Spikes (map machine network out)
      First 2 maps finish and send output
      Last map finishes and sends output
      3rd map finishes and sends output
      Amazon Cloudwatch
    • 35. A more realistic setup
      • 20 extra-large nodes: 4 cores, 15GB RAM
      Slot capacity:
      80 maps (4 per node)
      60 reduces (3 per node)
      Evaluate using wordcount job
      Input: 100GB randomly generated words
      Block size: 512MB (240 maps) and 32MB (3120 maps)
      Hard coded 60 reduce tasks
    • 36. Adaptive vs. Fixed Block Size
      Job completion time: ~49 minutes (blocking) vs. ~36 minutes (pipelining)
      512MB block size, 240 maps
      Maps scheduled in 3 waves (1st 80, 2nd 80, last 80)
      Large block size creates reduce idle periods
      Poor CPU and I/O overlap
      Pipelining minimizes idle periods
    • 37. Adaptive vs. Fixed Block Size
      Job completion time: ~42 minutes (blocking) vs. ~35 minutes (pipelining)
      32MB block size, 3120 maps
      Maps scheduled in 39 waves, maps finish fast!
      Small block size improves CPU and I/O overlap
      BUT increases scheduling overhead so not a scalable solution
      Adaptive policy finds the right degree of pipelined parallelism
      Based on runtime dynamics (load, network capacity, etc.)
    • 38. Future Work
      Blocking vs. Pipelining
      Thorough performance analysis at scale
      Hadoop optimizer
      Online Aggregation
      Statistically-robust estimation
      Random sampling of the input
      Better UI for approximate results
      Stream Processing
      Develop full-fledged stream processing framework
      Stream support for high-level query languages
    • 39. Thank you!
      Questions?
      MapReduce Online paper: In NSDI 2010
      Source code: http://code.google.com/p/hop/
      Contact: tcondie@cs.berkeley.edu
