Block Sampling:
Efficient Accurate Online
Aggregation in MapReduce
5th IEEE International Conference on Cloud Computing
Technology and Science (CloudCom 2013)
Vasiliki Kalavri, Vaidas Brundza, Vladimir Vlassov
{kalavri, vaidas, vladv}@kth.se
3 December 2013, Bristol, UK
Problem and Motivation
Big data processing is usually very time-consuming...
… but many applications require results
very quickly or can only use results for a
limited window of time
Luckily, in many cases results can be
useful even before job completion
○ tolerate some inaccuracy
○ benefit from faster answers
MapReduce vs. MapReduce Online
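[Figure: original MapReduce. The mapper applies the map function to each input record and writes output records to local disk; the reducer fetches them with an HTTP request.]
In original MR, a reducer task cannot
fetch the output of a map task that
hasn't committed its output to disk
[Figure: MapReduce Online. The mapper applies the map function to each input record; output records reach the reducer over TCP push/pull.]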
Online Aggregation
● Apply the reduce function to the data seen so far
● Use the % of input processed as an estimate of accuracy
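The two points above can be sketched as a running average that emits a snapshot each time another fraction of the input has been consumed. This is a minimal illustration, not the MapReduce Online implementation; the function name `online_average` and the parameters `total` and `snapshot_every` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def online_average(values: Iterable[float], total: int,
                   snapshot_every: float = 0.25) -> Iterator[Tuple[float, float]]:
    """Yield (fraction_processed, running_mean) at each snapshot point.

    Hypothetical helper: `total` is the known number of input records and
    `snapshot_every` is the fraction of input between two early estimates.
    """
    count = 0
    running_sum = 0.0
    next_snapshot = snapshot_every
    for value in values:
        count += 1
        running_sum += value
        progress = count / total
        # Apply the "reduce" (here: a mean) to the data seen so far and
        # report it together with the fraction of input processed.
        if progress >= next_snapshot - 1e-12:
            yield progress, running_sum / count
            next_snapshot += snapshot_every


temperatures = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
snapshots = list(online_average(temperatures, total=len(temperatures)))
for progress, estimate in snapshots:
    print(f"{progress:.0%} of input processed: average so far = {estimate:.1f}")
```

Because this toy input is sorted in ascending order, every early snapshot underestimates the final average of 17.0, which is the kind of bias that sampling is meant to reduce.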
Sampling Challenges
● Data in HDFS
○ Disk access is already terribly slow
○ Random disk access for sampling is even slower
● Unstructured Data
○ Sample based on what?
○ We don’t know the query, we don’t know the
key or the value!
The Block Sampling Technique
MapReduce Online vs. Block Sampling
[Figure: Average temperature estimation on weather data; panels: Unsorted, Sorted]
Takeaway
● Useful results even before job completion
● Disk random access is prohibitively
expensive → efficiently emulate sampling
using in-memory shuffling
● Higher sampling rate improves accuracy but
also increases communication costs among
mapper tasks
Average Temperature Estimation on
Sorted and Unsorted Weather Data
How do the block sampling rate and the % of processed input
affect accuracy?
[Figure: panels: Unsorted, Sorted]
Performance - Sampling Rate
Performance - Bias Reduction (snapshot frequency = 10%)
Experimental Setup
● 8 large-instance OpenStack VMs
○ 4 vCPUs, 8 GB memory, 90 GB disk
● Ubuntu Linux 12.04.2 LTS, Java 1.7.0_14
● up to 17 map tasks and 5 reduce tasks per job, HDFS
block size of 64 MB
● weather station data from the National Climatic
Data Center FTP server (available years 1901 to 2013)
● the complete Project Gutenberg e-books catalog
(30615 e-books in .txt format)
System Configuration Parameters
Bias Reduction
● Access Phase: Store the entire input split
in the reader task’s local memory
● Shuffling Phase: Shuffle the records of
the block in-place
● Processing Phase: Serve a record to the
mapper task from local memory (avoids
additional disk I/O)
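The three phases above can be sketched as a small record reader. This is a minimal illustration under stated assumptions, not the authors' Hadoop code: the function name `serve_shuffled_records` and the `seed` parameter are hypothetical, and the input split is modelled as a plain iterator of text records.

```python
import random
from typing import Iterator, List, Optional


def serve_shuffled_records(split_records: Iterator[str],
                           seed: Optional[int] = None) -> Iterator[str]:
    """Hypothetical reader that shuffles one input split in memory."""
    # Access phase: read the entire input split into local memory
    # with one sequential pass.
    records: List[str] = list(split_records)

    # Shuffling phase: in-place Fisher-Yates shuffle of the records.
    rng = random.Random(seed)
    for i in range(len(records) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        records[i], records[j] = records[j], records[i]

    # Processing phase: serve records to the mapper from memory,
    # so no additional disk I/O is needed.
    for record in records:
        yield record


split = [f"record-{i}" for i in range(10)]
served = list(serve_shuffled_records(iter(split), seed=42))
print(served)
```

Every record is served exactly once; only the order changes, so the final result of the job is unaffected while early snapshots see a less ordered view of the split.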
Future Work
● Integrate statistical estimators
○ provide error bounds for users
● Automatically fine-tune sampling
parameters based on system
configuration
● Explore alternative sampling techniques
and wavelet-approximation
