Block Sampling: Efficient Accurate Online Aggregation in MapReduce

Block Sampling:
Efficient Accurate Online
Aggregation in MapReduce
5th IEEE International Conference on Cloud Computing
Technology and Science (CloudCom 2013)
Vasiliki Kalavri, Vaidas Brundza, Vladimir Vlassov
{kalavri, vaidas, vladv}@kth.se
3 December 2013, Bristol, UK

Problem and Motivation
Luckily, in many cases results can be
useful even before job completion
○ tolerate some inaccuracy
○ benefit from faster answers
2
Big data processing is usually very time-
consuming...
… but many applications require results
really fast or can only use results for a
limited window of time

MapReduce vs. MapReduce Online
mapper
reducer
Local
Disk
Input
Record map
function
Output
Record
HTTP request
In original MR, a reducer task cannot
fetch the output of a map task which
hasn't committed its output to disk
mapper
reducer
Input
Record map
function
Output
Record
TCP- push/pull
3

Online Aggregation
● Apply the reduce function to the data seen so far
● % input processed to estimate accuracy
4

Sampling Challenges
● Data in HDFS
○ Disk already access is terribly slow
○ Random disk access for sampling is even slower
● Unstructured Data
○ Sample based on what?
○ We don’t know the query, we don’t know the
key or the value!
5

The Block Sampling Technique
6

MapReduce Online vs. Block Sampling
Average Temperature Estimation on Weather Data
Unsorted Sorted
7

Takeaway
8
● Useful results even before job completion
● Disk random access is prohibitively
expensive → efficiently emulate sampling
using in-memory shuffling
● Higher sampling rate improves accuracy but
also increases communication costs among
mapper tasks

Average Temperature Estimation on
Sorted and Unsorted Weather Data
Unsorted Sorted
6
How do the block sampling rate and the % of processed input
affect accuracy?

Performance - Bias Reduction
snapshot freq = 10%

Experimental Setup
● 8 large-instance OpenStack VMs
○ 4 vCPUs, 8 GB memory, 90 GB disk
● Linux Ubuntu 12.04.2 LTS OSm Java 1.7.0 14
● up to 17 map tasks and 5 reduce tasks per job, HDFS
block size of 64MB
● weather station data from the National Climatic
Data Center ftp server (available years 1901 to 2013)
● the complete Project Gutenberg e-books catalog
(30615 e-books in .txt format)

System Configuration Parameters

Bias Reduction
● Access Phase: Store the entire input split
in the reader task’s local memory
● Shuffling Phase: Shuffle the records of
the block in-place
● Processing Phase: Serve a record to the
mapper task from local memory (avoids
additional disk I/O)

Future Work
● Integrate statistical estimators
○ provide error bounds for users
● Automatically fine-tune sampling
parameters based on system
configuration
● Explore alternative sampling techniques
and wavelet-approximation

Block Sampling: Efficient Accurate Online Aggregation in MapReduce

More Related Content

What's hot

Viewers also liked

Recently uploaded

Block Sampling: Efficient Accurate Online Aggregation in MapReduce