1. Profiling the Network Performance
of Hadoop Jobs
Team: Pramod Biligiri & Sayed Asad Ali
2. Talk Outline
Introduction to the problem
What is Hadoop?
Hadoop’s MapReduce Framework
Shuffle as a Bottleneck
Experimental Setup
Choice of Benchmarks
Terasort Discussion
Ranked Inverted Index Discussion
Summary and Future Work
3. Introduction to the problem
Goal: reproduce existing results showing that the network is the
bottleneck in shuffle-intensive Hadoop jobs.
4. What is Hadoop?
A framework for distributed processing of large data sets across
clusters of computers using simple programming models based on
Google’s MapReduce.
Distinct Features:
● Designed for Commodity Hardware
● Highly Fault-tolerant
● Horizontally Scalable
● Push computation to data
5. MapReduce
● MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster
● Programming Model
○ For each input record, generate (key, value)
○ Apply reduce operation for all values corresponding to the
same key
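The programming model above can be sketched as a minimal in-memory word count. This is an illustration of the model, not Hadoop's actual implementation; all names here are made up for the example.

```python
from collections import defaultdict

def map_fn(record):
    # For each input record, emit (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Apply the reduce operation to all values for the same key.
    return (key, sum(values))

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # grouping by key: the "shuffle"
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_mapreduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster the grouping step runs across machines, which is exactly the network transfer this talk profiles.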
6. Hadoop’s MapReduce Framework
1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
9. Shuffle as a Bottleneck?
“On average, the shuffle phase accounts for 33% of the running
time in these jobs. In addition, in 26% of the jobs with reduce tasks,
shuffles account for more than 50% of the running time, and in 16% of
jobs, they account for more than 70% of the running time. This
confirms widely reported results that the network is a bottleneck
in MapReduce”
Source: Managing Data Transfers in Computer Clusters with Orchestra, Mosharaf Chowdhury et al.
11. Experimental Setups
● Config 1: m1.large (7.5 GB RAM, 64-bit, 4 Elastic Compute Units, 2 x 420 GB disk, Moderate network performance)
● Config 2: m1.xlarge (15 GB RAM, 64-bit, 8 Elastic Compute Units, 4 x 420 GB disk, High network performance)
● SDSC: custom node (8 GB RAM, 64-bit Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores, 2 x 1.5 TB disk, 1 Gb/s network)
12. Network Performance of EMR
Conflicting values!
● Source 1: AppNeta pathtest measured an average of 753 Mb/s
(http://www.appneta.com/resources/pathtest-download.html)
● Source 2: “The available bandwidth is still 1 Gb/s, confirming
anecdotal evidence that EC2 has full bisection bandwidth.”
Opening Up Black Box Networks with CloudTalk, Costin Raiciu et al.
● Source 3: “The median TCP/UDP throughput of medium
instances are both close to 760 Mb/s.”
The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, Guohui Wang et al.
13. Why Terasort?
● Popular benchmark for Hadoop
● Shipped with most Hadoop distributions
● Exercises all aspects of the cluster: CPU, network, disk, and memory
● Large amount of data to shuffle (240 GB)
● Representative of real-world workloads
“This data shuffle pattern arises in large scale sorts, merges and join
operations in the data center. We chose this test because, in our
interactions with application developers, we learned that many use such
operations with caution, because the operations are highly expensive in
today’s data center network.”
Source: VL2: A Scalable and Flexible Data Center Network, A. Greenberg et al.
14. Terasort - How it works:
● Sorts 1 terabyte of data.
● Each data item is 100 bytes in size.
● The first 10 bytes of a data item constitute its sort key.
● Format of input data:
○ <key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
■ key: random characters from ASCII 32-126
■ rowid: an integer
■ filler: random characters from the set A-Z
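Following the record layout above, a record can be split into its fields like this. A minimal sketch assuming exactly the format stated on the slide (10-byte key, 10-byte rowid, 78-byte filler, trailing "\r\n", 100 bytes in all); the sample record contents are made up.

```python
def parse_record(rec: bytes):
    # A Terasort data item is 100 bytes: 98 bytes of payload + b"\r\n".
    assert len(rec) == 100 and rec.endswith(b"\r\n")
    body = rec[:-2]
    key, rowid, filler = body[:10], body[10:20], body[20:98]
    return key, rowid, filler

# Illustrative record: the first 10 bytes are the sort key.
rec = b"ABCDEFGHIJ" + b"0000000001" + b"X" * 78 + b"\r\n"
key, rowid, filler = parse_record(rec)
print(key)  # b'ABCDEFGHIJ'
```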
15. Terasort - How it works:
Map: partition input keys into different buckets
(leveraging Hadoop’s default sorting of Map output)
Reduce: collect the outputs from the different maps
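The partitioning idea can be sketched as follows: each key goes to a reducer bucket by key range, so concatenating the sorted reducer outputs yields a globally sorted result. The cutpoints and keys here are illustrative; Hadoop's TotalOrderPartitioner picks cutpoints by sampling the input.

```python
def partition(key: bytes, cutpoints):
    # Return the index of the first range whose upper cutpoint
    # exceeds the key; keys past the last cutpoint go to the last bucket.
    for i, cut in enumerate(cutpoints):
        if key < cut:
            return i
    return len(cutpoints)

cutpoints = [b"F", b"Q"]  # 3 buckets: < F, F..Q, >= Q
keys = [b"Alpha", b"Zeta", b"Mango", b"Beta"]
buckets = {}
for k in keys:
    buckets.setdefault(partition(k, cutpoints), []).append(k)
# Hadoop sorts each bucket's map output; reading the sorted buckets
# in bucket order gives the fully sorted key sequence.
print({i: sorted(v) for i, v in sorted(buckets.items())})
```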
17. Comparison of Terasort on different configurations

Instance type                     Total Job Time (min)  Map Time (min)  Reduce Time (min)  Shuffle Average Time  Shuffle Time %
Config 1: m1.large (7.5 GB RAM)   205                   84              205                60                    29.3
SDSC: custom (8 GB RAM)           166                   60              90                 36                    21.7
Config 2: m1.xlarge (15 GB RAM)   86                    40              75                 22                    25.5
18. CDF of data transferred over the network during the lifetime of the job
[Figure: CDF of data transferred over the network during the job. Annotations mark where the Map phase ends, where the shuffle starts and ends, the sorting of Map outputs (local to the node), and the point where Reduce is nearly done; the labeled x-axis ticks 5100 and 6900 fall in this region.]
23. Why Ranked Inverted Index?
● For a given text corpus, it generates, for each word, the list of
documents containing that word in decreasing order of frequency:
word -> (count1 | file1), (count2 | file2), ...
where count1 > count2 > …
● A ranked inverted index is often used in text processing and
information retrieval tasks
● Mentioned in the Tarazu paper as a shuffle-heavy workload
Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.
24. Ranked Inverted Index - How it works:
Map input: (word | filename) -> count
Map output: word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2) ...
It involves a sort of the values on the reduce side
(Note that the Map input is the output of another MapReduce job called
sequence-count)
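The map and reduce steps above can be sketched in a few lines. The input mirrors the slide: (word, filename) -> count pairs from the earlier sequence-count job; the output is each word's postings sorted by count, descending. The sample data is made up for illustration.

```python
from collections import defaultdict

def rii_map(word, filename, count):
    # Map output: word -> (filename, count)
    yield (word, (filename, count))

def rii_reduce(word, postings):
    # Reduce-side sort of the values by count, descending.
    return (word, sorted(postings, key=lambda p: p[1], reverse=True))

pairs = [("hadoop", "a.txt", 3), ("hadoop", "b.txt", 7), ("sort", "a.txt", 1)]
groups = defaultdict(list)
for w, f, c in pairs:
    for key, val in rii_map(w, f, c):
        groups[key].append(val)  # the shuffle: group postings by word
index = dict(rii_reduce(w, ps) for w, ps in groups.items())
print(index["hadoop"])  # [('b.txt', 7), ('a.txt', 3)]
```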
25. Experimental Results of Ranked Inverted Index

Instance type                    Total Job Time (min)  Map Time (min)  Reduce Time (min)  Shuffle Average Time  Shuffle Time %
Config 1: m1.large (7.5 GB RAM)  12                    5.5             11.5               3.5                   27.14
Input Data Set : 40 GB ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2
26. CDF of data transferred over the network during the lifetime of the job
[Figure: CDF of data transferred over the network during the job. Annotations mark where the Map phase ends, where the shuffle starts and ends, the replication of results to 3 nodes, and the point where Reduce is nearly done.]
31. Summary
- Shuffle can account for a significant fraction of total job runtime
- It is worth investing in good network connectivity for a compute cluster
32. Stuff that doesn’t add up!
● Why does peak network bandwidth for Ranked Inverted Index
overshoot the 1 Gb/s mark?
● Why is the sort phase of Ranked Inverted Index so short?
33. Future Work
● How does changing the various parameters make a difference? e.g.
io.sort.mb, io.sort.factor, fs.inmemory.size.mb
● Effect of Combiners?
● Varying the number of Map tasks and Reduce tasks
● How many Map tasks are rack-local or machine-local?
● Investigate the unresolved issues above
● Work around the lack of precise information about “topology” and
“network bandwidth” for EMR clusters
36. Standard Test Results

Benchmark              Input Size  Run Time on Hadoop (min)  Shuffle Volume  Critical Path
tera-sort              300         2353                      200             Shuffle
ranked-inverted-index  205         2322                      219             Shuffle