3. Choice of Benchmarks
3.1. Terasort
3.1.1. Why Terasort?
Terasort [4] is a popular benchmark for Hadoop and is also shipped with most Hadoop
distributions. This benchmark program sorts 1 terabyte of data. Each data item is 100 bytes in
size. The first 10 bytes of a data item constitute its sort key.
Each record is laid out as:
<key: 10 bytes><rowid: 10 bytes><filler: 78 bytes>\r\n
key    : random characters from ASCII 32 to 126
rowid  : an integer
filler : random characters from the set A to Z
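As a concrete illustration, one such 100-byte record can be sketched in Python. This is a minimal sketch of the layout described above; the zero-padded rowid encoding is our assumption, and the actual teragen generator differs in detail:

```python
import random

def make_record(rowid: int) -> bytes:
    """Build one 100-byte Terasort-style record:
    10-byte key, 10-byte rowid, 78-byte filler, CRLF terminator."""
    key = bytes(random.randint(32, 126) for _ in range(10))   # printable ASCII 32-126
    rid = str(rowid).zfill(10).encode("ascii")                # zero-padded (our assumption)
    filler = bytes(random.choice(range(ord("A"), ord("Z") + 1)) for _ in range(78))
    return key + rid + filler + b"\r\n"

rec = make_record(42)
assert len(rec) == 100
```

The first 10 bytes are the sort key, so sorting records lexicographically by their byte prefix sorts the whole dataset.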
The Terasort workload exercises all aspects of the cluster (CPU, network, disk, and memory) and also shuffles a large amount of data (240 GB). Moreover, it is representative of real-world workloads, as noted in the VL2 paper [3]:
“we consider an all-to-all data shuffle stress test: all servers simultaneously initiate TCP
transfers to all other servers. This data shuffle pattern arises in large-scale sorts, merges and
join operations in the data center. We chose this test because, in our interactions with application
developers, we learned that many use such operations with caution, because the operations are
highly expensive in today’s data center network. However, data shuffles are required, and, if data
shuffles can be efficiently supported, it could have large impact on the overall algorithmic and
data storage strategy.”
3.1.2. How Does It Work?
The Map phase of Terasort partitions the input keys into buckets and then leverages Hadoop's default sorting of Map output. The reducer merely collects the outputs of the different maps and does not perform any computation-intensive task. Because of its simple application logic and its reliance on Hadoop's default sorting mechanism, Terasort is considered a good benchmarking application.
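Terasort's real partitioner builds a trie from sampled keys so each reducer receives a contiguous key range; a simplified sketch using only the first key byte illustrates the idea (the function name and single-byte simplification are ours, not Terasort's actual code):

```python
def partition(key: bytes, num_reducers: int) -> int:
    """Range-partition on the first key byte so that reducer outputs,
    concatenated in reducer order, form a globally sorted file.
    Simplified stand-in for Terasort's sampled-trie partitioner."""
    lo, hi = 32, 126                       # printable key alphabet (ASCII 32-126)
    span = (hi - lo + 1) / num_reducers    # width of each reducer's key range
    return min(num_reducers - 1, int((key[0] - lo) / span))

# Keys that compare lower always land in an equal-or-lower bucket:
b1 = partition(b"Apple00000", 4)
b2 = partition(b"zebra00000", 4)
```

Because each reducer's key range is disjoint and ordered, Hadoop's per-reducer sort of map output is all that is needed to make the overall output globally sorted.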
3.2. Ranked Inverted Index
3.2.1. Why Ranked Inverted Index?
This benchmark was chosen because the Tarazu paper [4] lists it as a shuffle-heavy workload. A ranked inverted index is also used often in text processing and information retrieval tasks and is therefore a commonly executed job. For a given text corpus, it generates, for each word, the list of documents containing that word in decreasing order of frequency:
word -> (count1 | file1), (count2 | file2), ...
where count1 > count2 > ...
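The output format above can be sketched as a small in-memory Python function; this is an illustration of the index structure, not the benchmark's actual MapReduce implementation:

```python
from collections import Counter, defaultdict

def ranked_inverted_index(corpus: dict) -> dict:
    """corpus: {filename: text}.  Returns {word: [(count, file), ...]}
    with each posting list sorted by decreasing count, mirroring the
    word -> (count | file), ... lists above."""
    index = defaultdict(list)
    for fname, text in corpus.items():
        for word, count in Counter(text.split()).items():
            index[word].append((count, fname))
    return {w: sorted(pairs, reverse=True) for w, pairs in index.items()}

docs = {"f1": "a a b", "f2": "a b b b"}
idx = ranked_inverted_index(docs)
# idx["b"] == [(3, "f2"), (1, "f1")]
```

In the MapReduce version, the per-file word counts are the map output and the sorted posting lists are assembled in the reducers, which is what makes the job shuffle-heavy.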
5.2. Ranked Inverted Index (RII)
5.2.1. Data for a RII Run on Config 1
            Total Job    Map Time   Reduce Time   Shuffle Average   Shuffle
            Time (min)   (min)      (min)         Time              Time %
Config 1    12           5.5        11.5          3.5               27.14
5.2.2. CDF of Data
The CDF shows that network traffic occurs in three distinct phases. The first is the Map phase, during which there is steady traffic, though not at high rates. Once the shuffle is activated, the network traffic picks up; this is where 13 GB of data is transferred across the network in a short duration and we see the saturation point of the network. The burst of traffic that follows is the replication of the results to 3 nodes.
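The three-phase shape described above can be recovered from raw traffic measurements; a minimal sketch (the per-interval sample format is our assumption) of the cumulative-transfer curve underlying such a CDF:

```python
def cumulative_transfer(samples):
    """samples: bytes transferred in each equal-length time interval.
    Returns (interval index, running fraction of total bytes) pairs.
    Flat stretches of the curve mark quiet phases; steep stretches
    mark bursts such as the shuffle."""
    total = sum(samples)
    running, curve = 0, []
    for i, b in enumerate(samples):
        running += b
        curve.append((i, running / total))
    return curve

# Steady map traffic, a shuffle burst, then a replication burst:
curve = cumulative_transfer([1, 1, 1, 8, 8, 1, 4])
```

Plotting the second component against the first gives the phase boundaries directly as changes in slope.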
5.2.3. Network Activity
The following figure shows the network transfer rates over the lifetime of the job. During the Shuffle phase the network traffic reaches 1.5 Gbps, which we have not been able to explain, since the maximum expected rate should have been in the range of 700-800 Mbps.