This document summarizes a presentation on benchmarking Hadoop and big data systems. It gives an overview of common Hadoop micro-benchmarks, including TestDFSIO, TeraSort, NNBench and MRBench, which each stress individual Hadoop components. It also describes BigBench, a benchmark based on TPC-DS that aims to cover a more complete big data analytics workload using MapReduce, Hive and Mahout across structured, semi-structured and unstructured data. The presentation recommends using Hadoop distributions for cluster administration, and using both micro-benchmarks and end-to-end benchmarks such as BigBench for evaluation.
Hadoop & Big Data benchmarking

1. Hadoop & Big Data benchmarking
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
IWT TETRA User Group Meeting, November 28, 2014
1 / 62
5. Intro: Hadoop essentials: Hadoop 1.0
MapReduce and HDFS are the core components; the other components are built around this core.
(Source: Apache Hadoop YARN: moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014)
6. Intro: Hadoop essentials: Hadoop 2.0
YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.
(Source: Apache Hadoop YARN: moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014)
21. Benchmarks: Why benchmark?
My three reasons for using benchmarks:
1. Evaluating the effect of a hardware/software upgrade:
   OS, Java VM, ...
   Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2. Debugging: compare with other clusters or with published results.
3. Performance tuning: e.g. the Cloudera CDH default configuration.
25. Micro Benchmarks: TestDFSIO
Read and write test for HDFS. Helpful for:
- getting an idea of how fast your cluster is in terms of I/O,
- stress testing HDFS,
- discovering network performance bottlenecks,
- shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes).
27. Micro Benchmarks: TestDFSIO: write test
Write 10 files of size 1 GB for a total of 10 GB:

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

TestDFSIO is designed to use 1 map task per file.
30. Micro Benchmarks: TestDFSIO: write test output
Typical output of a write test:

----- TestDFSIO ----- : write
           Date & time: Mon Oct 06 10:21:28 CEST 2014
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
 IO rate std deviation: 1.4416050051562712
    Test exec time sec: 114.346
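The "Throughput mb/sec" figure is a per-task number: TestDFSIO divides the total MB by the summed task times. Since the 10 map tasks write concurrently, the wall-clock throughput of the cluster during the run is higher. A quick sanity check on the numbers above (a sketch of the arithmetic, not part of the TestDFSIO output):

```shell
# Wall-clock write throughput is roughly total MB / test exec time,
# which is higher than the per-task "Throughput mb/sec" figure because
# the 10 map tasks run concurrently:
awk -v mb=10000 -v t=114.346 \
    'BEGIN { printf "wall-clock write throughput: %.1f MB/s\n", mb / t }'
```

The gap between this and 10 × 12.9 MB/s reflects job startup overhead and the fact that not all tasks run perfectly in parallel.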
37. Micro Benchmarks: TestDFSIO: read test
Read 10 files, each of size 1 GB:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
38. Micro Benchmarks: TestDFSIO: read test output
Typical output of a read test:

----- TestDFSIO ----- : read
           Date & time: Mon Oct 06 10:56:15 CEST 2014
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 402.4306813151435
Average IO rate mb/sec: 492.8257751464844
 IO rate std deviation: 196.51233829270575
    Test exec time sec: 33.206
39. Micro Benchmarks: Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind that the HDFS replication factor plays an important role: a higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):

    HDFS replication factor      1          2         3
    Throughput                 190         25        13
    Average IO rate            190 ± 10    25 ± 3    13 ± 1
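One way to see why writes slow down: with replication factor r, every block is pipelined through r DataNodes, so the cluster physically stores r times the benchmark's nominal data volume. A small sketch of the disk traffic involved (the `-D dfs.replication` flag in the comment is an assumption; whether TestDFSIO picks it up depends on the Hadoop version):

```shell
# A 10 GB TestDFSIO write stores r * 10 GB of raw data on disk:
for r in 1 2 3; do
  awk -v r="$r" 'BEGIN { printf "replication %d: %d GB written to disk\n", r, r * 10 }'
done

# Rerunning the write test with a different replication factor might look like
# this (treat the -D flag as a sketch; support varies by Hadoop version):
#   hadoop jar hadoop-*test*.jar TestDFSIO -D dfs.replication=1 \
#       -write -nrFiles 10 -fileSize 1000
```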
40. Micro Benchmarks: TeraSort
Goal: sort 1 TB of data (or any other amount of data) as fast as possible.
- Probably the most well-known Hadoop benchmark.
- Combines testing the HDFS and MapReduce layers of a Hadoop cluster.
Typical areas where TeraSort is helpful:
- Iron out your Hadoop configuration.
49. Micro Benchmarks: NNBench
Goal: load test the NameNode hardware and software.
- Generates a lot of HDFS-related requests with normally very small payloads.
- Purpose: put a high HDFS management stress on the NameNode.
- Can simulate requests for creating, reading, renaming and deleting files.
52. Micro Benchmarks: NNBench: example
Create and write 1000 files using 12 maps and 6 reducers:

$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
53. Micro Benchmarks: MRBench
Goal: loop a small job a number of times.
- Checks whether small job runs are responsive and running efficiently on the cluster.
- Complementary to TeraSort: it puts its focus on the MapReduce layer, while the impact on the HDFS layer is very limited.
54. Micro Benchmarks: MRBench: example
Run a loop of 50 small test jobs:

$ hadoop jar hadoop-*test*.jar mrbench \
    -baseDir /user/bart/MRBench \
    -numRuns 50

Example output:

DataLines  Maps  Reduces  AvgTime (milliseconds)
        1     2        1                   28822

→ the average finish time of the executed jobs was 28 seconds.
59. BigBench
- Big Data benchmark based on TPC-DS.
- Focus is mostly on MapReduce engines.
- Collaboration between industry and academia.
- https://github.com/intel-hadoop/Big-Bench/
History:
- Launched at the First Workshop on Big Data Benchmarking (May 8-9, 2012).
- Full kit released at the Fifth Workshop on Big Data Benchmarking (August 5-6, 2014).
60. BigBench data model
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013.
61. BigBench: Data Model, the 3 V's
Variety:
- BigBench data is structured, semi-structured and unstructured.
Velocity:
- Periodic refreshes for all data.
- A different velocity for each area (V_structured, V_semi-structured, V_unstructured).
Volume:
- TPC-DS: discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000).
- BigBench: continuous scale factor.
80. Conclusions
- Use Hadoop distributions!
- For Hadoop cluster administration: Cloudera Manager.
- Combine micro-benchmarks with BigBench; they complement each other.
- Your best benchmark is your own application!