Benchmarking 
Hadoop & Big Data benchmarking 
Dr. ir. ing. Bart Vandewoestyne 
Sizing Servers Lab, Howest, Kortrijk 
IWT TETRA User Group Meeting - November 28, 2014 
Outline 
1 Intro: Hadoop essentials 
2 Cloudera demo 
3 Benchmarks 
Micro Benchmarks 
BigBench 
4 Conclusions 
Intro: Hadoop essentials 
Hadoop 
Hadoop is VMware, but the other way around. 
Hadoop 1.0 
Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
MapReduce and HDFS are the 
core components, while other 
components are built around the 
core. 
Hadoop 2.0 
Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
YARN adds a more general 
interface to run non-MapReduce 
jobs within the Hadoop 
framework. 
HDFS 
Hadoop Distributed File System 
Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 
MapReduce 
MapReduce = Programming Model 
WordCount example: 
Source: Optimizing Hadoop for MapReduce, Khaled Tannir 
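The WordCount figure did not survive extraction, so here is a minimal single-process sketch of the same programming model. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the shuffle step that Hadoop performs between mappers and reducers is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hello Hadoop", "Hello HDFS"]))
```

On a real cluster the map and reduce functions run as separate tasks on different nodes; only the division of labour between the two phases carries over from this sketch.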
Hadoop distributions 
Cloudera demo 
HDFS 
NameNode and DataNodes 
Hosts and their roles 
NameNode WebUI 
NameNode WebUI address 
http://sandy-quad-1.sslab.lan:50070/ 
Replication factor 
HDFS Blocks 
Hue: file upload
Hadoop jobs: counters/metrics 
Benchmarks 
Why benchmark? 
My three reasons for using benchmarks: 
1 Evaluating the effect of a hardware/software upgrade:
OS, Java VM, ...
Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2 Debugging:
Compare with other clusters or published results.
3 Performance tuning:
E.g. the Cloudera CDH default config is defensive, not optimal.
Micro Benchmarks 
Hadoop: Available tests 
hadoop jar /some/path/to/hadoop-*test*.jar 
TestDFSIO 
Read and write test for HDFS. 
Helpful for 
getting an idea of how fast your cluster is in terms of I/O, 
stress testing HDFS, 
discovering network performance bottlenecks,
shaking out the hardware, OS and Hadoop setup of your cluster
machines (particularly the NameNode and the DataNodes).
TestDFSIO: write test 
Generate 10 files of size 1 GB for a total of 10 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000
TestDFSIO is designed to use 1 map task per file
(1:1 mapping from files to map tasks).
TestDFSIO: write test output 
Typical output of write test 
----- TestDFSIO ----- : write 
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 12.874702111579893 
Average IO rate mb/sec: 13.013071060180664 
IO rate std deviation: 1.4416050051562712 
Test exec time sec: 114.346 
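When comparing several runs it helps to turn this summary into numbers. The following is a small sketch, not part of TestDFSIO itself: `parse_testdfsio` and `aggregate_rate` are hypothetical helper names, and the parser only assumes the `label: value` layout shown above.

```python
SAMPLE = """\
----- TestDFSIO ----- : write
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(text):
    """Turn the 'label: value' lines of a TestDFSIO summary into a dict."""
    result = {}
    for line in text.splitlines():
        if line.lstrip().startswith("-----"):
            continue  # skip the banner line
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'label: value' line
        value = value.strip()
        try:
            result[label.strip()] = float(value)
        except ValueError:
            result[label.strip()] = value  # keep non-numeric fields as text
    return result


def aggregate_rate(summary):
    """Total MB divided by wall-clock time of the whole test, in MB/s."""
    return summary["Total MBytes processed"] / summary["Test exec time sec"]


if __name__ == "__main__":
    summary = parse_testdfsio(SAMPLE)
    print(summary["Throughput mb/sec"], aggregate_rate(summary))
```

The aggregate rate is a different quantity from the reported throughput: it divides by the wall-clock time of the whole job, during which the map tasks run in parallel, whereas the reported throughput divides by the sum of the per-task times.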
Interpreting TestDFSIO results 
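Definition (Throughput)
\[
\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}
\]
Definition (Average IO rate)
\[
\text{Average IO rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}}{N}
\]
Here, N is the number of map tasks.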
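The two definitions differ only in where the division happens, which is easiest to see with numbers. The sketch below uses made-up per-task values (three 1000 MB files, one of them written four times slower); the function names are illustrative and not taken from the TestDFSIO source.

```python
def throughput(sizes_mb, times_s):
    """Sum of file sizes divided by sum of task times (MB/s)."""
    return sum(sizes_mb) / sum(times_s)


def average_io_rate(sizes_mb, times_s):
    """Mean of the per-task rates filesize_i / time_i (MB/s)."""
    rates = [size / time for size, time in zip(sizes_mb, times_s)]
    return sum(rates) / len(rates)


# Hypothetical run: N = 3 map tasks, the third one is a straggler.
sizes = [1000.0, 1000.0, 1000.0]
times = [10.0, 10.0, 40.0]

if __name__ == "__main__":
    print("Throughput:     ", throughput(sizes, times))
    print("Average IO rate:", average_io_rate(sizes, times))
```

With equal file sizes, throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so a single slow task lowers throughput more than it lowers the average IO rate. A large gap between the two values hints at uneven task performance.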
TestDFSIO: read test 
Read 10 input files, each of size 1 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000
TestDFSIO: read test output 
Typical output of read test 
----- TestDFSIO ----- : read 
Date & time: Mon Oct 06 10:56:15 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 402.4306813151435 
Average IO rate mb/sec: 492.8257751464844 
IO rate std deviation: 196.51233829270575 
Test exec time sec: 33.206 
Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind:
The HDFS replication factor plays an important role!
A higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):

HDFS replication factor    1           2          3
Throughput                 190         25         13
Average IO-rate            190 ± 10    25 ± 3     13 ± 1
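To put the table in relative terms, the short sketch below computes how much slower the write throughput is compared with replication factor 1. The numbers are the measured throughputs from the table; the helper name `slowdown` is made up for this example.

```python
# Measured write throughput in MB/s per HDFS replication factor (from the table).
throughput_mb_s = {1: 190.0, 2: 25.0, 3: 13.0}


def slowdown(table, baseline=1):
    """Return, per replication factor, how many times slower it is than the baseline."""
    base = table[baseline]
    return {factor: base / value for factor, value in table.items()}


if __name__ == "__main__":
    for factor, ratio in slowdown(throughput_mb_s).items():
        print(f"replication {factor}: {ratio:.1f}x slower than replication 1")
```

On this cluster the drop is far steeper than the doubling or tripling of written bytes alone would suggest, so TestDFSIO write results are only comparable between runs that use the same replication factor.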
TeraSort 
Goal 
Sort 1TB of data (or any other amount of data) as fast as possible. 
Probably the best-known Hadoop benchmark.
Combines testing the HDFS and MapReduce layers of a
Hadoop cluster.
Typical areas where TeraSort is helpful
Iron out your Hadoop configuration after your cluster has passed a
convincing TestDFSIO benchmark first.
Determine whether your MapReduce-related parameters are
set to proper values.
TeraSort: workflow
TeraGen 
/user/bart/terasort-input 
TeraSort 
/user/bart/terasort-output 
TeraValidate 
/user/bart/terasort-validate 
$ hadoop jar hadoop-mapreduce-examples.jar \
    teragen 10000000000 /user/bart/input
≈ 4 hours on our 4-node cluster
$ hadoop jar hadoop-mapreduce-examples.jar \
    terasort /user/bart/input /user/bart/output
≈ 5 hours on our 4-node cluster
$ hadoop jar hadoop-mapreduce-examples.jar \
    teravalidate /user/bart/output /user/bart/validate
If something went wrong, TeraValidate's output contains the
problem report.
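The first argument of teragen is a number of rows, not a number of bytes. TeraGen writes fixed-size rows of 100 bytes, so 10000000000 rows correspond to 1 TB. The sketch below does that conversion for other data sizes; `teragen_rows` and `teragen_command` are hypothetical helpers, and the jar name is simply the one used in the commands above.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of TeraGen rows needed to produce target_bytes of input data."""
    return target_bytes // ROW_BYTES


def teragen_command(target_bytes, output_dir, jar="hadoop-mapreduce-examples.jar"):
    """Build the teragen command line for the requested data size."""
    return f"hadoop jar {jar} teragen {teragen_rows(target_bytes)} {output_dir}"


if __name__ == "__main__":
    one_terabyte = 10**12
    print(teragen_command(one_terabyte, "/user/bart/input"))
```

Starting with a much smaller size, for example 10**9 bytes, gives a quick end-to-end check of the TeraGen, TeraSort and TeraValidate chain before committing to a run that takes hours.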
TeraSort: duration 
TeraSort: counters 
NNBench 
Goal 
Load test the NameNode hardware and software. 
Generates a lot of HDFS-related requests with normally very 
small payloads. 
Purpose: put a high HDFS management stress on the 
NameNode. 
Can simulate requests for creating, reading, renaming and 
deleting files on HDFS.
NNBench: example 
Create 1000 files using 12 maps and 6 reducers:
$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
MRBench 
Goal 
Loop a small job a number of times. 
checks whether small job runs are responsive and running
efficiently on the cluster
complementary to TeraSort
puts its focus on the MapReduce layer 
impact on the HDFS layer is very limited 
MRBench: example 
Run a loop of 50 small test jobs: 
$ hadoop jar hadoop-*test*.jar \
    mrbench -baseDir /user/bart/MRBench \
    -numRuns 50
Example output:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        28822
→ The average finish time of the executed jobs was about 28.8 seconds.
BigBench 
Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 
Big Data benchmark based on TPC-DS. 
Focus is mostly on MapReduce engines. 
Collaboration between industry and academia. 
https://github.com/intel-hadoop/Big-Bench/ 
History 
Launched at First Workshop on Big Data Benchmarking 
(May 8-9, 2012). 
Full kit at Fifth Workshop on Big Data Benchmarking 
(August 5-6, 2014). 
BigBench data model 
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 
BigBench: Data Model - 3 V's 
Variety 
BigBench data is 
structured, 
semi-structured, 
unstructured. 
Velocity 
Periodic refreshes for all data. 
Different velocity for different areas:
Vstructured  Vunstructured  Vsemistructured 
Volume 
TPC-DS: discrete scale factors 
(100, 300, 1000, 3000, 10000, 30000 and 100000).
BigBench: continuous scale factor. 
BigBench: Workload 
Workload queries 
30 queries 
Speci

Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Hadoop & Big Data benchmarking

  • 1. Benchmarking Hadoop & Big Data benchmarking Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk IWT TETRA User Group Meeting - November 28, 2014 1 / 62
  • 2. Benchmarking Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 2 / 62
  • 3. Benchmarking Intro: Hadoop essentials Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 3 / 62
  • 4. Benchmarking Intro: Hadoop essentials Hadoop Hadoop is VMware, but the other way around. 4 / 62
  • 5. Benchmarking Intro: Hadoop essentials Hadoop 1.0 (Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014) MapReduce and HDFS are the core components, while other components are built around the core. 5 / 62
  • 6. Benchmarking Intro: Hadoop essentials Hadoop 2.0 (Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014) YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework. 6 / 62
  • 7. Benchmarking Intro: Hadoop essentials HDFS Hadoop Distributed File System Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 7 / 62
  • 8. Benchmarking Intro: Hadoop essentials MapReduce MapReduce = Programming Model WordCount example: Source: Optimizing Hadoop for MapReduce, Khaled Tannir 8 / 62
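The WordCount example is shown on the slide as a figure only. As a rough illustration of the programming model (not Hadoop's Java API), the map, shuffle and reduce steps can be sketched in plain Python. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in-process on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hello Hadoop", "hello world"]))
```

On a real cluster the map and reduce calls run as separate tasks on different nodes; here everything runs in one process, which only shows the data flow.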
  • 9. Benchmarking Intro: Hadoop essentials Hadoop distributions 9 / 62
  • 10. Benchmarking Cloudera demo Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 10 / 62
  • 12. Benchmarking Cloudera demo NameNode and DataNodes 12 / 62
  • 13. Benchmarking Cloudera demo Hosts and their roles 13 / 62
  • 14. Benchmarking Cloudera demo NameNode WebUI NameNode WebUI address http://sandy-quad-1.sslab.lan:50070/ 14 / 62
  • 15. Benchmarking Cloudera demo Replication factor 15 / 62
  • 16. Benchmarking Cloudera demo HDFS Blocks 16 / 62
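The replication factor and block slides are screenshots, so the following is only a back-of-the-envelope sketch of how the two settings interact. It assumes a 128 MB block size and a replication factor of 3 (the usual Hadoop 2 defaults; the demo cluster may be configured differently). The helper name `hdfs_footprint` is hypothetical.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw MB stored across the cluster) for one file.

    block_size_mb and replication are assumed defaults, not values taken from the slides.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partially filled last block only occupies the bytes it actually holds,
    # so the raw footprint is simply the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

Under these assumptions, a 1000 MB file is split into 8 blocks and occupies 3000 MB of raw disk space.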
  • 18. Benchmarking Cloudera demo file upload 17 / 62
  • 19. Benchmarking Cloudera demo Hadoop jobs: counters/metrics 18 / 62
  • 20. Benchmarking Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 19 / 62
  • 21. Benchmarking Benchmarks Why benchmark? My three reasons for using benchmarks: 1. Evaluating the effect of a hardware/software upgrade: OS, Java VM, ... Hadoop, Cloudera CDH, Pig, Hive, Impala, ... 2. Debugging: compare with other clusters or published results. 3. Performance tuning: e.g. the Cloudera CDH default config is defensive, not optimal. 20 / 62
  • 23. Benchmarking Benchmarks Micro Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 21 / 62
  • 24. Benchmarking Benchmarks Micro Benchmarks Hadoop: Available tests hadoop jar /some/path/to/hadoop-*test*.jar 22 / 62
  • 25. Benchmarking Benchmarks Micro Benchmarks TestDFSIO Read and write test for HDFS. Helpful for getting an idea of how fast your cluster is in terms of I/O, stress testing HDFS, discovering network performance bottlenecks, and shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes). 23 / 62
  • 26. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test Generate 10 files of size 1 GB for a total of 10 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 TestDFSIO is designed to use 1 map task per file (1:1 mapping from files to map tasks) 24 / 62
  • 30. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test output Typical output of write test ----- TestDFSIO ----- : write Date time: Mon Oct 06 10:21:28 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 12.874702111579893 Average IO rate mb/sec: 13.013071060180664 IO rate std deviation: 1.4416050051562712 Test exec time sec: 114.346 25 / 62
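When comparing several runs, it can help to turn such a report into a dictionary. The sketch below is not part of TestDFSIO itself: it assumes the report consists of `key: value` lines exactly as printed on the slide, and the function name `parse_testdfsio` is hypothetical.

```python
# Report text copied from the slide above.
SAMPLE = """\
----- TestDFSIO ----- : write
Date time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(report):
    """Parse 'key: value' lines into a dict; numeric values become floats."""
    result = {}
    for line in report.splitlines():
        # Split at the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(" -"), value.strip()
        try:
            result[key] = float(value)
        except ValueError:
            result[key] = value
    return result


if __name__ == "__main__":
    for key, value in parse_testdfsio(SAMPLE).items():
        print(f"{key!r}: {value!r}")
```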
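  • 31. Benchmarking Benchmarks Micro Benchmarks Interpreting TestDFSIO results Definition (Throughput): $\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}$ Definition (Average IO rate): $\mathrm{Average\ IO\ rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i / \mathrm{time}_i}{N}$ Here, N is the number of map tasks. 26 / 62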
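To make the difference between the two definitions concrete, here is a small worked example with made-up numbers (two map tasks writing 1000 MB each, one taking 10 s and one taking 40 s). The function names are hypothetical; the formulas are the two definitions from this slide.

```python
def throughput(filesizes_mb, times_s):
    """Total data divided by total task time (MB/s)."""
    return sum(filesizes_mb) / sum(times_s)


def average_io_rate(filesizes_mb, times_s):
    """Mean of the per-task rates filesize_i / time_i (MB/s)."""
    rates = [size / t for size, t in zip(filesizes_mb, times_s)]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    sizes = [1000, 1000]   # MB per map task (illustrative values)
    times = [10, 40]       # seconds per map task (illustrative values)
    print(throughput(sizes, times))       # 2000 / 50
    print(average_io_rate(sizes, times))  # (100 + 25) / 2
```

With equal file sizes, the throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so the average IO rate is never below the throughput. A large gap between the two, together with a large standard deviation, points to some map tasks being much slower than others.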
  • 36. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test Read 10 input files, each of size 1 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 27 / 62
  • 38. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test output Typical output of read test ----- TestDFSIO ----- : read Date time: Mon Oct 06 10:56:15 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 402.4306813151435 Average IO rate mb/sec: 492.8257751464844 IO rate std deviation: 196.51233829270575 Test exec time sec: 33.206 28 / 62
  • 39. Benchmarking Benchmarks Micro Benchmarks Influence of HDFS replication factor When interpreting TestDFSIO results, keep in mind: The HDFS replication factor plays an important role! A higher replication factor leads to slower writes. For three identical TestDFSIO write runs (units are MB/s):
      HDFS replication factor | 1        | 2      | 3
      Throughput              | 190      | 25     | 13
      Average IO rate         | 190 ± 10 | 25 ± 3 | 13 ± 1
    29 / 62
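As a quick worked calculation on the write throughput figures from this slide (190, 25 and 13 MB/s for replication factors 1, 2 and 3), the slowdown relative to a replication factor of 1 can be computed as follows. The variable names are chosen for this sketch.

```python
# Write throughput in MB/s per HDFS replication factor, as reported on the slide.
throughput_mb_s = {1: 190, 2: 25, 3: 13}

# Slowdown of each run relative to the run with replication factor 1.
slowdown = {
    factor: throughput_mb_s[1] / value
    for factor, value in throughput_mb_s.items()
}

if __name__ == "__main__":
    for factor, ratio in slowdown.items():
        print(f"replication {factor}: {ratio:.1f}x slower than replication 1")
```

On this cluster, going from one replica to two costs far more than a factor of two, which suggests that the network or the remote disks, rather than the local disk, limit replicated writes.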
  • 40. Benchmarking Benchmarks Micro Benchmarks TeraSort Goal: Sort 1 TB of data (or any other amount of data) as fast as possible. Probably the best-known Hadoop benchmark. Combines testing the HDFS and MapReduce layers of a Hadoop cluster. Typical areas where TeraSort is helpful: Iron out your Hadoop configuration after your cluster passed a convincing TestDFSIO benchmark first. Determine whether your MapReduce-related parameters are set to proper values. 30 / 62
  • 43. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow TeraGen → /user/bart/terasort-input → TeraSort → /user/bart/terasort-output → TeraValidate → /user/bart/terasort-validate 31 / 62
  • 44. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input 4 hours on our 4-node cluster 32 / 62
  • 45. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input 4 hours on our 4-node cluster hadoop jar hadoop-mapreduce-examples.jar terasort /user/bart/input /user/bart/output 5 hours on our 4-node cluster 33 / 62
  • 46. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input 4 hours on our 4-node cluster hadoop jar hadoop-mapreduce-examples.jar terasort /user/bart/input /user/bart/output 5 hours on our 4-node cluster hadoop jar hadoop-mapreduce-examples.jar teravalidate /user/bart/output /user/bart/validate If something went wrong, TeraValidate's output contains the problem report. 34 / 62
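The `teragen` argument is a number of rows, not a number of bytes. TeraGen writes fixed-size 100-byte rows, so 10000000000 rows correspond to 1 TB. The sketch below does that arithmetic and derives an effective sort rate, taking the "5 hours" from the slide at face value as the TeraSort duration. The function name `terasort_stats` is hypothetical.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def terasort_stats(rows, sort_hours):
    """Return (total bytes generated, effective sort rate in MB/s)."""
    total_bytes = rows * ROW_BYTES
    mb_per_s = total_bytes / 1e6 / (sort_hours * 3600)
    return total_bytes, mb_per_s


if __name__ == "__main__":
    total, rate = terasort_stats(10_000_000_000, sort_hours=5)
    print(f"{total / 1e12:.1f} TB sorted at about {rate:.1f} MB/s")
```

A rate derived this way is an end-to-end figure for the whole cluster; it includes reading the input, the shuffle and writing the replicated output.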
  • 47. Benchmarking Benchmarks Micro Benchmarks TeraSort: duration 35 / 62
  • 48. Benchmarking Benchmarks Micro Benchmarks TeraSort: counters 36 / 62
  • 49. Benchmarking Benchmarks Micro Benchmarks NNBench Goal: Load test the NameNode hardware and software. Generates a lot of HDFS-related requests with normally very small payloads. Purpose: put a high HDFS management stress on the NameNode. Can simulate requests for creating, reading, renaming and deleting files on HDFS. 37 / 62
  • 51. Benchmarking Benchmarks Micro Benchmarks NNBench: example Create 1000 files using 12 maps and 6 reducers: $ hadoop jar hadoop-*test*.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /user/bart/NNBench-`hostname -s` 38 / 62
  • 53. Benchmarking Benchmarks Micro Benchmarks MRBench Goal: Loop a small job a number of times. Checks whether small job runs are responsive and running efficiently on the cluster. Complementary to TeraSort. Puts its focus on the MapReduce layer. Impact on the HDFS layer is very limited. 39 / 62
  • 54. Benchmarking Benchmarks Micro Benchmarks MRBench: example Run a loop of 50 small test jobs: $ hadoop jar hadoop-*test*.jar mrbench -baseDir /user/bart/MRBench -numRuns 50 40 / 62
  • 55. Benchmarking Benchmarks Micro Benchmarks MRBench: example Run a loop of 50 small test jobs: $ hadoop jar hadoop-*test*.jar mrbench -baseDir /user/bart/MRBench -numRuns 50 Example output: DataLines Maps Reduces AvgTime (milliseconds) 1 2 1 28822 → the average finish time of the executed jobs was about 28.8 seconds. 41 / 62
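The MRBench result row is easy to misread because the average time is reported in milliseconds. A minimal sketch for turning the row from the slide into named fields, assuming the four-column layout shown there (the function name `parse_mrbench_row` and the field names are hypothetical):

```python
def parse_mrbench_row(row):
    """Map the whitespace-separated MRBench result row to named integer fields."""
    names = ["DataLines", "Maps", "Reduces", "AvgTime_ms"]
    result = dict(zip(names, (int(value) for value in row.split())))
    # Add the average job time in seconds for readability.
    result["AvgTime_s"] = result["AvgTime_ms"] / 1000
    return result


if __name__ == "__main__":
    print(parse_mrbench_row("1 2 1 28822"))
```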
  • 57. Benchmarking Benchmarks BigBench Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 42 / 62
  • 58. Benchmarking Benchmarks BigBench BigBench Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 43 / 62
  • 59. Benchmarking Benchmarks BigBench BigBench Big Data benchmark based on TPC-DS. Focus is mostly on MapReduce engines. Collaboration between industry and academia. https://github.com/intel-hadoop/Big-Bench/ History Launched at First Workshop on Big Data Benchmarking (May 8-9, 2012). Full kit at Fifth Workshop on Big Data Benchmarking (August 5-6, 2014). 44 / 62
  • 60. Benchmarking Benchmarks BigBench BigBench data model Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 45 / 62
  • 61. Benchmarking Benchmarks BigBench BigBench: Data Model - 3 V's Variety: BigBench data is structured, semi-structured and unstructured. Velocity: Periodic refreshes for all data. Different velocity for different areas: $V_{structured} < V_{unstructured} < V_{semistructured}$. Volume: TPC-DS: discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000). BigBench: continuous scale factor. 46 / 62
  • 62. Benchmarking Benchmarks BigBench BigBench: Workload Workload queries: 30 queries. Specified in English (sort of). No required syntax (first implementation in Aster SQL-MR). Kit implemented in Hive, Hadoop MR, Mahout, OpenNLP. Business functions (McKinsey): Marketing, Merchandising, Operations, Supply chain, Reporting (customers and products). 47 / 62
  • 65. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects
      Data Sources        | Number of Queries | Percentage
      Structured          | 18                | 60 %
      Semi-structured     | 7                 | 23 %
      Unstructured        | 5                 | 17 %

      Analytic techniques | Number of Queries | Percentage
      Statistics analysis | 6                 | 20 %
      Data mining         | 17                | 57 %
      Reporting           | 8                 | 27 %
    48 / 62
  • 66. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects
      Query Types | Number of Queries | Percentage
      Pure HiveQL | 14                | 46 %
      Mahout      | 5                 | 17 %
      OpenNLP     | 5                 | 17 %
      Custom MR   | 6                 | 20 %
    49 / 62
  • 67. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects
      Query Types | Number of Queries | Percentage
      Pure HiveQL | 14                | 46 %
      Mahout      | 5                 | 17 %
      OpenNLP     | 5                 | 17 %
      Custom MR   | 6                 | 20 %
    Note that your implementation may vary! 50 / 62
  • 68. Benchmarking Benchmarks BigBench BigBench: Benchmark Process Source: http://www.tele-task.de/archive/video/flash/24896/ 51 / 62
  • 69. Benchmarking Benchmarks BigBench BigBench: Metric Number of queries run: $30 \cdot (2S + 1)$ Measured times: $T_L$: loading process; $T_P$: power test; $T_{TT_1}$: first throughput test; $T_{T_{DM}}$: data maintenance task; $T_{TT_2}$: second throughput test. Definition (BigBench queries per hour): $\mathrm{BBQpH} = \frac{30 \cdot 3 \cdot S \cdot 3600}{S \cdot T_L + S \cdot T_P + T_{TT_1} + S \cdot T_{T_{DM}} + T_{TT_2}}$ Similar to TPC-DS metric. 52 / 62
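The BBQpH definition can be evaluated directly once the five times are known. The sketch below assumes that all times are in seconds (which the factor 3600 suggests) and that S is the number of concurrent query streams; the slide states neither explicitly. The timing values in the example are invented for illustration and are not measurements from the talk.

```python
def queries_run(S):
    """Number of queries executed in one benchmark run: 30 * (2S + 1)."""
    return 30 * (2 * S + 1)


def bbqph(S, t_load, t_power, t_tt1, t_dm, t_tt2):
    """BigBench queries per hour; all times assumed to be in seconds."""
    numerator = 30 * 3 * S * 3600
    denominator = S * t_load + S * t_power + t_tt1 + S * t_dm + t_tt2
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical times in seconds, for illustration only.
    score = bbqph(S=2, t_load=1000, t_power=2000, t_tt1=3000, t_dm=500, t_tt2=3000)
    print(queries_run(2), round(score, 2))
```

Because the load, power and data maintenance times are multiplied by S, they weigh more heavily in the score as the number of streams grows.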
  • 72. Benchmarking Benchmarks BigBench BigBench: results 53 / 62
  • 73. Benchmarking Benchmarks BigBench BigBench: monitoring 54 / 62
  • 74. Benchmarking Benchmarks BigBench BigBench: monitoring 55 / 62
  • 75. Benchmarking Benchmarks BigBench BigBench: monitoring 56 / 62
  • 76. Benchmarking Benchmarks BigBench BigBench: monitoring 57 / 62
  • 77. Benchmarking Benchmarks BigBench BigBench: in progress 58 / 62 Source: The Hortonworks Blog
  • 78. Benchmarking Conclusions Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 59 / 62
  • 79. Benchmarking Conclusions Conclusions Use Hadoop distributions! Hadoop cluster administration → Cloudera Manager. Micro-benchmarks ↔ BigBench. 60 / 62
  • 80. Benchmarking Conclusions Conclusions Use Hadoop distributions! Hadoop cluster administration → Cloudera Manager. Micro-benchmarks ↔ BigBench. Your best benchmark is your own application! 61 / 62
  • 81. Benchmarking Conclusions Questions? Source: https://gigaom.com/2011/12/19/my-hadoop-is-bigger-than-yours/ 62 / 62