Benchmarking 
Hadoop & Big Data benchmarking 
Dr. ir. ing. Bart Vandewoestyne 
Sizing Servers Lab, Howest, Kortrijk 
IWT TETRA User Group Meeting - November 28, 2014 
Outline 
1 Intro: Hadoop essentials 
2 Cloudera demo 
3 Benchmarks 
Micro Benchmarks 
BigBench 
4 Conclusions 
Intro: Hadoop essentials 
Hadoop 
Hadoop is VMware, but the other way around. 
Hadoop 1.0 
Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014
MapReduce and HDFS are the core components, while other components are built around the core.
Hadoop 2.0 
Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014
YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.
HDFS 
Hadoop Distributed File System 
Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 
MapReduce 
MapReduce = Programming Model 
WordCount example: 
Source: Optimizing Hadoop for MapReduce, Khaled Tannir 
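Since the WordCount figure did not survive extraction, the programming model can be sketched in plain Python. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names and the three sample lines are assumptions made for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical sample input, one string per input line.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
print(word_count(lines))
```

In a real cluster the map and reduce steps run as separate tasks on different nodes, and the shuffle moves data between them over the network.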
Hadoop distributions 
Cloudera demo 
HDFS 
NameNode and DataNodes 
Hosts and their roles 
NameNode WebUI 
NameNode WebUI address 
http://sandy-quad-1.sslab.lan:50070/ 
Replication factor 
HDFS Blocks 
Hue: file upload
Hadoop jobs: counters/metrics 
Benchmarks 
Why benchmark? 
My three reasons for using benchmarks: 
1 Evaluating the effect of a hardware/software upgrade:
OS, Java VM, ...
Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2 Debugging:
Compare with other clusters or published results.
3 Performance tuning:
E.g. the Cloudera CDH default config is defensive, not optimal.
Micro Benchmarks 
Hadoop: Available tests 
hadoop jar /some/path/to/hadoop-*test*.jar 
TestDFSIO 
Read and write test for HDFS. 
Helpful for
getting an idea of how fast your cluster is in terms of I/O,
stress testing HDFS,
discovering network performance bottlenecks,
shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes).
TestDFSIO: write test 
Generate 10 files of size 1 GB for a total of 10 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -write -nrFiles 10 -fileSize 1000
TestDFSIO is designed to use 1 map task per file (1:1 mapping from files to map tasks).
TestDFSIO: write test output 
Typical output of write test 
----- TestDFSIO ----- : write 
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 12.874702111579893 
Average IO rate mb/sec: 13.013071060180664 
IO rate std deviation: 1.4416050051562712 
Test exec time sec: 114.346 
Interpreting TestDFSIO results 
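Definition (Throughput)
\[
\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}
\]
Definition (Average IO rate)
\[
\text{Average IO rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}}{N}
\]
Here, N is the number of map tasks.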
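The difference between the two definitions is easier to see with numbers. The sketch below applies both formulas to three made-up map tasks; the per-task sizes and times are hypothetical and do not come from the runs shown in these slides.

```python
# Hypothetical per-map-task measurements: (file size in MB, elapsed time in s).
tasks = [(1000.0, 50.0), (1000.0, 100.0), (1000.0, 200.0)]


def throughput(tasks):
    """Total data divided by total task time (sum of sizes / sum of times)."""
    total_mb = sum(size for size, _ in tasks)
    total_s = sum(seconds for _, seconds in tasks)
    return total_mb / total_s


def average_io_rate(tasks):
    """Arithmetic mean of the individual per-task rates (size_i / time_i)."""
    rates = [size / seconds for size, seconds in tasks]
    return sum(rates) / len(rates)


print(f"Throughput:      {throughput(tasks):.2f} MB/s")       # 3000 / 350
print(f"Average IO rate: {average_io_rate(tasks):.2f} MB/s")  # (20 + 10 + 5) / 3
```

When all files have the same size, the throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so the average IO rate can never be the smaller of the two. A large gap between them suggests that some map tasks were much slower than others.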
TestDFSIO: read test 
Read 10 input files, each of size 1 GB:
$ hadoop jar hadoop-*test*.jar \
    TestDFSIO -read -nrFiles 10 -fileSize 1000
TestDFSIO: read test output 
Typical output of read test 
----- TestDFSIO ----- : read 
Date & time: Mon Oct 06 10:56:15 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 402.4306813151435 
Average IO rate mb/sec: 492.8257751464844 
IO rate std deviation: 196.51233829270575 
Test exec time sec: 33.206 
Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind:
The HDFS replication factor plays an important role!
A higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):

                   HDFS replication factor
                   1           2          3
Throughput         190         25         13
Average IO rate    190 ± 10    25 ± 3     13 ± 1
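To put the table in perspective, the following sketch computes how much slower each run was relative to replication factor 1, and how much data the cluster has to store for a given logical size. The throughput values are copied from the table; the 10,000 MB logical size is an assumption borrowed from the earlier write-test example, since the slide does not state the size used for these three runs.

```python
# Throughput in MB/s per HDFS replication factor, copied from the table.
throughput_mb_s = {1: 190, 2: 25, 3: 13}

# Assumption: 10 files of 1000 MB each, as in the earlier write test.
LOGICAL_MB = 10_000


def slowdown(throughput_by_rf, rf):
    """Factor by which writes are slower than with replication factor 1."""
    return throughput_by_rf[1] / throughput_by_rf[rf]


def physical_mb(logical_mb, rf):
    """Data stored cluster-wide: every block is kept rf times."""
    return logical_mb * rf


for rf in sorted(throughput_mb_s):
    print(
        f"replication factor {rf}: "
        f"{slowdown(throughput_mb_s, rf):.1f}x slower than factor 1, "
        f"{physical_mb(LOGICAL_MB, rf)} MB stored"
    )
```

The measured slowdown is much larger than the increase in stored data alone would suggest. This is consistent with replicas having to travel over the network to other DataNodes, although the slides do not break the cost down further.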
TeraSort 
Goal 
Sort 1 TB of data (or any other amount of data) as fast as possible.
Probably the best-known Hadoop benchmark.
Combines testing the HDFS and MapReduce layers of a Hadoop cluster.
Typical areas where TeraSort is helpful
Iron out your Hadoop configuration after your cluster has first passed a convincing TestDFSIO benchmark.
Determine whether your MapReduce-related parameters are set to proper values.
TeraSort: workflow
TeraGen 
/user/bart/terasort-input 
TeraSort 
/user/bart/terasort-output 
TeraValidate 
/user/bart/terasort-validate 
TeraSort: workflow
hadoop jar hadoop-mapreduce-examples.jar \
    teragen 10000000000 /user/bart/input
≈ 4 hours on our 4-node cluster
hadoop jar hadoop-mapreduce-examples.jar \
    terasort /user/bart/input /user/bart/output
≈ 5 hours on our 4-node cluster
hadoop jar hadoop-mapreduce-examples.jar \
    teravalidate /user/bart/output /user/bart/validate
If something went wrong, TeraValidate's output contains the problem report.
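The first argument to teragen is a number of rows, not a number of bytes. TeraGen writes fixed-size rows of 100 bytes, so the value used above corresponds to 1 TB (decimal). The sketch below shows the arithmetic and how to pick a row count for a smaller trial run. The helper names are made up for illustration, and the sort rate relies on the roughly 5 hours quoted for the 4-node cluster.

```python
ROW_BYTES = 100  # TeraGen generates fixed-size 100-byte rows


def dataset_bytes(rows):
    """Size of the generated data set for a given teragen row count."""
    return rows * ROW_BYTES


def teragen_rows(target_bytes):
    """Row count to pass to teragen for a desired data set size."""
    return target_bytes // ROW_BYTES


rows = 10_000_000_000  # value used in the teragen command above
print(dataset_bytes(rows))       # 10^12 bytes, i.e. 1 TB
print(teragen_rows(10 * 10**9))  # rows needed for a 10 GB trial run

# Rough sort rate, assuming the sort took about 5 hours as quoted.
sort_seconds = 5 * 3600
print(f"{dataset_bytes(rows) / sort_seconds / 1e6:.1f} MB/s")
```

Starting with a much smaller row count is a cheap way to check that the three jobs run end to end before committing the cluster to a multi-hour run.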
TeraSort: duration 
TeraSort: counters 
NNBench 
Goal 
Load test the NameNode hardware and software. 
Generates a lot of HDFS-related requests with normally very 
small payloads. 
Purpose: put a high HDFS management stress on the 
NameNode. 
Can simulate requests for creating, reading, renaming and deleting files on HDFS.
NNBench: example 
Create 1000 files using 12 maps and 6 reducers:
$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
MRBench 
Goal 
Loop a small job a number of times.
Checks whether small job runs are responsive and running efficiently on the cluster.
Complementary to TeraSort.
Puts its focus on the MapReduce layer.
Impact on the HDFS layer is very limited.
MRBench: example 
Run a loop of 50 small test jobs:
$ hadoop jar hadoop-*test*.jar \
    mrbench -baseDir /user/bart/MRBench \
    -numRuns 50
Example output:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        28822
→ The average finish time of the executed jobs was about 28.8 seconds.
BigBench 
BigBench 
Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 
BigBench 
Big Data benchmark based on TPC-DS. 
Focus is mostly on MapReduce engines. 
Collaboration between industry and academia. 
https://github.com/intel-hadoop/Big-Bench/ 
History 
Launched at First Workshop on Big Data Benchmarking 
(May 8-9, 2012). 
Full kit at Fifth Workshop on Big Data Benchmarking 
(August 5-6, 2014). 
BigBench data model 
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 
BigBench: Data Model - 3 V's 
Variety 
BigBench data is 
structured, 
semi-structured, 
unstructured. 
Velocity 
Periodic refreshes for all data. 
Different velocity for different areas:
V_structured < V_unstructured < V_semistructured
Volume 
TPC-DS: discrete scale factors 
(100, 300, 1000, 3000, 10000, 30000 and 100000).
BigBench: continuous scale factor. 
BigBench: Workload 
Workload queries 
30 queries 
Speci

Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Hadoop & Big Data benchmarking

  • 1. Benchmarking Hadoop & Big Data benchmarking Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk IWT TETRA User Group Meeting - November 28, 2014 1 / 62
  • 2. Benchmarking Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 2 / 62
  • 3. Benchmarking Intro: Hadoop essentials Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 3 / 62
  • 4. Benchmarking Intro: Hadoop essentials Hadoop Hadoop is VMware, but the other way around. 4 / 62
  • 5. Benchmarking Intro: Hadoop essentials Hadoop 1.0 (Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014) MapReduce and HDFS are the core components, while other components are built around the core. 5 / 62
  • 6. Benchmarking Intro: Hadoop essentials Hadoop 2.0 (Source: Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014) YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework. 6 / 62
  • 7. Benchmarking Intro: Hadoop essentials HDFS Hadoop Distributed File System Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 7 / 62
  • 8. Benchmarking Intro: Hadoop essentials MapReduce MapReduce = Programming Model WordCount example: Source: Optimizing Hadoop for MapReduce, Khaled Tannir 8 / 62
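The WordCount figure on this slide is not preserved in the transcript. As a rough illustration of the programming model only, the following is a minimal single-process sketch in plain Python; it does not use the Hadoop API, and the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, chosen only to show the three phases.
    print(word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]))
```

On a real cluster the map and reduce functions run as distributed tasks and the shuffle is handled by the framework; here everything runs in one process.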
  • 9. Benchmarking Intro: Hadoop essentials Hadoop distributions 9 / 62
  • 10. Benchmarking Cloudera demo Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 10 / 62
  • 12. Benchmarking Cloudera demo NameNode and DataNodes 12 / 62
  • 13. Benchmarking Cloudera demo Hosts and their roles 13 / 62
  • 14. Benchmarking Cloudera demo NameNode WebUI NameNode WebUI address http://sandy-quad-1.sslab.lan:50070/ 14 / 62
  • 15. Benchmarking Cloudera demo Replication factor 15 / 62
  • 16. Benchmarking Cloudera demo HDFS Blocks 16 / 62
  • 18. file upload 17 / 62
  • 19. Benchmarking Cloudera demo Hadoop jobs: counters/metrics 18 / 62
  • 20. Benchmarking Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 19 / 62
  • 21. Benchmarking Benchmarks Why benchmark? My three reasons for using benchmarks: 1 Evaluating the effect of a hardware/software upgrade: OS, Java VM, ... Hadoop, Cloudera CDH, Pig, Hive, Impala, ... 2 Debugging: compare with other clusters or published results. 3 Performance tuning: e.g. the Cloudera CDH default config is defensive, not optimal. 20 / 62
  • 23. Benchmarking Benchmarks Micro Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 21 / 62
  • 24. Benchmarking Benchmarks Micro Benchmarks Hadoop: Available tests hadoop jar /some/path/to/hadoop-*test*.jar 22 / 62
  • 25. Benchmarking Benchmarks Micro Benchmarks TestDFSIO Read and write test for HDFS. Helpful for getting an idea of how fast your cluster is in terms of I/O, stress testing HDFS, discovering network performance bottlenecks, and shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes). 23 / 62
  • 26. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test Generate 10 files of size 1 GB for a total of 10 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 TestDFSIO is designed to use 1 map task per file (1:1 mapping from files to map tasks) 24 / 62
  • 30. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test output Typical output of write test:
      ----- TestDFSIO ----- : write
      Date & time: Mon Oct 06 10:21:28 CEST 2014
      Number of files: 10
      Total MBytes processed: 10000.0
      Throughput mb/sec: 12.874702111579893
      Average IO rate mb/sec: 13.013071060180664
      IO rate std deviation: 1.4416050051562712
      Test exec time sec: 114.346
      25 / 62
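When several runs have to be compared, it can help to turn such a report into numbers. The following is a minimal sketch, assuming the report is available as text in the "key: value" layout shown above; the function name parse_testdfsio is an assumption for this example, not part of Hadoop.

```python
SAMPLE = """\
----- TestDFSIO ----- : write
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(text):
    """Split each 'key: value' line at the first colon.

    Values that look like numbers are converted to float; everything
    else (test type, date) is kept as a string.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        key, value = key.strip(), value.strip()
        try:
            fields[key] = float(value)
        except ValueError:
            fields[key] = value
    return fields


if __name__ == "__main__":
    report = parse_testdfsio(SAMPLE)
    # Throughput is total data divided by the summed per-task I/O time,
    # so the summed task time can be recovered from the two fields.
    summed_task_time = report["Total MBytes processed"] / report["Throughput mb/sec"]
    print(report["----- TestDFSIO -----"], round(summed_task_time, 1), "s summed task time")
```

The recovered task time is summed over all map tasks, which is why it can exceed the wall-clock "Test exec time" when tasks run in parallel.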
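  • 31. Benchmarking Benchmarks Micro Benchmarks Interpreting TestDFSIO results
      Definition (Throughput): $\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathit{filesize}_i}{\sum_{i=1}^{N} \mathit{time}_i}$
      Definition (Average IO rate): $\mathrm{Average\ IO\ rate}(N) = \frac{\sum_{i=1}^{N} \mathit{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathit{filesize}_i}{\mathit{time}_i}}{N}$
      Here, N is the number of map tasks. 26 / 62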
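The difference between the two definitions is easiest to see on a small example. The sketch below applies both formulas to per-task file sizes and I/O times; the three tasks and their timings are hypothetical values chosen for illustration, not measurements.

```python
def throughput(sizes_mb, times_s):
    """Total data divided by total task time (MB/s)."""
    return sum(sizes_mb) / sum(times_s)


def average_io_rate(sizes_mb, times_s):
    """Mean of the per-task rates filesize_i / time_i (MB/s)."""
    rates = [size / time for size, time in zip(sizes_mb, times_s)]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical run: three map tasks, each writing a 1000 MB file.
    sizes = [1000.0, 1000.0, 1000.0]
    times = [50.0, 100.0, 200.0]
    print("Throughput:     ", throughput(sizes, times))
    print("Average IO rate:", average_io_rate(sizes, times))
```

With equal file sizes, the throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so the average IO rate is never lower than the throughput. A large gap between the two indicates that some tasks were much slower than others.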
  • 36. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test Read 10 input files, each of size 1 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 27 / 62
  • 38. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test output Typical output of read test:
      ----- TestDFSIO ----- : read
      Date & time: Mon Oct 06 10:56:15 CEST 2014
      Number of files: 10
      Total MBytes processed: 10000.0
      Throughput mb/sec: 402.4306813151435
      Average IO rate mb/sec: 492.8257751464844
      IO rate std deviation: 196.51233829270575
      Test exec time sec: 33.206
      28 / 62
  • 39. Benchmarking Benchmarks Micro Benchmarks Influence of HDFS replication factor When interpreting TestDFSIO results, keep in mind: the HDFS replication factor plays an important role! A higher replication factor leads to slower writes. For three identical TestDFSIO write runs (units are MB/s):
      HDFS replication factor    1           2         3
      Throughput                 190         25        13
      Average IO rate            190 ± 10    25 ± 3    13 ± 1
      29 / 62
  • 40. Benchmarking Benchmarks Micro Benchmarks TeraSort Goal: sort 1 TB of data (or any other amount of data) as fast as possible. Probably the best-known Hadoop benchmark. Combines testing the HDFS and MapReduce layers of a Hadoop cluster. Typical areas where TeraSort is helpful: Iron out your Hadoop configuration after your cluster has first passed a convincing TestDFSIO benchmark. Determine whether your MapReduce-related parameters are set to proper values. 30 / 62
  • 43. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow TeraGen → /user/bart/terasort-input → TeraSort → /user/bart/terasort-output → TeraValidate → /user/bart/terasort-validate 31 / 62
  • 46. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow
      $ hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input
        → 4 hours on our 4-node cluster
      $ hadoop jar hadoop-mapreduce-examples.jar terasort /user/bart/input /user/bart/output
        → 5 hours on our 4-node cluster
      $ hadoop jar hadoop-mapreduce-examples.jar teravalidate /user/bart/output /user/bart/validate
        If something went wrong, TeraValidate's output contains the problem report.
      34 / 62
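The numeric argument to teragen is a row count, not a byte count. TeraGen writes fixed-size 100-byte rows, so 10000000000 rows correspond to 1 TB. The sketch below computes the row count for other data sizes and assembles the command line; the helper names are assumptions for this example, and the jar name and paths are simply the ones used on the slide.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of rows TeraGen must generate for the requested data size."""
    return target_bytes // ROW_BYTES


def teragen_command(target_bytes, output_dir, jar="hadoop-mapreduce-examples.jar"):
    """Assemble the teragen command line as a string (it is not executed here)."""
    return f"hadoop jar {jar} teragen {teragen_rows(target_bytes)} {output_dir}"


if __name__ == "__main__":
    one_tb = 10**12
    print(teragen_command(one_tb, "/user/bart/input"))
    # A smaller 10 GB run, e.g. for a first smoke test of the cluster.
    print(teragen_command(10 * 10**9, "/user/bart/input-10gb"))
```

Starting with a smaller data size is a cheap way to check the whole TeraGen, TeraSort, TeraValidate chain before committing to a multi-hour 1 TB run.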
  • 47. Benchmarking Benchmarks Micro Benchmarks TeraSort: duration 35 / 62
  • 48. Benchmarking Benchmarks Micro Benchmarks TeraSort: counters 36 / 62
  • 49. Benchmarking Benchmarks Micro Benchmarks NNBench Goal: load test the NameNode hardware and software. Generates a lot of HDFS-related requests with normally very small payloads. Purpose: put a high HDFS management stress on the NameNode. Can simulate requests for creating, reading, renaming and deleting files on HDFS. 37 / 62
  • 51. Benchmarking Benchmarks Micro Benchmarks NNBench: example Create 1000 files using 12 maps and 6 reducers: $ hadoop jar hadoop-*test*.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /user/bart/NNBench-`hostname -s` 38 / 62
  • 53. Benchmarking Benchmarks Micro Benchmarks MRBench Goal: loop a small job a number of times. Checks whether small job runs are responsive and running efficiently on the cluster; complementary to TeraSort; puts its focus on the MapReduce layer; impact on the HDFS layer is very limited. 39 / 62
  • 55. Benchmarking Benchmarks Micro Benchmarks MRBench: example Run a loop of 50 small test jobs: $ hadoop jar hadoop-*test*.jar mrbench -baseDir /user/bart/MRBench -numRuns 50 Example output:
      DataLines    Maps    Reduces    AvgTime (milliseconds)
      1            2       1          28822
      → the average finish time of the executed jobs was about 28.8 seconds. 41 / 62
  • 57. Benchmarking Benchmarks BigBench Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 42 / 62
  • 58. Benchmarking Benchmarks BigBench BigBench Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 43 / 62
  • 59. Benchmarking Benchmarks BigBench BigBench Big Data benchmark based on TPC-DS. Focus is mostly on MapReduce engines. Collaboration between industry and academia. https://github.com/intel-hadoop/Big-Bench/ History Launched at First Workshop on Big Data Benchmarking (May 8-9, 2012). Full kit at Fifth Workshop on Big Data Benchmarking (August 5-6, 2014). 44 / 62
  • 60. Benchmarking Benchmarks BigBench BigBench data model Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 45 / 62
  • 61. Benchmarking Benchmarks BigBench BigBench: Data Model - 3 V's Variety: BigBench data is structured, semi-structured and unstructured. Velocity: periodic refreshes for all data. Different velocity for different areas: V_structured, V_unstructured, V_semistructured. Volume: TPC-DS uses discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000). BigBench uses a continuous scale factor. 46 / 62
  • 62. Benchmarking Benchmarks BigBench BigBench: Workload Workload queries: 30 queries. Specified in English (sort of). No required syntax (first implementation in Aster SQL MR). Kit implemented in Hive, Hadoop MR, Mahout, OpenNLP. Business functions (McKinsey): Marketing, Merchandising, Operations, Supply chain, Reporting (customers and products). 47 / 62
  • 65. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects
      Data sources           Number of queries    Percentage
      Structured             18                   60 %
      Semi-structured        7                    23 %
      Unstructured           5                    17 %

      Analytic techniques    Number of queries    Percentage
      Statistics analysis    6                    20 %
      Data mining            17                   57 %
      Reporting              8                    27 %
      48 / 62
  • 67. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects
      Query types    Number of queries    Percentage
      Pure HiveQL    14                   46 %
      Mahout         5                    17 %
      OpenNLP        5                    17 %
      Custom MR      6                    20 %
      Note that your implementation may vary! 50 / 62
  • 68. Benchmarking Benchmarks BigBench BigBench: Benchmark Process Source: http://www.tele-task.de/archive/video/flash/24896/ 51 / 62
  • 69. Benchmarking Benchmarks BigBench BigBench: Metric Number of queries run: $30 \cdot (2S + 1)$. Measured times: $T_L$: loading process; $T_P$: power test; $T_{TT1}$: first throughput test; $T_{TDM}$: data maintenance task; $T_{TT2}$: second throughput test.
      Definition (BigBench queries per hour): $\mathrm{BBQpH} = \frac{30 \cdot 3 \cdot S \cdot 3600}{S \cdot T_L + S \cdot T_P + T_{TT1} + S \cdot T_{TDM} + T_{TT2}}$
      Similar to the TPC-DS metric. 52 / 62
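The BBQpH definition above can be evaluated directly once the five phase times are known. The following sketch assumes that S is the number of concurrent query streams in each throughput test (one power run of 30 queries plus two throughput tests of S streams gives 30 (2 S + 1) queries) and that all times are in seconds; the timings in the usage example are hypothetical, not measured results.

```python
def total_queries(streams):
    """Queries executed in one full run: 30 * (2 * S + 1)."""
    return 30 * (2 * streams + 1)


def bbqph(streams, t_load, t_power, t_tt1, t_dm, t_tt2):
    """BigBench queries per hour, following the slide's definition.

    All times are in seconds; `streams` is S in the formula.
    """
    numerator = 30 * 3 * streams * 3600
    denominator = (streams * t_load + streams * t_power + t_tt1
                   + streams * t_dm + t_tt2)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical timings for a run with S = 2 streams.
    score = bbqph(streams=2, t_load=1000, t_power=2000,
                  t_tt1=3000, t_dm=500, t_tt2=3000)
    print(total_queries(2), "queries,", round(score, 2), "BBQpH")
```

Because the load, power and data maintenance times are multiplied by S in the denominator, slow loading or maintenance lowers the score more as the number of streams grows.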
  • 72. Benchmarking Benchmarks BigBench BigBench: results 53 / 62
  • 73. Benchmarking Benchmarks BigBench BigBench: monitoring 54 / 62
  • 74. Benchmarking Benchmarks BigBench BigBench: monitoring 55 / 62
  • 75. Benchmarking Benchmarks BigBench BigBench: monitoring 56 / 62
  • 76. Benchmarking Benchmarks BigBench BigBench: monitoring 57 / 62
  • 77. Benchmarking Benchmarks BigBench BigBench: in progress 58 / 62 Source: The Hortonworks Blog
  • 78. Benchmarking Conclusions Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 59 / 62
  • 80. Benchmarking Conclusions Conclusions Use Hadoop distributions! Hadoop cluster administration → Cloudera Manager. Micro-benchmarks ↔ BigBench. Your best benchmark is your own application! 61 / 62
  • 81. Benchmarking Conclusions Questions? Source: https://gigaom.com/2011/12/19/my-hadoop-is-bigger-than-yours/ 62 / 62