@BDOOP_BCN
Benchmarking Hadoop
by Nicolas Poggi @ni_po
June 2, 2015
About Nicolas Poggi @ni_po
What is BDOOP about?
● A group to share on Data
● Scalability
● Performance
● Configurations
● Cluster design
● Benchmarking
● …a/couple of beer/s!
● Having sysadmins in mind
● Also POs
● Not a group to learn
  • Java
  • MapReduce programming
  • Hadoop base concepts
BDOOP Group Objectives
● Create a local community to
● Learn Big Data
● performance and scalability
● Share
● day-to-day problems and solutions
● Present your work and findings
● Have talks from renowned experts
● > Your objective here <
Benchmarking Motivation and Intro
Hadoop design
 Hadoop is designed to process complex data
 Structured and unstructured
 With [close to] linear scalability
 Simplifying the programming model
 From MPI, OpenMP, CUDA, …
 Operates as a black box for data analysts
Image source: Hadoop: The Definitive Guide
Hadoop attributes
 Fault tolerant
 from commodity hardware
 Built-in redundancy
 via replication
 Automatically scales out / down
 With [almost] linear scalability
 Moves computation to data
 minimizes communication
 Shared-nothing architecture
Hadoop highly-scalable but…
 Not a high-performance solution!
 Requires
 Design
 Cluster topology
 Setup
 OS and Hadoop config
 and tuning
 Iterative approach
 Time consuming
 And extensive benchmarking!
Hadoop parameters
 100+ tunable parameters
 mapred.map/reduce.tasks.speculative.execution
 obscure and interrelated
 io.sort.mb 100 (300)
 io.sort.record.percent 5% (15%)
 io.sort.spill.percent 80% (95 – 100%)
 Number of Mappers and Reducers
 Rule of thumb 0.5 - 2 per CPU core
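The 0.5–2 tasks-per-core rule of thumb can be turned into a quick sizing sketch (a hypothetical helper for illustration, not part of any Hadoop API):

```python
def task_slots(nodes, cores_per_node, tasks_per_core=1.0):
    """Estimate total map (or reduce) task slots for a cluster,
    following the 0.5-2 tasks per CPU core rule of thumb."""
    if not 0.5 <= tasks_per_core <= 2.0:
        raise ValueError("rule-of-thumb range is 0.5-2 tasks per core")
    return int(nodes * cores_per_node * tasks_per_core)

# e.g. 10 nodes x 8 cores, 1.5 map tasks per core:
print(task_slots(10, 8, 1.5))  # -> 120
```

The right multiplier within that range still depends on the workload: I/O-bound jobs tolerate more tasks per core than CPU-bound ones.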
Hadoop ecosystem
 Large and spread out
 Dominated by big players
 Custom patches
 Default values not ideal
 Product claims
 Cloud vs. On-premise
 IaaS
 PaaS
 EMR, HDInsight
 Needs standardization and auditing!
Workload (jobs)
 All jobs are different!
 Different requirements
 CPU bound
 Memory bound
 I/O bound
 … a bit of all
 Different tuning for each
 Needs benchmarking!
Figure: sample mappers and reducers for 3 popular benchmarks (Terasort, K-means, Wordcount)
One config for all?
Is there one software configuration that fits everybody?
Figure: performance of each workload across configurations. The vertical line marks the average performance for that workload across configurations; values to the right are above average, values to the left below average. Some configurations are good for Terasort but bad for Wordcount; another is good for Wordcount but very bad for Terasort.
Example of SSD impact on execution time
 Impact of SSDs on the running time of Terasort
Figure: Terasort execution time across configurations, SSDs vs. rotational (SATA) HDDs
Too many choices?
Figure: configuration options placed on cost and performance axes, spanning on-premise and cloud: remote volumes, rotational HDDs, JBODs, RAID, large and small VMs, Gb Ethernet, InfiniBand, replication, high availability
And where is my system configuration positioned on each of these axes?
Benchmarks
Why benchmark?
 Validate assumptions
 Reproduce bad behavior
 Debugging
 Measure performance and scale
 Simulate higher load
 Find bottlenecks / limits
 Plan for growth
 Test different
 SW and HW
Source: Based on High Performance MySQL, benchmarking MySQL chapter
Benchmarking stakeholders and use cases
 End-user / consumer
 Compare products
 Developer
 Profiling
 CI / QA
 Sysadmin / architect
 Cluster sizing
 SW and HW vendors
 Product claims
 Marketing
 Researcher
 …
Big Data V's
 Volume
 Velocity
 Variety
 Structured, semi, unstructured data
 Different types of data (genres)
 Veracity
 Value
Sample scale factor from TPCx-HS
Data generation
 Real vs. synthetic
 Random vs. repeatable data
 Data generation time
 Parallel generation
 Data distribution
 Flat or uniformly distributed
 Gaussian (normal distribution, skew)
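A minimal sketch of repeatable synthetic key generation with both distributions (plain Python with a fixed seed for repeatability; the function name and defaults are hypothetical):

```python
import random

def gen_keys(n, dist="flat", key_space=1000, seed=42):
    """Generate n synthetic integer keys, repeatably (fixed seed).
    'flat' draws uniformly over the key space; 'gaussian' clusters
    keys around the middle, modelling skew (hot keys)."""
    rng = random.Random(seed)  # fixed seed -> repeatable data sets
    if dist == "flat":
        return [rng.randrange(key_space) for _ in range(n)]
    if dist == "gaussian":
        mid, sigma = key_space / 2, key_space / 10
        return [min(key_space - 1, max(0, int(rng.gauss(mid, sigma))))
                for _ in range(n)]
    raise ValueError("unknown distribution: " + dist)

flat = gen_keys(10_000, "flat")
skew = gen_keys(10_000, "gaussian")
# Skewed keys pile up in a few partitions -> straggler reducers.
```

The skewed variant matters for benchmarking shuffles: a Gaussian key distribution concentrates work on a few reducers, which flat data hides.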
Issues Benchmarking Big Data
 Big Scale
 Single node vs Multiple nodes
 10MB vs 10TB
 Bare-metal vs. virtualized vs. cloud
 Non-determinism / randomness
 Need to average multiple runs
 How long to benchmark?
 System warm-up
 Distributed systems
 Failures?
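Averaging multiple runs and handling warm-up can be sketched with the standard library (dropping the first run as warm-up is just one common convention, not a rule from this deck):

```python
from statistics import mean, stdev

def summarize_runs(times_s):
    """Mean and spread of repeated benchmark runs, dropping the
    first run as warm-up (cold caches, JIT) when enough runs exist."""
    steady = times_s[1:] if len(times_s) > 2 else list(times_s)
    return {"runs": len(steady),
            "mean_s": mean(steady),
            "stdev_s": stdev(steady) if len(steady) > 1 else 0.0}

# Five Terasort runs (seconds); the first is a cold-cache outlier:
print(summarize_runs([310, 245, 250, 248, 252]))
```

Reporting the spread alongside the mean is what makes non-deterministic, distributed results comparable at all.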
Types of benchmarks and Standards
 Micro benchmarks
 DFSIO (HDFS I/O)
 Functional
 Terasort, ETL
 Genre-specific
 Graph 500
 Application level
 BigBench
 TPC (implementation) vs SPEC (reference)
TPC vs. SPEC models
TPC (specification based):
 Performance, price, and energy in one benchmark
 End-to-end
 Multiple tests (ACID, load)
 Independent review
 Full disclosure
 TPC Technology Conference
SPEC (kit based):
 Performance and energy in separate benchmarks
 Server-centric
 Single test
 Peer review
 Summary disclosure
 SPEC Research Group, ICPE
Source: from a presentation by Meikel Poess, 1st WBDB, May 2012
Data Benchmarks
From classical SQL OLAP DB to Big Data
 First there was TPC-H
 Classical SQL OLAP
benchmark
 MRBench for M/R
 On top of Hive or Impala
for Hadoop
 Then sorting
 Terasort
 Unofficial standard
 Now part of TPCx-HS
 Hadoop samples
 Wordcount, grep, terasort, DFSIO
 YCSB
 From Yahoo!
 For NoSQL, HBASE implementation
 GridMix
 CALDA
 HiBench
 SWIM
 BigBench
 based on TPC-DS + ML
 30 queries
 BigDataBench
 33 workloads
 TPCx-HS
Comparison of popular Hadoop benchmarks

|                | Spec[1] | App domains | Workload types | Workloads       | Scalable data sets[2] | Diverse implem.[3] | Multi-tenancy[4] | Subset[5] | Simulator[6] |
|----------------|---------|-------------|----------------|-----------------|-----------------------|--------------------|------------------|-----------|--------------|
| BigDataBench   | Y       | Five        | Four[7]        | Thirty-three[8] | Eight[9]              | Y                  | Y                | Y         | Y            |
| BigBench       | Y       | One         | Three          | Ten             | Three                 | N                  | N                | N         | N            |
| CloudSuite     | N       | N/A         | Two            | Eight           | Three                 | N                  | N                | N         | Y            |
| HiBench        | N       | N/A         | Two            | Ten             | Three                 | N                  | N                | N         | N            |
| CALDA          | Y       | N/A         | One            | Five            | N/A                   | Y                  | N                | N         | N            |
| YCSB           | Y       | N/A         | One            | Six             | N/A                   | Y                  | N                | N         | N            |
| LinkBench      | Y       | N/A         | One            | Ten             | N/A                   | Y                  | N                | N         | N            |
| AMP Benchmarks | Y       | N/A         | One            | Four            | N/A                   | Y                  | N                | N         | N            |

The differences of BigDataBench from other benchmark suites.
Source: BigDataBench homepage
What to measure and metrics
 Job execution time
 Throughput
 Units / time
 Framework overhead
 # of spills
 Scalability
 Concurrency
 Abstract metrics
 CPU
 MEM
 DISK
 IOPS, latency, bandwidth
 NET
 Latency, bandwidth
 TPCx-HS performance metric (HSph@SF)
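The TPCx-HS metric relates the scale factor to total elapsed time; a simplified sketch of that idea (the official metric is defined over a complete audited run, so treat this as an approximation):

```python
def hsph(scale_factor_tb, elapsed_s):
    """Simplified HSph@SF: scale factor (in TB) per elapsed hour,
    higher is better. See the TPCx-HS spec for the full rules."""
    return scale_factor_tb / (elapsed_s / 3600.0)

# A 1 TB (SF = 1) run finishing in 30 minutes:
print(hsph(1, 1800))  # -> 2.0
```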
Benchmarking
Project ALOJA online repository
 Entry point to explore the results collected from the executions
 Provides insights on the obtained results through continuously evolving data views
 Online results at: http://hadoop.bsc.es
ALOJA Platform: Evolution and status
 Benchmarking, Repository, and Analytics tools for Big Data
 Composed of open-source components:
 Benchmarking, provisioning and orchestration tools,
 high-level system performance metric collection,
 low-level Hadoop instrumentation based on BSC Tools
 and Web based data analytics tools
 And recommendations
 Online Big Data Benchmark repository of:
 20,000+ runs (from HiBench)
 Sharable, comparable, repeatable, verifiable executions
 Abstracting and leveraging tools for BD benchmarking
 Not reinventing the wheel, but
 most current BD tools are designed for production, not for benchmarking
 so it leverages current compatible tools and projects
 Dev VM toolset and sandbox
 via Vagrant
Diagram: Big Data Benchmarking, Online Repository, Analytics
Workflow in ALOJA
1. Cluster(s) definition: VM sizes, # of nodes, OS, disks, capabilities
2. Execution plan: start cluster, exec benchmarks, gather results, cleanup
3. Import data: convert perf metrics, parse logs, import into DB
4. Evaluate data: data views in the Vagrant VM or at http://hadoop.bsc.es
5. PA and KD: predictive analytics and knowledge discovery
All stages feed the historic repository.
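The execution-plan phase above can be sketched as a driver loop (hypothetical stand-in names; the real ALOJA tooling is shell- and Vagrant-based):

```python
def run_plan(cluster, benchmarks, runner):
    """Drive one execution plan: start the cluster, run every
    benchmark gathering its result, and always clean up at the end."""
    results = []
    cluster["up"] = True                             # start cluster
    try:
        for bench in benchmarks:
            results.append(runner(cluster, bench))   # exec + gather
    finally:
        cluster["up"] = False                        # cleanup
    return results  # next: parse logs, import into the DB, evaluate

# Toy runner standing in for a real benchmark execution:
cluster = {"nodes": 4, "up": False}
out = run_plan(cluster, ["terasort", "wordcount"],
               lambda c, b: {"bench": b, "nodes": c["nodes"]})
print(out)
```

The `finally` block mirrors the workflow's cleanup step: the cluster is torn down even when a benchmark run fails.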
Benchmark execution comparisons
 You can compare, side by side, all execution parameters:
 CPU, memory, network, disk, Hadoop parameters, …
Sample:
http://hadoop.bsc.es/perfcharts?execs[]=91144
HiBench: A Benchmark Suite for Hadoop
A comprehensive and realistic benchmark suite:
 Micro benchmarks: Sort, WordCount, TeraSort
 HDFS: Enhanced DFSIO
 Web search: Nutch Indexing, PageRank
 Machine learning: Bayesian Classification, K-Means Clustering
Code at: https://github.com/intel-hadoop/HiBench
Job resource requirements 1/2
Source: Intel HiBench
Job resource requirements 2/2
Source: Intel HiBench
Impact of SW configurations on speedup
Figure: speedup (higher is better) by number of mappers (4m, 6m, 8m, 10m) and by compression algorithm (no compression, ZLIB, BZIP2, snappy)
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
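Speedup in these comparisons is simply baseline time over configuration time; a tiny sketch with made-up numbers (not taken from the ALOJA results):

```python
def speedup(baseline_s, config_s):
    """Speedup of a configuration relative to the baseline
    (values > 1 mean the configuration is faster)."""
    return baseline_s / config_s

# Hypothetical Terasort times: no compression vs. snappy map output
base_s, snappy_s = 600.0, 400.0
print(f"snappy speedup: {speedup(base_s, snappy_s):.2f}x")  # -> 1.50x
```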
Impact of HW configurations on speedup
Figure: speedup (higher is better) for disks and network (HDD-ETH, HDD-IB, SSD-ETH, SSD-IB) and for cloud remote volumes (local only; 1, 2, or 3 remote volumes, with and without /tmp on local disk)
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Cost/Performance Scalability
 Terasort (100GB)
Sample from: http://hadoop.bsc.es/nodeseval
Figure: execution time and execution cost for: InfiniBand + SSD (local), GbE + SSD (local), InfiniBand + SATA disks (local), GbE + SATA disks (local), cloud (local disk for /tmp and HDFS), cloud (/tmp on local disk, HDFS on Blob storage, 1-3 devices), and cloud (/tmp and HDFS on Blob storage, 1-3 devices)
Cost-effectiveness (on-premise vs. cloud): price vs. performance
Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
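Execution cost in such comparisons is essentially node-hours consumed times the hourly price; a sketch with invented prices (not ALOJA's actual figures):

```python
def run_cost(elapsed_s, nodes, price_per_node_hour):
    """Cost of one run: node-hours consumed times the hourly price."""
    return elapsed_s / 3600.0 * nodes * price_per_node_hour

# Hypothetical: 8 fast on-premise nodes vs. 16 cheaper cloud VMs
onprem = run_cost(1200, 8, 1.20)   # faster run, pricier nodes
cloud = run_cost(2100, 16, 0.30)   # slower run, cheaper VMs
print(f"on-premise ${onprem:.2f} vs. cloud ${cloud:.2f}")
```

This is why the slower cluster can still win on cost: a cheap configuration only needs to be "fast enough" per dollar, which is exactly the trade-off the time-vs-cost charts expose.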
Common Benchmarking pitfalls
 Scalability
 Assuming near-linear scalability
 Compare apples to apples
 if benchmarking HW, change the HW but leave the SW the same
 Terasort in v1 != Terasort in v2
 To test for Big Data, use large data
 stress the system
 If results are too good to be true, they probably aren't
 Don't believe in miracles
 Expect vendor lies
Source: adapted from Benchmarking Big Data Systems by Yanpei Chen and Gwen Shapira at Big Data Spain
Resources
 ALOJA Benchmarking platform and online repository
 http://hadoop.bsc.es/
 Big Data Benchmarking Community (BDBC) mailing list
 (~200 members from ~80 organizations)
 http://clds.sdsc.edu/bdbc/community
 Workshop Big Data Benchmarking (WBDB)
 Next: http://clds.sdsc.edu/wbdb2015.ca
 SPEC Research Big Data working group
 http://research.spec.org/working-groups/big-data-working-group.html
 Slides and video:
 Michael Frank on Big Data benchmarking
 http://www.tele-task.de/archive/podcast/20430/
 Tilmann Rabl Big Data Benchmarking Tutorial
 http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl