3. What is BDOOP about?
● A group to share knowledge on Big Data
  • Scalability
  • Performance
  • Configurations
  • Cluster design
  • Benchmarking
  • …and a couple of beers!
● With sysadmins in mind
  • Also POs
● Not a group to learn
  • Java
  • MapReduce programming
  • Hadoop base concepts
4. BDOOP Group Objectives
● Create a local community to:
  • Learn Big Data performance and scalability
  • Share day-to-day problems and solutions
  • Present your work and findings
  • Have talks from renowned experts
● > Your objective here <
6. Hadoop design
● Hadoop is designed to process complex data
  • Structured and unstructured
  • With [close to] linear scalability
● Simplifies the programming model
  • Compared to MPI, OpenMP, CUDA, …
● Operates as a black box for data analysts
Image source: Hadoop, The Definitive Guide
7. Hadoop attributes
● Fault tolerant
  • Even on commodity hardware
● Built-in redundancy
  • via replication
● Scales out/down automatically
  • With [almost] linear scalability
● Moves computation to the data
  • to minimize communication
● Shared-nothing architecture
8. Hadoop is highly scalable but…
● Not a high-performance solution out of the box!
● Requires
  • Design: cluster sizing and topology
  • Setup: OS and Hadoop configuration
  • Tuning: an iterative, time-consuming approach
● And extensive benchmarking!
9. Hadoop parameters
● 100+ tunable parameters
  • e.g. mapred.map/reduce.tasks.speculative.execution
  • Often obscure and interrelated
● Defaults, with tuned values in parentheses:
  • io.sort.mb: 100 (300)
  • io.sort.record.percent: 5% (15%)
  • io.sort.spill.percent: 80% (95–100%)
● Number of mappers and reducers
  • Rule of thumb: 0.5–2 per CPU core
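The "0.5–2 tasks per CPU core" rule of thumb can be sketched as a small helper. The function name and bounds check are illustrative assumptions, not part of any Hadoop API:

```python
# Hypothetical helper illustrating the "0.5-2 tasks per core" rule of thumb
# from the slide; not a Hadoop API, just back-of-the-envelope sizing.

def suggest_task_slots(nodes, cores_per_node, factor=1.0):
    """Suggest total map/reduce task slots for a cluster.

    factor: tasks per core, typically between 0.5 and 2.0.
    """
    if not 0.5 <= factor <= 2.0:
        raise ValueError("factor outside the 0.5-2 rule-of-thumb range")
    total_cores = nodes * cores_per_node
    return max(1, round(total_cores * factor))

# Example: 8 nodes with 16 cores each
print(suggest_task_slots(8, 16))        # 128 (1 task per core)
print(suggest_task_slots(8, 16, 0.5))   # 64  (0.5 tasks per core)
```

As the deck stresses, such a starting point still needs per-workload benchmarking before it can be trusted.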
10. Hadoop ecosystem
● Large and spread out
● Dominated by big players
  • Custom patches
  • Default values not ideal
  • Product claims
● Cloud vs. on-premise
  • IaaS
  • PaaS (EMR, HDInsight)
● Needs standardization and auditing!
12. Workload (jobs)
● All jobs are different!
● Different requirements
  • CPU bound
  • Memory bound
  • I/O bound
  • …or a bit of all
● Different tuning for each
● Needs benchmarking!
● Sample mappers and reducers for 3 popular benchmarks:
  • Terasort
  • K-means
  • Wordcount
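To make the mapper/reducer structure of such workloads concrete, here is a minimal WordCount in the MapReduce style. This is plain Python for illustration, not tied to the Hadoop API:

```python
# Minimal WordCount sketch in the MapReduce style: map emits (word, 1),
# a "shuffle" groups by key, and reduce sums the counts per word.

from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum all counts for one word.
    return (word, sum(counts))

def run(lines):
    # Shuffle phase: sort intermediate pairs so equal keys are adjacent.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in groupby(pairs, key=lambda kv: kv[0])
    )

print(run(["to be or not", "to be"]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

WordCount is CPU-light and shuffle-heavy on real clusters, which is exactly why, as the slide says, each benchmark needs its own tuning.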
13. One for all config?
Vertical line:Average performance forthisworkloadacrossconfigurations
Valuesto the right: above average
Valuesto the left: below average
Is there one software configurationiterationthat fits everybody?
Configurations
Good for Terasort but
bad for Wordcount
Good for Terasort but
bad for Wordcount
Good for Wordcount but
very bad for Terasort
14. Example of SSD impact on execution time
● Impact of SSDs on the running time of Terasort
[Chart: execution time across configurations, SSD vs. SATA HDD]
15. Too many choices?
● And where is my system configuration positioned on each of these axes?
  • Cost (−) vs. performance (+)
  • On-premise vs. cloud
  • Small VMs vs. large VMs
  • Rotational HDDs / JBODs / RAID vs. remote volumes
  • Gb Ethernet vs. InfiniBand
  • Replication and high availability
17. Why benchmark?
● Validate assumptions
● Reproduce bad behavior
  • Debugging
● Measure performance and scale
● Simulate higher load
  • Find bottlenecks / limits
● Plan for growth
● Test different SW and HW
Source: based on the benchmarking chapter of High Performance MySQL
19. Big Data Vs
● Volume
● Velocity
● Variety
  • Structured, semi-structured, and unstructured data
  • Different types of data (genres)
● Veracity
● Value
● Sample scale factors from TPCx-HS
20. Data generation
● Real vs. synthetic data
● Random vs. repeatable data
● Data generation time
  • Parallel generation
● Data distribution
  • Flat or uniformly distributed
  • Gaussian (normal distribution, skew)
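The distinctions above can be sketched in a few lines: a fixed seed makes synthetic data repeatable across runs, and the distribution parameter switches between a flat and a skewed dataset. The function name and parameter values are illustrative:

```python
# Repeatable synthetic data generation: seeded RNG -> same data every run,
# contrasting a flat (uniform) with a skewed (Gaussian) distribution.

import random

def generate(n, distribution="uniform", seed=42):
    rng = random.Random(seed)          # fixed seed -> repeatable runs
    if distribution == "uniform":
        return [rng.uniform(0, 100) for _ in range(n)]
    if distribution == "gaussian":
        return [rng.gauss(50, 10) for _ in range(n)]
    raise ValueError(f"unknown distribution: {distribution}")

# Same seed, same data -> benchmark runs stay comparable
assert generate(1000) == generate(1000)
```

Repeatability matters for benchmarking: if the data changes between runs, you cannot tell configuration effects from data effects.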
21. Issues benchmarking Big Data
● Big scale
  • Single node vs. multiple nodes
  • 10 MB vs. 10 TB
  • Bare-metal vs. virtualized vs. cloud
● Non-determinism / randomness
  • Need to average multiple runs
● How long to benchmark
  • System warm-up
● Distributed systems
  • Failures?
23. TPC vs. SPEC models
● TPC model
  • Specification based
  • Performance, price, and energy in one benchmark
  • End-to-end
  • Multiple tests (ACID, load)
  • Independent review
  • Full disclosure
  • TPC Technology Conference
● SPEC model
  • Kit based
  • Performance and energy in separate benchmarks
  • Server-centric
  • Single test
  • Peer review
  • Summary disclosure
  • SPEC Research Group, ICPE
Source: from a presentation by Meikel Poess, 1st WBDB, May 2012
24. Data Benchmarks
● First there was TPC-H
  • Classical SQL OLAP benchmark
  • MRBench for M/R, on top of Hive or Impala for Hadoop
● Then sorting
  • Terasort: unofficial standard, now part of TPCx-HS
● Hadoop samples
  • Wordcount, grep, terasort, DFSIO
● YCSB
  • From Yahoo!; for NoSQL, HBase implementation
● GridMix, CALDA, HiBench, SWIM
● BigBench
  • Based on TPC-DS + ML, 30 queries
● BigDataBench
  • 33 workloads
● TPCx-HS
26. What to measure and metrics
● Job execution time
● Throughput
  • Units / time
● Framework overhead
  • e.g. # of spills
● Scalability
● Concurrency
● Abstract metrics
  • CPU
  • MEM
  • DISK: IOPS, latency, bandwidth
  • NET: latency, bandwidth
● TPCx-HS performance metric (HSph@SF)
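Two of the metrics above can be sketched directly. The HSph@SF formula below follows the published TPCx-HS definition (scale factor per elapsed hour); treat it as illustrative, with the TPCx-HS specification being authoritative:

```python
# Sketch of two metrics: generic throughput (units/time) and a
# TPCx-HS style HSph@SF (scale factor divided by elapsed hours).

def throughput(units, seconds):
    """Generic throughput metric: units processed per second."""
    return units / seconds

def hsph_at_sf(scale_factor, elapsed_seconds):
    """TPCx-HS style performance metric: SF per elapsed hour."""
    return scale_factor / (elapsed_seconds / 3600.0)

print(throughput(10_000, 250))   # 40.0 records/s
print(hsph_at_sf(1, 1800))       # 2.0 -> SF 1 completed in half an hour
```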
28. Project ALOJA online repository
Entry point for explore the results collected from the
executions,
Provides insights on the obtained results through
continuouslyevolving data views.
Online results at: http://hadoop.bsc.es
29. ALOJA Platform: evolution and status
● Benchmarking, repository, and analytics tools for Big Data
● Composed of open-source
  • Benchmarking, provisioning, and orchestration tools
  • High-level system performance metric collection
  • Low-level Hadoop instrumentation based on BSC Tools
  • Web-based data analytics tools and recommendations
● Online Big Data benchmark repository of:
  • 20,000+ runs (from HiBench)
  • Sharable, comparable, repeatable, verifiable executions
● Abstracting and leveraging tools for BD benchmarking
  • Not reinventing the wheel, but most current BD tools are designed for production, not for benchmarking
  • Leverages current compatible tools and projects
● Dev VM toolset and sandbox
  • via Vagrant
30. Workflow in ALOJA
Cluster(s)
definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution
plan
• Start cluster
• Exec Benchmarks
• Gather results
• Cleanup
Import
data
• Convert perf metric
• Parse logs
• Import into DB
Evaluate
data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD
•Predictive
Analytics
•Knowledge
Discovery
Historic
Repo
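The workflow stages above can be sketched as a toy orchestration loop. Every function body here is a hypothetical stand-in, not the actual ALOJA tooling:

```python
# Toy sketch of an ALOJA-style workflow: define clusters, run benchmarks,
# import results into a repository, clean up. All stubs are hypothetical.

def start_cluster(spec):
    # Stand-in for provisioning/orchestration.
    return {"spec": spec, "running": True}

def run_benchmark(cluster, name):
    # Stand-in for executing one benchmark and gathering its results.
    return {"cluster": cluster["spec"]["name"], "bench": name, "seconds": 120.0}

def import_result(db, result):
    # Stand-in for "parse logs, import into DB".
    db.append(result)

def workflow(cluster_specs, benchmarks):
    db = []                              # stand-in for the historic repo
    for spec in cluster_specs:
        cluster = start_cluster(spec)
        for bench in benchmarks:
            import_result(db, run_benchmark(cluster, bench))
        cluster["running"] = False       # cleanup
    return db

results = workflow([{"name": "small-vms"}], ["terasort", "wordcount"])
print(len(results))   # 2
```

The key design point the slide conveys is the separation of stages: execution and analysis are decoupled through the repository, so results stay comparable across clusters and over time.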
31. Benchmark execution comparisons
● You can compare, side by side, all execution parameters:
  • CPU, memory, network, disk, Hadoop parameters…
● Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
32. HiBench: A Benchmark Suite for Hadoop
● A comprehensive & realistic benchmark suite
  • Micro benchmarks: Sort, WordCount, TeraSort
  • HDFS: Enhanced DFSIO
  • Web search: Nutch Indexing, PageRank
  • Machine learning: Bayesian Classification, K-Means Clustering
● Code at: https://github.com/intel-hadoop/HiBench
35. Impact of SW configurations on speedup
● Chart: speedup (higher is better) when varying
  • Number of mappers: 4m, 6m, 8m, 10m
  • Compression algorithm: no compression, ZLIB, BZIP2, Snappy
● Results using: http://hadoop.bsc.es/configimprovement
● Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
36. Impact of HW configurations on speedup
● Chart: speedup (higher is better) when varying
  • Disks and network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB
  • Cloud remote volumes: local only; 1, 2, or 3 remote volumes, with or without /tmp on local disk
● Results using: http://hadoop.bsc.es/configimprovement
● Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
39. Common benchmarking pitfalls
● Scalability
  • Assuming [near-]linear scalability
● Compare apples to apples
  • If benchmarking HW, change HW but leave SW the same
  • Terasort in v1 != Terasort in v2
● Test for Big Data: use large data
  • Stress the system
● If results are too good to be true, they probably aren't
  • Don't believe in miracles
  • Expect vendor lies
Source: adapted from "Benchmarking Big Data Systems" by Yanpei Chen and Gwen Shapira at Big Data Spain
40. Resources
● ALOJA benchmarking platform and online repository
  • http://hadoop.bsc.es/
● Big Data Benchmarking Community (BDBC) mailing list
  • ~200 members from ~80 organizations
  • http://clds.sdsc.edu/bdbc/community
● Workshop on Big Data Benchmarking (WBDB)
  • Next: http://clds.sdsc.edu/wbdb2015.ca
● SPEC Research Big Data working group
  • http://research.spec.org/working-groups/big-data-working-group.html
● Slides and video:
  • Michael Frank on Big Data benchmarking: http://www.tele-task.de/archive/podcast/20430/
  • Tilmann Rabl, Big Data benchmarking tutorial: http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl