Big Data Benchmarking

Big Data Benchmarks
Srinivasa Rao Aravilli
N Venkata Naga Ravi

Why ..
• Evaluating the effect of a hardware/software upgrade:
  • OS, Java VM, ...
  • Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
• Debugging:
  • Compare with other clusters or published results.
• Performance tuning

Industry-Standard Benchmarking Organizations
• TPC - Transaction Processing Performance Council (http://www.tpc.org/)
• SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/)
• CLDS - Center for Large-scale Data Systems Research (http://clds.sdsc.edu/bdbc)
• Top Outcomes
  • BigData Top100 - an end-to-end application-layer benchmark for big data applications
  • Terasort - a functional benchmark focusing on the sort function (quicksort using MapReduce)
  • HiBench
    • Sort, machine learning (K-means clustering, classification)

Types of Benchmark
• Micro-benchmarks: evaluate specific lower-level system operations.
  • E.g., "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters", Panda et al., OSU
• Functional / component benchmarks: evaluate a specific high-level function.
  • E.g., sorting: Terasort
  • E.g., basic SQL: individual SQL operations such as Select, Project, Join, Order-By, ...
• Application-level benchmarks:
  • Measure system performance (hardware and software) for a given application scenario, with given data and workload.

Terasort using Hadoop
The Terasort benchmark consists of 3 MapReduce applications:
• Teragen - generates the input data
• Terasort - samples the input data and uses the samples to partition and sort the data with MapReduce
• Teravalidate - validates that the output data is sorted
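
The three steps above can be sketched in a single process. This is a toy illustration, not the Hadoop implementation: the function names, the 10-byte random keys, the partition count, and the sample size are all assumptions chosen for the example.

```python
import bisect
import random


def teragen(num_records, seed=0):
    """Toy stand-in for Teragen: records are (10-byte random key, payload id)."""
    rng = random.Random(seed)
    return [
        (bytes(rng.getrandbits(8) for _ in range(10)), i)
        for i in range(num_records)
    ]


def terasort(records, num_partitions=4, sample_size=100, seed=1):
    """Toy stand-in for Terasort: sample keys, pick split points, partition, sort."""
    rng = random.Random(seed)
    sample = sorted(
        key for key, _ in rng.sample(records, min(sample_size, len(records)))
    )
    # Choose num_partitions - 1 split points from the sorted sample.
    splits = [sample[len(sample) * i // num_partitions]
              for i in range(1, num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Route each record to the key range it falls into.
        partitions[bisect.bisect_right(splits, record[0])].append(record)
    # Sort each partition independently; read in order, they are globally sorted.
    return [sorted(p) for p in partitions]


def teravalidate(partitions):
    """Toy stand-in for Teravalidate: check keys are non-decreasing end to end."""
    keys = [key for p in partitions for key, _ in p]
    return all(a <= b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    parts = terasort(teragen(1000))
    print("sorted:", teravalidate(parts))
```

Sampling matters because the split points decide how evenly records spread across partitions; a poor sample leaves some sorters with far more data than others.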

MapReduce Model
Closer look at MapReduce's implementation model
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
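
As a rough illustration of the model, the map, shuffle, and reduce phases can be mimicked in one process. This is a minimal sketch under stated assumptions: the function names and the word-count job are hypothetical, and none of Hadoop's distribution, fault tolerance, or I/O is represented.

```python
from collections import defaultdict


def map_phase(inputs, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in inputs:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group values by key and order the keys, as the shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Hypothetical reducer: total the counts for one word."""
    return sum(values)


def run_job(lines):
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle_phase(pairs), sum_reducer)


if __name__ == "__main__":
    print(run_job(["big data", "Big benchmark"]))
```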

Benchmarking Suites
• HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
• YCSB - Yahoo! Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/)
• Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/)
• BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/)
• GridMix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
• BigBench (https://github.com/intel-hadoop/Big-Bench)
• TPCx-HS (http://www.tpc.org/tpcx-hs/)

TPCx-HS benchmarks
X: Express, H: Hadoop, S: Sort
• TPCx-HS was developed to provide an objective measure of hardware, operating system, and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance, and availability metrics.
• http://www.tpc.org/tpcx-hs/

TPCx-HS benchmarks
Scale Factor
TPCx-HS follows a stepped size model. The scale factor (SF) used for the test dataset must be chosen from the following set of fixed scale factors:
• 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB.
• The corresponding numbers of records (B = billion) are:
  • 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B, where each record is 100 bytes and is generated by HSGen.
• http://www.tpc.org/tpcx-hs/
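
The record counts follow directly from the scale factor and the 100-byte record size. A small sketch of that arithmetic, assuming 1 TB = 10^12 bytes (the value consistent with the counts listed above); the function and constant names are hypothetical:

```python
RECORD_SIZE_BYTES = 100
BYTES_PER_TB = 10**12  # assumption: decimal terabytes, which matches 1TB -> 10B records
SCALE_FACTORS_TB = (1, 3, 10, 30, 100, 300, 1000, 3000, 10000)


def records_for_scale_factor(sf_tb):
    """Return the number of 100-byte records for a permitted scale factor (in TB)."""
    if sf_tb not in SCALE_FACTORS_TB:
        raise ValueError(f"{sf_tb} TB is not one of the fixed scale factors")
    return sf_tb * BYTES_PER_TB // RECORD_SIZE_BYTES


if __name__ == "__main__":
    for sf in SCALE_FACTORS_TB:
        billions = records_for_scale_factor(sf) // 10**9
        print(f"{sf}TB -> {billions}B records")
```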

TPCx-HS benchmarks - Metrics
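
A minimal sketch of how the two headline TPCx-HS metrics are usually computed, assuming the definitions HSph@SF = SF / (T / 3600) and $/HSph@SF = P / HSph@SF, where SF is the scale factor in TB, T is the total elapsed time of the run in seconds, and P is the total priced cost of the system under test. These definitions are stated here as an assumption and should be checked against the TPCx-HS specification; the function names and the sample figures are hypothetical.

```python
def hsph_at_sf(scale_factor_tb, elapsed_seconds):
    """Performance metric (assumed definition): HSph@SF = SF / (T / 3600)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return scale_factor_tb / (elapsed_seconds / 3600.0)


def price_per_hsph(total_price, hsph_value):
    """Price-performance metric (assumed definition): $/HSph@SF = P / HSph@SF."""
    if hsph_value <= 0:
        raise ValueError("hsph_value must be positive")
    return total_price / hsph_value


if __name__ == "__main__":
    # Hypothetical figures for illustration only; not published results.
    perf = hsph_at_sf(scale_factor_tb=10, elapsed_seconds=1800)
    print("HSph@10TB:", perf)
    print("$/HSph@10TB:", price_per_hsph(total_price=100000, hsph_value=perf))
```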

TPCx-HS Results on Cisco UCS
Cisco Published Results

Comparison of Various Benchmark Suites

Compared with the previous Hadoop MapReduce sort record, Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.

Sort Benchmark (http://sortbenchmark.org/)
• GraySort
• MinuteSort
• CloudSort
• JouleSort
• PennySort
• TeraByteSort
• DatamationSort
