Big Data Benchmarks
Srinivasa Rao Aravilli
N Venkata Naga Ravi
2
Why benchmark?
• Evaluating the effect of a hardware/software upgrade:
  • OS, Java VM, ...
  • Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
• Debugging:
  • Compare with other clusters or published results.
• Performance tuning
3
Industry-standard benchmarking organizations
• TPC - Transaction Processing Performance Council (http://www.tpc.org/)
• SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/)
• CLDS - Center for Large-scale Data Systems Research (http://clds.sdsc.edu/bdbc)
• Top Outcomes
• BigData Top100 - an end-to-end application-layer benchmark for big data applications
• Terasort - functional benchmark focusing on the sort function (quicksort using MapReduce)
• HiBench
• Sort, machine learning (K-means clustering, classification)
4
Types of Benchmark
• Micro-benchmarks. Evaluate specific lower-level system operations.
• E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al., OSU
• Functional / component benchmarks. Evaluate a specific high-level function.
• E.g., sorting: Terasort
• E.g., basic SQL: individual SQL operations such as Select, Project, Join, Order-By, ...
• Application-level benchmarks.
• Measure system performance (hardware and software) for a given application scenario, with given data and workload
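To make the micro-benchmark category concrete, here is a minimal sketch in Python that times one low-level operation (an in-memory sort) at several input sizes. The helper names `time_operation` and `make_sort_workload` are hypothetical and not part of any suite listed in these slides. A real micro-benchmark would target HDFS or MapReduce operations instead.

```python
import random
import time


def time_operation(operation, repeats=5):
    """Return the best wall-clock time in seconds over several runs of operation()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def make_sort_workload(n, seed=0):
    """Build a zero-argument callable that sorts n pseudo-random floats."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    # sorted() returns a new list, so every run sees the same unsorted input.
    return lambda: sorted(data)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        seconds = time_operation(make_sort_workload(n))
        print(f"sort {n} floats: {seconds:.6f} s")
```

Taking the best of several runs is one common way to reduce noise from other processes. Reporting the median or the full distribution is an equally reasonable choice.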
5
Terasort using Hadoop
Terasort includes 3 MapReduce applications:
• Teragen – generates the data
• Terasort – samples the input data and uses the sampled keys to partition and sort the data with MapReduce
• Teravalidate – validates that the output data is sorted
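The three stages can be sketched in memory as follows. This is an illustrative Python model, not the Hadoop implementation: the record layout (10-byte key, 10-byte row id, 78-byte filler, CRLF) is taken from the editor's note on the Teragen slide. The function names, the sample size, and the partition count are assumptions made for the example.

```python
import bisect
import random

KEY_LEN, ROWID_LEN, FILLER_LEN = 10, 10, 78  # plus b"\r\n" gives 100 bytes


def teragen(num_records, seed=0):
    """Generate 100-byte records: random key, zero-padded row id, filler, CRLF."""
    rng = random.Random(seed)
    records = []
    for row in range(num_records):
        key = bytes(rng.randrange(32, 127) for _ in range(KEY_LEN))
        rowid = str(row).rjust(ROWID_LEN, "0").encode("ascii")
        records.append(key + rowid + b"X" * FILLER_LEN + b"\r\n")
    return records


def sample_split_points(records, num_partitions, sample_size=100, seed=1):
    """Sample keys and pick num_partitions - 1 split points (assumes records is non-empty)."""
    rng = random.Random(seed)
    picked = rng.sample(records, min(sample_size, len(records)))
    sample = sorted(rec[:KEY_LEN] for rec in picked)
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def terasort(records, num_partitions=4):
    """Route records to key ranges ("map"), then sort each range ("reduce")."""
    splits = sample_split_points(records, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[bisect.bisect_right(splits, rec[:KEY_LEN])].append(rec)
    # Partition i only holds keys below those in partition i + 1, so the
    # concatenation of the locally sorted partitions is globally sorted.
    return [sorted(part, key=lambda r: r[:KEY_LEN]) for part in partitions]


def teravalidate(partitions):
    """Check that keys are non-decreasing across the concatenated partitions."""
    keys = [rec[:KEY_LEN] for part in partitions for rec in part]
    return all(a <= b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    output = terasort(teragen(10_000))
    print("sorted:", teravalidate(output), "sizes:", [len(p) for p in output])
```

Sampling matters because the split points decide how evenly the work is spread across reducers. A poor sample gives unbalanced partitions, though the output is still correctly sorted.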
6
MapReduce for Teragen
7
MapReduce Model
A closer look at MapReduce's implementation model
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
8
Benchmarking Suites
• HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
• YCSB - Yahoo Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/)
• Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/)
• BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/)
• GridMix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
• BigBench (https://github.com/intel-hadoop/Big-Bench)
• TPCx-HS (http://www.tpc.org/tpcx-hs/)
9
TPCx-HS benchmarks
X: Express, H: Hadoop, S: Sort
• TPCx-HS was developed to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics.
• http://www.tpc.org/tpcx-hs/
10
TPCx-HS Demo
11
TPCx-HS benchmarks
Scale Factor
TPCx-HS follows a stepped size model. The scale factor (SF) used for the test dataset must be chosen from the following fixed set:
• 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB.
• The corresponding numbers of records are
• 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B (B = billion),
where each record is 100 bytes and is generated by HSGen.
• http://www.tpc.org/tpcx-hs/
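The record counts follow directly from the record size: scale factor in bytes divided by 100 bytes per record. The sketch below reproduces that arithmetic in Python. It assumes decimal terabytes (1 TB = 10^12 bytes), which is what makes 1 TB correspond to 10 billion records. The function names are hypothetical.

```python
RECORD_BYTES = 100   # each HSGen record is 100 bytes
TB = 10**12          # assumption: decimal terabyte
SCALE_FACTORS_TB = [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]


def is_valid_scale_factor(sf_tb):
    """True if sf_tb is one of the fixed TPCx-HS scale factors listed on the slide."""
    return sf_tb in SCALE_FACTORS_TB


def records_for_scale_factor(sf_tb):
    """Number of 100-byte records needed to fill sf_tb terabytes."""
    if not is_valid_scale_factor(sf_tb):
        raise ValueError(f"{sf_tb} TB is not a permitted scale factor")
    return sf_tb * TB // RECORD_BYTES


if __name__ == "__main__":
    for sf in SCALE_FACTORS_TB:
        billions = records_for_scale_factor(sf) // 10**9
        print(f"SF {sf:>5} TB -> {billions}B records")
```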
12
TPCx-HS benchmarks - Metrics
13
TPCx-HS Results on Cisco UCS
Cisco Published Results
14
Comparison of various benchmark suites
15
16
Spark Performance
17
Compared with the previous Hadoop MapReduce record, Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
18
Sort Benchmark (http://sortbenchmark.org/)
• GraySort
• MinuteSort
• CloudSort
• JouleSort
• PennySort
• TeraByteSort
• DatamationSort

Editor's Notes

  • #7 Teragen record format: <10 bytes key><10 bytes rowid><78 bytes filler>\r\n. Example invocation: $ hadoop jar hadoop-*examples*.jar teragen -D dfs.block.size=536870912 ... Source: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
  • #14 CPU Type: Intel Xeon E5-2660, 2.20 GHz; Total # of Processors: 32; Total # of Cores: 320; Total # of Threads: 640; Cluster: Yes. Data Generation Time (hours): 0.23; Data Sort Time (hours): 1.29; Data Validation Time (hours): 0.22; Total Storage/Database Size Ratio: 38.40. TPCx-HS FDR, 11 January 2015. Measured configuration: Total Nodes: 16; Total Processors/Cores/Threads: 32/320/640; Total Memory: 4,096 GB; Total Number of Storage Drives/Devices: 384; Total Storage Capacity: 384 TB
  • #16 MVAPICH2 is an open source implementation of Message Passing Interface (MPI) that delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies.
  • #18 https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html