Evaluating the effect of a hardware/software
OS, Java VM, ...
Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
Compare with other clusters or published
Industry Standard benchmarking organizations
• TPC - Transaction Processing Performance Council (http://www.tpc.org/)
• SPEC - The Standard Performance Evaluation Corporation
• CLDS – Center for Large-scale Data Systems Research
• Top Outcomes
• BigData Top100 - an end-to-end application-layer benchmark for big data
• Terasort - Functional benchmark focusing on Sort function (quicksort using
• Sort, Machine learning (K-means clustering, Classification)
Types of Benchmark
• Micro-benchmarks. Evaluate specific lower-level system operations
• E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on
Modern Clusters, Panda et al, OSU
• Functional / component benchmarks. Evaluate a specific high-level function.
• E.g. Sorting: Terasort
• E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join, ...
• Application-level benchmarks.
• Measure system performance (hardware and software) for a given
application scenario—with given data and workload
Terasort using Hadoop
Terasort comprises three MapReduce applications:
• Teragen – generates the data
• Terasort – samples the input data and uses the samples with MapReduce to
sort the data
• Teravalidate – validates that the output data is sorted
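The three stages above can be sketched in a few lines of Python. This is a single-process illustration, not the Hadoop implementation: the function names `teragen`, `terasort` and `teravalidate`, the reducer count and the sample size are all chosen here for illustration, and the 10-byte key inside a 100-byte record is assumed from the usual TeraSort record layout. The point it shows is how sampled keys become split points, so that each "reducer" can sort its own range independently and the concatenated output is still globally sorted.

```python
import bisect
import random

KEY_BYTES = 10       # assumed: usual TeraSort layout, 10-byte key
RECORD_BYTES = 100   # 100-byte records


def teragen(num_records, seed=0):
    """Generate records: a random 10-byte key followed by a 90-byte payload."""
    rng = random.Random(seed)
    records = []
    for i in range(num_records):
        key = bytes(rng.randrange(256) for _ in range(KEY_BYTES))
        payload = str(i).encode().rjust(RECORD_BYTES - KEY_BYTES, b"0")
        records.append(key + payload)
    return records


def terasort(records, num_reducers=4, sample_size=100, seed=1):
    """Sample keys, derive split points, partition, then sort each partition."""
    if not records:
        return [[] for _ in range(num_reducers)]
    rng = random.Random(seed)
    sampled = rng.sample(records, min(sample_size, len(records)))
    sample_keys = sorted(rec[:KEY_BYTES] for rec in sampled)
    # num_reducers - 1 split points at evenly spaced positions in the sample
    splits = [sample_keys[len(sample_keys) * i // num_reducers]
              for i in range(1, num_reducers)]
    partitions = [[] for _ in range(num_reducers)]
    for rec in records:
        # "map" side: route each record to the reducer that owns its key range
        idx = bisect.bisect_right(splits, rec[:KEY_BYTES])
        partitions[idx].append(rec)
    # "reduce" side: every partition is sorted on its own
    return [sorted(p, key=lambda r: r[:KEY_BYTES]) for p in partitions]


def teravalidate(partitions):
    """Return True if keys are non-decreasing within and across partitions."""
    previous = None
    for part in partitions:
        for rec in part:
            key = rec[:KEY_BYTES]
            if previous is not None and key < previous:
                return False
            previous = key
    return True


if __name__ == "__main__":
    data = teragen(1000)
    output = terasort(data)
    print("sorted:", teravalidate(output))
```

Because the split points come from a sample, partition sizes are only approximately balanced; a larger sample gives a more even split at the cost of more up-front work.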
MapReduce Model
A closer look at MapReduce’s implementation model
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
• HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
• YCSB - Yahoo! Cloud Serving Benchmark, Brian Cooper, Yahoo!
• Berkeley Big Data Benchmark, Pavlo et al., AMPLab
• BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences
• Grid Mix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
• Big Bench (https://github.com/intel-hadoop/Big-Bench)
• TPCx-HS (http://www.tpc.org/tpcx-hs/)
X: Express, H: Hadoop, S: Sort
• TPCx-HS was developed to provide an
objective measure of hardware,
operating system and commercial
Apache Hadoop File System API
compatible software distributions, and
to provide the industry with verifiable
performance, price-performance and
TPCx-HS follows a stepped size model. The scale factor (SF) used for the test
dataset must be chosen from the following fixed set:
• 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB.
• The corresponding numbers of records are
• 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B,
where B denotes billion and each record is 100 bytes, generated by HSGen.
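The record counts follow directly from the scale factor and the 100-byte record size. A minimal sketch of that arithmetic, assuming decimal terabytes (1 TB = 10^12 bytes), which is the reading under which 1TB corresponds to 10 billion records; the function name `records_for_scale_factor` is made up for this example:

```python
RECORD_BYTES = 100  # record size stated for HSGen
SCALE_FACTORS_TB = [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def records_for_scale_factor(sf_tb):
    """Number of 100-byte records needed for a given scale factor in TB."""
    if sf_tb not in SCALE_FACTORS_TB:
        raise ValueError(f"{sf_tb} TB is not one of the fixed scale factors")
    return sf_tb * BYTES_PER_TB // RECORD_BYTES


if __name__ == "__main__":
    for sf in SCALE_FACTORS_TB:
        billions = records_for_scale_factor(sf) // 10**9
        print(f"{sf} TB: {billions}B records")
```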
CPU Type: Intel Xeon E5-2660 - 2.20 GHz
Total # of Processors: 32
Total # of Cores: 320
Total # of Threads: 640
Cluster: Yes
Data Generation Time (hours): 0.23
Data Sort Time (hours): 1.29
Data Validation Time (hours): 0.22
Total Storage/Database Size Ratio: 38.40
Total Nodes: 16
Total Processors/Cores/Threads: 32/320/640
Total Memory: 4,096GB
Total Number of Storage Drives/Devices: 384
Total Storage Capacity: 384
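To make the reported configuration easier to compare with other clusters, the totals can be broken down per node. The sketch below only does arithmetic on the figures listed above; the variable names are illustrative. The summed phase time covers just the three reported phases and is not the official TPCx-HS elapsed time or metric.

```python
# Figures copied from the result summary above
nodes = 16
processors = 32
cores = 320
threads = 640
memory_gb = 4096
drives = 384
phase_hours = {"generation": 0.23, "sort": 1.29, "validation": 0.22}

# Per-node and per-processor breakdown
cores_per_processor = cores / processors
cores_per_node = cores / nodes
threads_per_node = threads / nodes
memory_gb_per_node = memory_gb / nodes
drives_per_node = drives / nodes

# Time spent in the three reported phases, and the share taken by the sort
reported_hours = sum(phase_hours.values())
sort_share = phase_hours["sort"] / reported_hours

if __name__ == "__main__":
    print(f"cores/processor: {cores_per_processor:.0f}")
    print(f"cores/node: {cores_per_node:.0f}, threads/node: {threads_per_node:.0f}")
    print(f"memory/node: {memory_gb_per_node:.0f} GB, drives/node: {drives_per_node:.0f}")
    print(f"reported phases: {reported_hours:.2f} h, sort share: {sort_share:.0%}")
```

On these numbers the sort phase accounts for roughly three quarters of the reported time, which is why sort throughput dominates the result.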
MVAPICH2 is an open source implementation of the Message Passing Interface (MPI), designed to deliver high performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies.