Big Data Benchmarking

  1. Big Data Benchmarks. Srinivasa Rao Aravilli, N Venkata Naga Ravi
  2. Why? • Evaluating the effect of a hardware/software upgrade: OS, Java VM, ...; Hadoop, Cloudera CDH, Pig, Hive, Impala, ... • Debugging: compare with other clusters or published results. • Performance tuning
  3. Industry standard benchmarking organizations • TPC - Transaction Processing Performance Council (http://www.tpc.org/) • SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/) • CLDS - Center for Large-scale Data Systems Research (http://clds.sdsc.edu/bdbc) • Top Outcomes • BigData Top100 - an end-to-end application-layer benchmark for big data applications • Terasort - functional benchmark focusing on the sort function (quicksort using MapReduce) • HiBench • Sort, machine learning (K-means clustering, classification)
  4. Types of Benchmark • Micro-benchmarks: evaluate specific lower-level system operations • E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al., OSU • Functional / component benchmarks: evaluate a specific high-level function • E.g., sorting: Terasort • E.g., basic SQL: individual SQL operations such as Select, Project, Join, Order-By, ... • Application-level benchmarks: measure system performance (hardware and software) for a given application scenario, with given data and workload
  5. Terasort using Hadoop: Terasort includes 3 MapReduce applications • Teragen – generates the data • Terasort – samples the input data and uses the samples with MapReduce to sort the data • Teravalidate – validates that the output data is sorted
  6. MapReduce for Teragen
  7. MapReduce Model: a closer look at MapReduce's implementation model (source: http://developer.yahoo.com/hadoop/tutorial/module4.html)
  8. Benchmarking Suites • HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench) • YCSB - Yahoo Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/) • Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/) • BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/) • GridMix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html) • BigBench (https://github.com/intel-hadoop/Big-Bench) • TPCx-HS (http://www.tpc.org/tpcx-hs/)
  9. TPCx-HS benchmarks (X: Express, H: Hadoop, S: Sort) • TPCx-HS was developed to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics. • http://www.tpc.org/tpcx-hs/
  10. TPCx-HS Demo
  11. TPCx-HS benchmarks: Scale Factor. TPCx-HS follows a stepped size model. The scale factor (SF) used for the test dataset must be chosen from the set of fixed scale factors: • 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB. • The corresponding numbers of records are 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B, where each record is 100 bytes and is generated by HSGen. • http://www.tpc.org/tpcx-hs/
  12. TPCx-HS benchmarks - Metrics
  13. TPCx-HS Results on Cisco UCS: Cisco Published Results
  14. Comparison of various benchmark suites
  15.
  16. Spark Performance
  17. Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
  18. Sort Benchmark (http://sortbenchmark.org/) • GraySort • MinuteSort • CloudSort • JouleSort • PennySort • TeraByteSort • DatamationSort
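
The record counts on slide 11 follow directly from the scale factor and the 100-byte record size. The sketch below reproduces that arithmetic. It assumes 1 TB is counted as 10^12 bytes, which is the reading under which 1 TB corresponds to 10B records; the function names are illustrative and are not part of the TPCx-HS kit.

```python
# Record counts implied by the TPCx-HS scale factors listed on slide 11.
# Assumption: 1 TB = 10**12 bytes, so 1 TB / 100 bytes = 10 billion records.
RECORD_BYTES = 100
SCALE_FACTORS_TB = [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]


def records_for_scale_factor(sf_tb: int) -> int:
    """Number of 100-byte records needed to fill sf_tb terabytes."""
    return sf_tb * 10**12 // RECORD_BYTES


def in_billions(record_count: int) -> int:
    """Express a record count in billions (the 'B' suffix on the slide)."""
    return record_count // 10**9


if __name__ == "__main__":
    for sf in SCALE_FACTORS_TB:
        count = records_for_scale_factor(sf)
        print(f"{sf:>6} TB -> {in_billions(count):>7}B records")
```

Each step in the series multiplies the previous size by roughly 3, so the record counts follow the same 1, 3, 10, 30, ... pattern as the scale factors.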

Editor's Notes

  1. Record format: <10 bytes key><10 bytes rowid><78 bytes filler>\r\n. Example: $ hadoop jar hadoop-*examples*.jar teragen -D dfs.block.size=536870912 ... Source: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
  2. CPU Type: Intel Xeon E5-2660, 2.20 GHz. Total # of Processors: 32. Total # of Cores: 320. Total # of Threads: 640. Cluster: Yes. Data Generation Time (hours): 0.23. Data Sort Time (hours): 1.29. Data Validation Time (hours): 0.22. Total Storage/Database Size Ratio: 38.40. TPCx-HS FDR, 11 January 2015. Measured configuration: Total Nodes: 16; Total Processors/Cores/Threads: 32/320/640; Total Memory: 4,096 GB; Total Number of Storage Drives/Devices: 384; Total Storage Capacity: 384 TB.
  3. MVAPICH2 is an open source implementation of Message Passing Interface (MPI) that delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies.
  4. https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
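
Editor's note 1 gives the 100-byte record layout, and slide 5 names the three stages that operate on it. The sketch below is a toy, single-process imitation of those stages, meant only to show what each one is responsible for. It is not the Hadoop implementation: the real Terasort samples the input to partition keys across reducers, while this version simply sorts in memory. The function names, the random key alphabet, and the "X" filler are assumptions made for illustration.

```python
import random

# Layout from editor's note 1: <10 bytes key><10 bytes rowid><78 bytes filler>\r\n
KEY_LEN, ROWID_LEN, FILLER_LEN = 10, 10, 78
RECORD_LEN = KEY_LEN + ROWID_LEN + FILLER_LEN + 2  # +2 for "\r\n", 100 in total


def teragen(num_records, seed=0):
    """Generate records in the 100-byte layout (toy stand-in for Teragen)."""
    rng = random.Random(seed)
    records = []
    for row in range(num_records):
        # Assumption: keys are random printable ASCII bytes.
        key = bytes(rng.randrange(32, 127) for _ in range(KEY_LEN))
        rowid = str(row).rjust(ROWID_LEN, "0").encode("ascii")
        filler = b"X" * FILLER_LEN  # Assumption: constant filler.
        records.append(key + rowid + filler + b"\r\n")
    return records


def terasort(records):
    """Sort records by their 10-byte key (in memory, no sampling or partitioning)."""
    return sorted(records, key=lambda r: r[:KEY_LEN])


def teravalidate(records):
    """Check that every key is <= the key of the record that follows it."""
    return all(a[:KEY_LEN] <= b[:KEY_LEN] for a, b in zip(records, records[1:]))


if __name__ == "__main__":
    data = teragen(1000)
    print("sorted output valid:", teravalidate(terasort(data)))
```

Keeping validation as a separate pass mirrors the benchmark's structure: the sort stage is timed on its own, and correctness is checked afterwards.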