Using BigBench to compare Hive
and Spark
Alejandro Montero, Nicolas Poggi
1
What is BigBench (TPCx-BB1)?
• Specification-based benchmark with an open-source implementation, proposed as
the first Big Data benchmark standard.
• BigBench covers all major Big Data characteristics
• Volume -> Scale factor.
• Velocity -> Table refresh.
• Variety -> Data type disparity.
• Extension of TPC-DS.
• Borrows 10 pure QL queries from TCP-DS.
• Added BigData tables and use cases: Machine Learning, Natural Language Processing, ...
• Can support multiple implementations, multiple BigData engines and table formats.
• Can execute multiple parallel streams.
• Defines scale factors for data.
• Tested: 100 GB.
2[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
BigBench – Overview
3
Unstructured DataStructured Data
Semi-Structured Data
Marketprice Items
Sales
Web Page Customers
Reviews
Web Log
Workload:
• 14 Pure QL queries.
• 10 Borrowed from TPC-DS.
• 4 Queries with MapReduce pre-processing.
• 7 Natural Language Processing Queries.
• 5 Machine Learning Queries.
BigBench v1.2 – Reference Implementation
HDFS
Hive Metastore
MapReduce Tez Spark
Yarn
Hive Spark SQL
Mahout ML Custom Spark MLlibApplication
SQL Engine
Table Metastore
Execution Engine
Filesystem
Benchmarked systems:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark_MLlib
• Hive + Tez + Mahout
• Hive + Tez + Spark_MLlib
• Spark SQL + Mahout
• Spark SQL + Spark_MLlib
• Spark 2 SQL + Mahout
Work in progress:
• Hive 2
• Spark 2 SQL + Spark_MLlib
The cluster – HDInsight PaaS
5
Model HDInsight D4v3
# Head nodes 2
# Working nodes 4
# Zookeeper nodes 3
CPU Intel(R) Xeon(R) CPU E5-2673 v3
8 x 2,4 GHz cores
RAM 28 GB
HDFS Remote
Software HortonWorks Data Platform 2.5
Spark config 1 executor/working node
3 cores/executor
Pure QL
6
6223
1601
1848
1457
0
1000
2000
3000
4000
5000
6000
7000
Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1
Timeinseconds
Average of three executions using 100 GB Scale Factor
Query 12 CPU behavior
7
Tez Spark 1.6.2 Spark 2.0.2
Average of three executions using 100 GB Scale Factor
Custom Reducers
8
2815
1122
1629
1466
0
500
1000
1500
2000
2500
3000
Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1
Timeinseconds
Average of three executions using 100 GB Scale Factor
Natural Language Processing
9
5004
1100
2289
1913
0
1000
2000
3000
4000
5000
6000
Hive_MR Hive_Tez Spark_1.6.2 Spark_2.0.1
Timeinseconds
Average of three executions using 100 GB Scale Factor
Machine Learning
10
3550
1613
3769
1898
4045
1937
3390
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Hive_MR +
Mahout
Hive_MR +
Spark_ml
Hive_tez +
Mahout
Hive_Tez +
Spark_ML
Spark_1.6.2 +
Mahout
Spark_1.6.2 +
Spark_ML
Spark_2.0.1 +
Mahout
Timeinseconds
Average of three executions using 100 GB Scale Factor
11
Aggregated Results
6223 6223
1601 1623 1848 1951 1457
2815 2815
1122 1137
1629 1489
1466
5004 5004
1100 1082
2289 2356
1913
3550
1613
3769
1898
4045
1937 3390
17592
15655
7592
5740
9811
7733
8226
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
Hive_MR +
Mahout
Hive_MR +
Spark_ML
Hive_tez +
Mahout
Hive_Tez +
Spark_ML
Spark_1.6.2 +
Mahout
Spark_1.6.2 +
Spark_ML
Spark_2.0.1 +
Mahout
Timeinseconds
Pure-QL Custom Reducers NLP ML Total
Average of three executions using 100 GB Scale Factor
Conclusions
• Hive on Tez greatly improves SQL performance over Hive on MapReduce.
• It is also faster than Hive on spark 1.
• Hive on spark 2 is slightly faster.
• The Spark implementation is based on hive…
• Spark MLlib has an excellent performance over Mahout.
• Best production combination: Apache Tez for SQL + Spark MLlib for
Machine Learning.
12
Thanks, questions?
Follow up / feedback : Alejandro.montero@bsc.es
Using BigBench to compare Hive and Spark
13

Using BigBench to compare Hive and Spark (short version)

  • 1.
    Using BigBench tocompare Hive and Spark Alejandro Montero, Nicolas Poggi 1
  • 2.
    What is BigBench(TPCx-BB1)? • Specification-based benchmark with an open-source implementation, proposed as the first Big Data benchmark standard. • BigBench covers all major Big Data characteristics • Volume -> Scale factor. • Velocity -> Table refresh. • Variety -> Data type disparity. • Extension of TPC-DS. • Borrows 10 pure QL queries from TCP-DS. • Added BigData tables and use cases: Machine Learning, Natural Language Processing, ... • Can support multiple implementations, multiple BigData engines and table formats. • Can execute multiple parallel streams. • Defines scale factors for data. • Tested: 100 GB. 2[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
  • 3.
    BigBench – Overview 3 UnstructuredDataStructured Data Semi-Structured Data Marketprice Items Sales Web Page Customers Reviews Web Log Workload: • 14 Pure QL queries. • 10 Borrowed from TPC-DS. • 4 Queries with MapReduce pre-processing. • 7 Natural Language Processing Queries. • 5 Machine Learning Queries.
  • 4.
    BigBench v1.2 –Reference Implementation HDFS Hive Metastore MapReduce Tez Spark Yarn Hive Spark SQL Mahout ML Custom Spark MLlibApplication SQL Engine Table Metastore Execution Engine Filesystem Benchmarked systems: • Hive + MapReduce + Mahout • Hive + MapReduce + Spark_MLlib • Hive + Tez + Mahout • Hive + Tez + Spark_MLlib • Spark SQL + Mahout • Spark SQL + Spark_MLlib • Spark 2 SQL + Mahout Work in progress: • Hive 2 • Spark 2 SQL + Spark_MLlib
  • 5.
    The cluster –HDInsight PaaS 5 Model HDInsight D4v3 # Head nodes 2 # Working nodes 4 # Zookeeper nodes 3 CPU Intel(R) Xeon(R) CPU E5-2673 v3 8 x 2,4 GHz cores RAM 28 GB HDFS Remote Software HortonWorks Data Platform 2.5 Spark config 1 executor/working node 3 cores/executor
  • 6.
    Pure QL 6 6223 1601 1848 1457 0 1000 2000 3000 4000 5000 6000 7000 Hive_MR Hive_tezSpark_1.6.2 Spark_2.0.1 Timeinseconds Average of three executions using 100 GB Scale Factor
  • 7.
    Query 12 CPUbehavior 7 Tez Spark 1.6.2 Spark 2.0.2 Average of three executions using 100 GB Scale Factor
  • 8.
    Custom Reducers 8 2815 1122 1629 1466 0 500 1000 1500 2000 2500 3000 Hive_MR Hive_tezSpark_1.6.2 Spark_2.0.1 Timeinseconds Average of three executions using 100 GB Scale Factor
  • 9.
    Natural Language Processing 9 5004 1100 2289 1913 0 1000 2000 3000 4000 5000 6000 Hive_MRHive_Tez Spark_1.6.2 Spark_2.0.1 Timeinseconds Average of three executions using 100 GB Scale Factor
  • 10.
    Machine Learning 10 3550 1613 3769 1898 4045 1937 3390 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Hive_MR + Mahout Hive_MR+ Spark_ml Hive_tez + Mahout Hive_Tez + Spark_ML Spark_1.6.2 + Mahout Spark_1.6.2 + Spark_ML Spark_2.0.1 + Mahout Timeinseconds Average of three executions using 100 GB Scale Factor
  • 11.
    11 Aggregated Results 6223 6223 16011623 1848 1951 1457 2815 2815 1122 1137 1629 1489 1466 5004 5004 1100 1082 2289 2356 1913 3550 1613 3769 1898 4045 1937 3390 17592 15655 7592 5740 9811 7733 8226 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 Hive_MR + Mahout Hive_MR + Spark_ML Hive_tez + Mahout Hive_Tez + Spark_ML Spark_1.6.2 + Mahout Spark_1.6.2 + Spark_ML Spark_2.0.1 + Mahout Timeinseconds Pure-QL Custom Reducers NLP ML Total Average of three executions using 100 GB Scale Factor
  • 12.
    Conclusions • Hive onTez greatly improves SQL performance over Hive on MapReduce. • It is also faster than Hive on spark 1. • Hive on spark 2 is slightly faster. • The Spark implementation is based on hive… • Spark MLlib has an excellent performance over Mahout. • Best production combination: Apache Tez for SQL + Spark MLlib for Machine Learning. 12
  • 13.
    Thanks, questions? Follow up/ feedback : Alejandro.montero@bsc.es Using BigBench to compare Hive and Spark 13