Hive on Spark is Blazing Fast… Or Is It?

This presentation was given at Strata + Hadoop World 2015 in San Jose.

Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data.

In this talk the speakers examined Hive performance, past, present and future. In particular they looked at Hive’s origins as a petabyte scale SQL engine.

Through some numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, by vectorizing execution and by introducing a cost-based optimizer.
They detailed and discussed the challenges of scalable SQL on Hadoop.

They looked into Hive’s sub-second future, powered by LLAP and Hive on Spark, and showed just how fast Hive on Spark really is.

  • Nice! Thanks for sharing.
  • Slide 53 mentions a ‘Promising initial release’ for Hive on Spark. 1. It is worth noting that Hive on Spark was still in beta when Hortonworks gave this presentation at Strata + Hadoop World in San Jose on February 20, 2015. See the umbrella JIRA for the Hive on Spark project: https://issues.apache.org/jira/browse/HIVE-7292 and the Hive-on-Spark beta download: http://blog.cloudera.com/blog/2015/02/download-the-hive-on-spark-beta/ 2. Spark SQL was also still an alpha release on that date. It already ships inside CDH 5.x, but it is tagged as alpha: not supported and not recommended for production use. 3. I hope to see the Hortonworks benchmark in this presentation updated once the Hive on Spark project is complete and released. Disclaimer: This is a vendor-independent comment that expresses my own opinions. I am not endorsing or promoting any product or vendor mentioned in my comment.

  1. 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hive on Spark is Blazing Fast… Or Is It? Carter Shanklin and Mostafa Mokhtar
  2. Why SQL on Hadoop? Solving for Scale. Hadoop is great for cost, but MapReduce is too difficult. SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer. I’m deleting important data because it’s too expensive to store it.
  3. SQL at Facebook: Emergence of Apache Hive. Developed Hive to address traditional RDBMS limitations. 300+ PB of data under management (1). 600+ TB of data loaded daily. 60,000+ Hive queries per day (2). More than 1,000 users per day. Initial Apache release in April 2009.
  4. Hive Classic: Strengths and Challenges. Strengths: familiar SQL interface; economical processing of petabytes. Challenges: Hive Classic tied to MapReduce, leading to latency. Traditional SQL workloads needed higher performance!
  5. Need for Speed: The Stinger Initiative. Stinger: an open roadmap to improve Apache Hive’s performance 100x. Launched: February 2013; Delivered: April 2014. Delivered in 100% Apache open source. Vectorized SQL Engine + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100x+
  6. Stinger Phase 3: TPC-DS Benchmark at 30 Terabyte Scale. Sample of 50 queries from TPC-DS at 30 terabyte scale. Average 52x query speedup, maximum 160x query speedup. Total benchmark time decreased from 7.8 days to 9.3 hours (3). The cost-based optimizer added in Hive 0.14 gave an additional 2.5x speedup.
  7. Hive + Stinger at Yahoo. Scale: around 1 million Hive jobs run every month. Performance: total benchmark time cut from 8.1 hours to 1.3 hours at 10 TB scale. Performance: up to 82x faster (4).
  8. Stinger at Spotify. Speed: query 25 TB of compressed data in 10 minutes across 690 nodes (MapReduce too slow to complete). Efficiency: 16x less HDFS read when using ORCFile versus Avro (5).
  9. ORCFile at Facebook. Compression: saved more than 1,400 servers’ worth of storage. Compression: compression ratio increased from 5x to 8x globally.
  10. Hive on Tez: Conclusion. Hive on Tez delivers fast batch and interactive SQL today. But users need more speed! Scale: proven at petabyte scale. SQL: the most comprehensive open-source SQL on Hadoop. Speed: more than 90 Hortonworks customers use Hive-on-Tez today for fast SQL (Hortonworks Customer Support metrics as of Feb 2015).
  11. Next Stop: Stinger.next and Sub-Second SQL. The emergence of LLAP and Hive-on-Spark brings sub-second within reach. What does it take to get Hive to sub-second? Does Hive-on-Spark get us there?
  12. Performance Today and the Sub-Second Future: Hive on Tez, Hive on Spark, Hive on MapReduce & Spark-SQL
  13. Query processing in Hadoop (layer diagram). SQL: SQL support (HiveQL). SQL Engine: Hive engine; Spark-SQL engine. Distributed Execution Engine: Tez, MapReduce, Spark. Storage: columnar storage (ORC file, Parquet file). Cache: block cache (Linux cache).
  14. Query processing in Hadoop (same layer diagram as slide 13), highlighting what is covered today in terms of performance.
  15. Performance comparison: Test bed
      Software:
        Component   Version
        Hive        1.2.0
        Tez         0.5.2
        Spark       1.2.0
        Hadoop      2.6.0
      Hardware: 20 physical nodes, each with:
        ● 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for a total of 16 CPU cores/machine
        ● Hyper-threading enabled
        ● 256GB RAM per node
        ● 6x 4TB WDC WD4000FYYZ-0 drives per node
        ● 10 Gigabit interconnect between the nodes
      Note: Based on the YARN Node Manager’s Memory Resource setting used below, only 128 GB of RAM per node was dedicated to query processing.
      Performance benchmarks:
        Execution engine   Primitives on 30TB scale factor   TPC-DS queries on 30TB scale factor   TPC-DS queries on 200GB scale factor
        Spark              X                                 X                                     X
        Tez                X                                 X                                     X
        MapReduce          X
        Spark-SQL          X                                 X                                     X
  16. Performance comparison: Configurations
      Hive on Tez
        ● 128GB of memory allocated
        ● 16 out of 32 logical processors allocated
        ● hive.execution.engine = tez
        ● hive.auto.convert.join.noconditionaltask.size = 600MB
        ● Vectorization enabled
        ● CBO enabled
        ● Fetch column stats enabled
        Other settings
        ● hive.prewarm.numcontainers = 317
        ● hive.tez.auto.reducer.parallelism = true
      Hive on Spark
        ● 128GB of memory allocated
        ● 16 out of 32 logical processors allocated
        ● hive.execution.engine = spark
        ● Configuration parameters followed the recommendations from the Hive on Spark wiki (http://tinyurl.com/pk2ju8e), which also had CBO, vectorization, fetch column stats etc. enabled
        ● spark.master = yarn-master
        Spark settings
        ● spark.shuffle.memoryFraction = 0.5
        ● spark.storage.memoryFraction = 0.1
        ● spark.shuffle.consolidateFiles = true
        ● spark.serializer = org.apache.spark.serializer.KryoSerializer
      Spark-SQL
        ● 128GB of memory allocated
        ● 16 out of 32 logical processors allocated
        ● spark.shuffle.memoryFraction = 0.5
        ● spark.storage.memoryFraction = 0.1
        ● spark.shuffle.consolidateFiles = true
        ● spark.serializer = org.apache.spark.serializer.KryoSerializer
        ● spark.sql.shuffle.partitions = 1009
        ● spark-sql --master yarn-client
        ● driver-memory 8g
        ● Default GC configuration
        spark.sql.codegen was not enabled, as it caused most queries to fail.
  17. Performance comparison: TPC-DS 200GB
        ● Warm timings reported; cold queries on Spark are significantly slower
        ● Hive on Tez using ORC format
        ● Hive on Spark using Parquet format
        ● Spark-SQL using Parquet format
      Chart values: Hive on Tez 1,118; Hive on Spark 1,982; Spark-SQL 1,235
  18. Performance comparison: TPC-DS 200GB continued (same bullets and chart as slide 17). Callouts: Hive on Tez is 77% faster than Hive on Spark and 10% faster than Spark-SQL. Spark-SQL is 60% faster than Hive on Spark.
  19. Performance comparison: TPC-DS 200GB summary
  20. Performance comparison: TPC-DS 200GB summary. Callout: Even simple queries don’t run in sub-second.
  21. Performance comparison: TPC-DS 200GB summary. Callout: Even simple queries don’t run in sub-second.
  22. Performance comparison: TPC-DS 200GB
        ● 200GB scale factor, un-partitioned schema
        ● 45 unmodified queries from TPC-DS
        ● ORC format compression ratio of 3.4x
        ● Parquet format compression ratio of 2.8x
  23. Performance comparison: TPC-DS 30TB
      Workload config
        ● 30 TB scale factor
        ● ORC table format
        ● Fact tables partitioned on *_date_sk
        ● Explicit partition filters were used for Hive on Spark and Spark-SQL (but not for Hive-on-Tez)
        ● 20 of the previously used queries were used; warm query timings reported
      Highlights from 30TB TPC-DS test
        ● Hive on Tez outperforms Hive on Spark and Spark-SQL by up to 18x
        ● Hive on Spark completed 15 of the 20 queries; the remaining 5 errored out or were stuck in GC and got cancelled
        ● Spark-SQL completed 7 of the 20 queries; the remaining 13 either failed within a couple of minutes or errored out after running for hours
        ● Spark-SQL performance is negatively affected by inefficient query plans, as it lacks a query optimizer
  24. Performance comparison: TPC-DS 30TB. Chart values: Hive on Tez 1,828; Hive on Spark 10,098.
  25. Performance comparison: TPC-DS 30TB. Chart values: Hive on Tez 1,828; Hive on Spark 10,098. Callout: For large data sets Hive on Tez is ~5x faster than Hive on Spark.
  26. Performance comparison: TPC-DS 30TB continued
  27. Performance comparison: TPC-DS 30TB continued. Callout: Failed Spark-SQL queries.
  28. Performance comparison: TPC-DS 30TB Q17
  29. Performance comparison: TPC-DS 30TB Q17. Callout: Hive on Tez query ends here.
  30. Why didn’t Spark take Hive to sub-second?
        ● Hive is CPU bound for most operations, especially after the introduction of columnar file formats (do more with less)
        ● Spark consumes more CPU, disk & network IO than Tez
        ● Hive on Spark spends a lot of time translating from RDDs to Hive’s “Row Containers”
  31. Why didn’t Spark take Hive to sub-second?
        ● Hive is CPU bound for most operations, especially after the introduction of columnar file formats (do more with less)
        ● Spark consumes more CPU, disk & network IO than Tez for relatively large datasets
        ● Hive on Spark spends a lot of time translating from RDDs to Hive’s “Row Containers”
      Chart callouts: 6x less CPU, 2x less disk IO, 4x less network IO
  32. I don’t believe what you just said!!! Show me some queries I can understand... Execution engine primitives: simple queries to understand complex systems.
  33. Performance comparison: What are those primitives? (Group, test case: comment)
      ETL
        ● Create table as select *: insert 8 billion rows, 570 GB of data
        ● Create table as select with group by: group by and insert 8 billion rows, 570 GB of data
        ● Create table as with group by on all columns followed by cluster by: group by, cluster by and insert 8 billion rows, 570 GB of data
      Group by
        ● Group by on primary key: group by 25 billion distinct keys
        ● Group by on column with low NDV*: group by 82 billion rows with 8K distinct keys
      Map join
        ● store_sales x item: map join 28 billion x 462K
        ● store_sales x item x store: map join 28 billion x 462K x 1.7K
        ● store_sales x item x store x customer_demographics: map join 28 billion x 462K x 1.7K x 1.9 million
      Shuffle join
        ● Shuffle join: shuffle join 8.6 billion x 706 million rows
        ● Shuffle join + group by on primary key: shuffle join 8.6 billion x 706 million rows followed by group by on 675 million rows
      * NDV: number of distinct values
  34. Performance comparison: CTAS
      Query: Create table test_table as select * from store_returns;
        Execution engine    Elapsed time (seconds)   Tez gain %
        Hive on Tez         316
        Hive on Spark       351                      11%
        Hive on MapReduce   494                      56%
        Spark-SQL           418                      32%
      Plan operators: Table Scan store_returns (8 billion rows); Table Insert (8 billion rows)
  35. Performance comparison: CTAS (same query, table and chart as slide 34). Callout: Tez is 11% faster than Hive on Spark, 56% faster than MapReduce and 32% faster than Spark-SQL.
  36. Performance comparison: CTAS with group by
      Query: Create table test_table as select * from store_returns group by *;
        Execution engine    Elapsed time (seconds)   Tez gain %
        Hive on Tez         630
        Hive on Spark       1,608                    155%
        Hive on MapReduce   840                      33%
        Spark-SQL           1,202                    91%
      Plan operators: Table Scan store_returns (8 billion rows); Shuffle on all columns (8 billion rows); Group by on all columns (7 billion rows); Table Insert (4 billion rows)
  37. Performance comparison: CTAS with group by (same query, table and chart as slide 36). Callout: This time, the execution engine must prepare, shuffle and aggregate data.
  38. Performance comparison: CTAS with group by (same query, table and chart as slide 36). Callout: Tez is 155% faster than Hive on Spark, 33% faster than MapReduce and 91% faster than Spark-SQL.
  39. Performance comparison: Select + group by on PK
      Query: select count(*) rowcount from store_sales group by ss_item_sk, ss_ticket_number having rowcount > 100000000
        Execution engine    Elapsed time (seconds)   Tez gain %
        Hive on Tez         457
        Hive on Spark       2,966                    550%
        Hive on MapReduce   893                      96%
        Spark-SQL           862                      89%
      Plan operators: Table Scan (25 billion rows); Group by (25 billion rows); Shuffle (25 billion rows); Filter operator (25 billion rows); Select (0 rows qualify)
  40. Performance comparison: Select + group by on PK (same query, table and chart as slide 39). Callout: Group-by performed on all 25 billion distinct keys.
  41. Performance comparison: Select + group by on PK (same query, table and chart as slide 39). Callout: Tez is 550% faster than Hive on Spark, 96% faster than MapReduce and 89% faster than Spark-SQL.
  42. Performance comparison: Select + group by on low NDV
      Query: select sum(ss_list_price) from store_sales group by ss_sold_date_sk having sum(ss_list_price) = 1
        Execution engine    Elapsed time (seconds)   Tez gain %
        Hive on Tez         51
        Hive on Spark       56                       10%
        Hive on MapReduce   290                      465%
        Spark-SQL           164                      221%
      Plan operators: Table Scan (85 billion rows); Group by (85 billion rows); Filter operator (8K rows); Select (0 rows qualify)
  43. Performance comparison: Select + group by on low NDV (same query, table and chart as slide 42). Callout: Hive on Tez and Hive on Spark outperform Spark-SQL.
  44. Performance comparison: Map join with 1, 2 & 3 tables
      Query: select count(*) from store_sales, item, store, customer_demographics where i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_cdemo_sk = cd_demo_sk
        Execution engine    Map join #1   Map join #2   Map join #3   Tez join #1 gain %   Tez join #2 gain %   Tez join #3 gain %
        Hive on Tez         108           145           232
        Hive on Spark       106           142           289           98%                  98%                  125%
        Hive on MapReduce   247           280           800           228%                 193%                 345%
        Spark-SQL           86            117           166           -20%                 -20%                 -28%
      Note: as printed, the Hive on Spark and Hive on MapReduce percentages are each engine’s time as a percentage of the Tez time (for example 106/108 ≈ 98%), while the Spark-SQL row is a relative difference (86/108 − 1 ≈ −20%).
      Plan operators: Table Scan store_sales (28 billion rows); Table Scan item (472K rows); Table Scan store (1.7K rows); Table Scan customer_demographics (1.9 million rows); three Map joins (27 billion rows each)
  45. Performance comparison: Map join with 1, 2 & 3 tables (same query and table as slide 44). Callout: Spark-SQL is faster than Hive on Tez and Hive on Spark for map joins.
  46. Performance comparison: Shuffle join + group by
        ● select count(*) from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number
        ● select count(*) rowcount from store_sales a, store_returns b where a.ss_item_sk = b.sr_item_sk and a.ss_ticket_number = b.sr_ticket_number group by ss_item_sk, ss_ticket_number having rowcount > 1
        Execution engine    Shuffle join   Shuffle join + group by   Tez shuffle gain %   Tez gain %
        Hive on Tez         400            453
        Hive on Spark       1,078          1,120                     170%                 147%
        Hive on MapReduce   756            826                       89%                  82%
        Spark-SQL           1,835          1,884                     359%                 316%
      Plan operators: Table Scan (8.6 billion rows); Table Scan (6 million rows); Shuffle Join (9 billion rows); Group by (675 million rows); Filter (675 million rows); Select (0 rows)
  47. Performance comparison: Shuffle join + group by (same queries, table and chart as slide 46). Callouts: for the shuffle join, Tez is 170% faster than Hive on Spark, 89% faster than MapReduce and 359% faster than Spark-SQL. For the shuffle join + group by, Tez is 147% faster than Hive on Spark, 82% faster than MapReduce and 316% faster than Spark-SQL.
  48. Performance comparison: Shuffle join + group by (same queries, table and chart as slide 46). Callout: Why are shuffles so slow for Hive on Spark and Spark-SQL?
  49. Performance comparison: Shuffle join cluster CPU utilization
  50. Performance comparison: Shuffle join cluster CPU utilization. Callout: Hive on Tez query ends here.
  51. Performance comparison: Shuffle join cluster CPU utilization. Callout: Hive on Spark query ends here.
  52. Performance comparison: Primitive results summary
  53. Performance comparison: Performance summary. Items marked (+) are strengths; the rest are weaknesses. Short running query (+); ETL (+); Large joins and aggregates (+); Slower than Spark-SQL in map joins; High GC; Instability; SQL support limited compared to Hive; Lack of sophisticated query optimizer; Efficient resource utilization (+); Map join performance (+); Large joins; Outperforms Spark-SQL in large join (+); Slower than Tez for large joins and aggregates; High GC; Promising initial release (+). Engine labels on the slide: Hive Tez, Spark-SQL, Hive on Spark, MapReduce.
  54. Solving Hive’s Top Performance Challenges
  55. Apache Hive: Modern Architecture (layer diagram). Storage: columnar storage (ORCFile, Parquet); unstructured data (JSON, CSV, Text, Avro, custom weblog). Engine: SQL engines (row engine, vector engine). SQL: SQL support (SQL:2011, optimizer, HCatalog, HiveServer2). Cache: block cache (Linux cache); vector cache (LLAP persistent server). Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark). Legend: historical, current, in development.
  56. Apache Hive: Getting to Sub-Second (same diagram as slide 55). Improvement: LLAP: persistent servers cache vectors and start queries instantly. Pluggable integrations with Tez or Spark.
  57. Apache Hive: Getting to Sub-Second (same diagram as slide 55). Improvement: Vectorized hash join solves CPU boundedness for Hive on Tez or on Spark.
  58. Apache Hive: Getting to Sub-Second (same diagram as slide 55). Improvement: Improved metadata catalog allows instant query planning and optimization for any engine.
  59. Apache Hive’s Sub-Second Future. Metadata (fast, scalable metadata catalog) + Persistent Server (LLAP) + SQL Engine (vectorized hash join) + Choice of Execution Engines (Tez or Spark) = Sub-Second Hive
  60. Questions? Interested? Stop by the Hortonworks booth to learn more.
  61. Endnotes
      (1) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
      (2) https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
      (3) http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/
      (4) http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
      (5) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
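
The “Tez Gain %” columns in the benchmark slides are not defined in the deck. The CTAS figures on slide 34 are consistent with reading the gain as the extra elapsed time an engine needs relative to Hive on Tez, so the following is a minimal sketch under that assumption. The function names `tez_gain_pct` and `time_ratio_pct` are illustrative and do not come from the presentation; the elapsed times are copied from slides 34 and 44.

```python
def tez_gain_pct(engine_seconds, tez_seconds):
    """Extra elapsed time of an engine relative to Hive on Tez, in percent.

    Assumed definition: (engine - tez) / tez * 100, rounded to an integer.
    """
    return round((engine_seconds - tez_seconds) / tez_seconds * 100)


def time_ratio_pct(engine_seconds, tez_seconds):
    """Engine elapsed time as a percentage of the Hive on Tez time."""
    return round(engine_seconds / tez_seconds * 100)


# Elapsed times in seconds for the CTAS primitive (slide 34).
ctas = {
    "Hive on Tez": 316,
    "Hive on Spark": 351,
    "Hive on MapReduce": 494,
    "Spark-SQL": 418,
}

ctas_gains = {
    engine: tez_gain_pct(seconds, ctas["Hive on Tez"])
    for engine, seconds in ctas.items()
    if engine != "Hive on Tez"
}
print(ctas_gains)

# Map join #1 (slide 44): the printed 98% for Hive on Spark matches the
# time ratio, while the printed -20% for Spark-SQL matches the gain.
print(time_ratio_pct(106, 108), tez_gain_pct(106, 108), tez_gain_pct(86, 108))
```

Under this reading, “Tez is 11% faster than Spark” means that Hive on Spark took about 11% longer than Hive on Tez for the same query, not that Tez finished in 11% less time.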