SQL on Hadoop 100TB TPC-DS Benchmark


Slides presented by Victor Hatinguais at CA Village Big Data & Data Science meetup on the 20th of March 2017

Published in: Technology

  1. A Performance Study: SQL-on-Hadoop with TPC-DS queries (Hadoop-DS). Analytics Performance. © 2017 IBM Corporation
  2. Acknowledgements and Disclaimers
     Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
     © Copyright IBM Corporation 2017. All rights reserved. U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
     IBM, the IBM logo, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
     ▪ TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
     ▪ Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
     ▪ Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
     ▪ Other company, product, or service names may be trademarks or service marks of others.
  3. What is TPC-DS?
     ▪ TPC = Transaction Processing Performance Council
       • Non-profit corporation (vendor independent)
       • Defines various industry-driven database benchmarks
     ▪ DS = Decision Support
       • Models a multi-domain data warehouse environment for a hypothetical retailer: Retail Sales, Web Sales, Inventory, Demographics, Promotions
     ▪ Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
     ▪ 99 pre-defined queries
     ▪ Query classes: Reporting, Ad Hoc, Iterative OLAP, Data Mining
  4. What’s the Best Solution for ANALYTICAL SQL ON HADOOP?
  5. Thinking of moving BI workloads from Data Warehouse to Hadoop? Radiant Advisors (sponsored by Teradata, Q2 2016): Presto, Impala, Hive and Spark SQL (pre-2.0)
  6. Latest Benchmarks Direct from Cloudera / Hortonworks SQL are not much better.
     ▪ Cloudera, Sept 2016, Impala 2.6 on AWS: claims 42% more performant than AWS Redshift. TPC-DS queries: 70 query subset. Data volume: 3TB
     ▪ Cloudera, August 2016, Impala 2.6: claims 22% faster for TPC-DS than previous version. TPC-DS queries: 17 queries referenced. Data volume: not specified
     ▪ Cloudera, April 2016, Impala 2.5: claims 4.3x faster for TPC-DS than previous version. TPC-DS queries: 24 query subset. Data volume: 15TB *1
     ▪ Hortonworks, July 2016, Hive 2.1 with LLAP: claims 25x faster for TPC-DS than Hive 1.2. TPC-DS queries: 15 query subset. Data volume: 1TB
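To put the query subsets on this slide in perspective, each can be expressed as a share of the 99 queries that TPC-DS defines. The following is a minimal sketch; the dictionary keys and the function name `coverage_percent` are illustrative labels, not anything taken from the vendors' reports.

```python
# Query subsets quoted on the slide, measured against the 99 queries
# that TPC-DS defines. Dictionary keys are illustrative labels only.
TOTAL_TPCDS_QUERIES = 99

vendor_subsets = {
    "Impala 2.6 on AWS (Sept 2016)": 70,
    "Impala 2.6 (August 2016)": 17,
    "Impala 2.5 (April 2016)": 24,
    "Hive 2.1 with LLAP (July 2016)": 15,
}


def coverage_percent(queries_run: int, total: int = TOTAL_TPCDS_QUERIES) -> float:
    """Share of the full query set that a benchmark exercised, in percent."""
    return 100.0 * queries_run / total


for name, count in vendor_subsets.items():
    print(f"{name}: {count}/{TOTAL_TPCDS_QUERIES} = {coverage_percent(count):.1f}%")
```

Even the largest subset listed, 70 queries, leaves roughly 29% of the query set unexercised.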
  7. But there is good news… SPARK RUNS ALL 99 QUERIES
  9. IBM Leadership in Spark SQL and ML. Major focus areas include Spark SQL and ML. Statistics as of February 1, 2017.
  10. IBM Shared Experiences running 99 TPC-DS queries (Oct 2016) @ Spark Summit Brussels. 10 TB Scale Factor.
  11. Spark 2.1 shows continued improvement… WHAT WOULD IT TAKE TO RUN 100 TB? IBM delivers the most complete benchmark by any vendor for SQL on Hadoop with 10X more data.
  12. 100TB TPC-DS is BIG data
  13. Benchmark Environment: IBM “F1” Spark SQL Cluster
     ▪ 28 Nodes Total (Lenovo x3640 M5)
     ▪ Each configured as:
       • 2 sockets (18 cores/socket)
       • 1.5 TB RAM
       • 8x 2TB SSD
     ▪ 2 Racks
       • 20x 2U servers per rack (42U racks)
     ▪ 1 Switch, 100GbE, 32 ports Mellanox SN2700
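The per-node figures on this slide can be rolled up into cluster-wide totals. This is a rough sketch using only the numbers listed above; the constant and function names are illustrative, and the result counts physical cores and raw SSD capacity only (the slide says nothing about hyper-threading or storage replication).

```python
# Per-node figures as listed on the slide; names are illustrative.
NODES = 28
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18
RAM_TB_PER_NODE = 1.5
SSDS_PER_NODE = 8
SSD_TB_EACH = 2


def cluster_totals() -> dict:
    """Aggregate the per-node hardware figures across the whole cluster."""
    return {
        "physical_cores": NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET,
        "ram_tb": NODES * RAM_TB_PER_NODE,
        "raw_ssd_tb": NODES * SSDS_PER_NODE * SSD_TB_EACH,
    }


totals = cluster_totals()
print(totals)
```

Under these figures the cluster has 1,008 physical cores, 42 TB of RAM and 448 TB of raw SSD capacity.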
  14. SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE
     ▪ PERFORMANCE
     ▪ WORKING QUERIES
     ▪ COMPRESSION: 60% SPACE SAVED WITH PARQUET
     ▪ Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB Scale
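As a worked example of what the compression figure implies, the sketch below converts a "space saved" percentage into an on-disk size and a compression ratio. It assumes the 60% is measured against the 100 TB raw data set, which the slide does not state explicitly; the function names are illustrative.

```python
RAW_DATA_TB = 100.0        # scale factor of the benchmark
SPACE_SAVED_FRACTION = 0.60  # "60% space saved with Parquet" (baseline assumed to be raw data)


def stored_size_tb(raw_tb: float, saved_fraction: float) -> float:
    """On-disk size after compression, given the fraction of space saved."""
    return raw_tb * (1.0 - saved_fraction)


def compression_ratio(saved_fraction: float) -> float:
    """Raw size divided by stored size."""
    return 1.0 / (1.0 - saved_fraction)


print(f"stored: {stored_size_tb(RAW_DATA_TB, SPACE_SAVED_FRACTION):.1f} TB")
print(f"ratio:  {compression_ratio(SPACE_SAVED_FRACTION):.2f}:1")
```

Under that assumption, the data set would occupy about 40 TB in Parquet, a ratio of about 2.5:1.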
  15. But… is this a good result? WHAT CAN WE COMPARE IT TO?
  16. Big SQL also runs TPC-DS queries… The following benchmark results used the same hardware as the Spark SQL F1 cluster, running Big SQL v4.3 Technical Review.
  17. Query Compliance Through the Scale Factors
     ▪ SQL compliance is important because Business Intelligence tools generate standard SQL
       • Rewriting queries is painful and impacts productivity
     Spark SQL
     ▪ Spark SQL 2.1 can run all 99 TPC-DS queries but only at lower scale factors
     ▪ Spark SQL Failures @ 100 TB:
       • 12 runtime errors
       • 4 timeouts (> 10 hours)
     Big SQL
     ▪ Big SQL has been successfully executing all 99 queries since Oct 2014
     ▪ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
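The failure counts on this slide determine how many queries Spark SQL completed at 100 TB. A short sketch of the arithmetic follows (variable names are illustrative):

```python
TOTAL_QUERIES = 99
RUNTIME_ERRORS = 12
TIMEOUTS = 4  # queries that ran longer than 10 hours

failed = RUNTIME_ERRORS + TIMEOUTS
working = TOTAL_QUERIES - failed
pass_rate = 100.0 * working / TOTAL_QUERIES

print(f"failed: {failed}, working: {working}, pass rate: {pass_rate:.1f}%")
```

The result, 83 working queries, matches the "Spark SQL @ 83 queries" figure quoted on slide 20.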
  18. CPU Profile for Big SQL vs. Spark SQL, Hadoop-DS @ 100TB, 4 Concurrent Streams. Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles. Average CPU Utilization: 76.4%; Average CPU Utilization: 88.2%
  19. I/O Profile for Big SQL vs. Spark SQL, Hadoop-DS @ 100TB, 4 Concurrent Streams. Spark SQL required 3.6X more reads and 9.5X more writes. Big SQL can drive peak I/O nearly 2X higher.
  20. Big SQL is 3.2X faster than Spark 2.1 (4 Concurrent Streams). Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries.
  21. And the best part… Big SQL still has A LOT OF POTENTIAL
  22. Memory Profile for Big SQL vs. Spark SQL, Hadoop-DS @ 100TB, 4 Concurrent Streams (charts: Big SQL, Spark SQL)
     ▪ Big SQL only actively using ~1/3rd of memory
       • More memory could be assigned to bufferpools, sort space, etc.
       • Big SQL could be even faster!
     ▪ Spark SQL is doing a better job at utilizing the available memory, but consequently has less room for improvement via tuning
  23. But this is not about Big SQL vs. Spark… BIG SQL + SPARK IS A GREAT COMBINATION
  24. Recommendation: Right Tool for the Right Job
     ▪ Big SQL: ideal tool for BI Data Analysts and production workloads
       • Migrating existing workloads to Hadoop
       • Security
       • Many Concurrent Users
       • Best Performance
     ▪ Spark SQL: ideal tool for Data Scientists and discovery
       • Machine Learning
       • Simpler SQL
       • Good Performance
     ▪ Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster
  25. Big SQL – The ONLY engine with Deep Integration with Spark. Diagram: HDFS; Big SQL Head Node; four Big SQL Workers; four Spark Executors. Legend: fast data transfer over shared memory.
  26. Summary: IBM is investing in Big SQL and Spark SQL
     ▪ Only Big SQL completes all 99 queries with concurrency at 100TB
     ▪ Big SQL completes the workload:
       • 3.2x faster than Spark SQL
       • With less than 3x the CPU resources
       • With 11x fewer read ops and 24x fewer write ops
     ▪ IBM is investing massively in Spark SQL
     ▪ To learn more:
       • spark-sql-at-100tb/