During the second half of 2016, IBM built a state-of-the-art Hadoop cluster with the aim of running massive-scale workloads. The amount of data available to derive insights continues to grow exponentially in this increasingly connected era, resulting in larger and larger data lakes year after year. SQL remains one of the most commonly used languages for such analysis, but how do today’s SQL-over-Hadoop engines stack up against truly BIG data? To find out, we ran a derivative of the popular TPC-DS benchmark against a 100 TB dataset, which stresses both the performance and the SQL support of data warehousing solutions! Over the course of the project, we encountered a number of challenges, including poor query execution plans, uneven distribution of work, and out-of-memory errors. Join this session to learn how we tackled these challenges and what tuning was required across the various layers of the Hadoop stack (including HDFS, YARN, and Spark) to run SQL-on-Hadoop engines such as Spark SQL 2.0 and IBM Big SQL at scale!
Simon Harris, Cognitive Analytics, IBM Research