Impala 2.0 
The Leading Analytic Database for Hadoop 
Justin Erickson | Director, Product Management
© 2014 Cloudera, Inc. All rights reserved. 2 
Notification 
• The information in this document is proprietary to Cloudera. No part of this document may be 
reproduced, copied or transmitted in any form for any purpose without the express prior written 
permission of Cloudera. 
• This document is a preliminary version and not subject to your license agreement or any other 
agreement with Cloudera. This document contains only intended strategies, developments and 
functionalities of Cloudera products and is not intended to be binding upon Cloudera to any 
particular course of business, product strategy and/or development. Please note that this 
document is subject to change and may be changed by Cloudera at any time without notice. 
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not 
warrant the accuracy or completeness of the information, text, graphics, links or other items 
contained within this material. This document is provided without a warranty of any kind, either 
express or implied, including but not limited to the implied warranties of merchantability, fitness 
for a particular purpose or non-infringement. 
• Cloudera shall have no liability for damages of any kind including without limitation direct, 
special, indirect or consequential damages that may result from the use of these materials. The 
limitation shall not apply in cases of gross negligence.
Agenda 
• Impala Overview 
• Milestones and 2.0 Features 
• SQL-on-Hadoop Performance Update 
• What’s Next
The Right SQL Engine for the Use Case
• BI and SQL Analytics
• Batch Processing
• Spark Developers
Analytic Database for Hadoop Requires
• Multi-User Interactive Performance: Interaction at the speed of thought
• Compatibility: Familiar BI tools/SQL interfaces
• Usability: Accessible to a broad range of applications
• Flexibility: Use SQL along with other Hadoop frameworks across all data
• Native in Hadoop: Unified resource management, metadata, security, and management across frameworks
Impala’s Benefits
• Multi-User Interactive Performance ✔
  • 10x vs. alternatives in the latest benchmarks
  • Performance advantage increases with multiple users
• Compatibility ✔
  • Provides both ANSI SQL and vendor-specific extensions
  • Compatibility with the leading BI partners
• Usability ✔
  • Cost-based optimization allows more users and tools to run a broader range of queries
• Flexibility ✔
  • Supports the common native Hadoop file formats
  • Parquet provides best-of-breed columnar performance across Hadoop frameworks
• Native in Hadoop ✔
  • Unified with Hadoop’s resource management, metadata, security, and management
Most Common Scenarios
Single Platform for Data Processing and Analytics
• Interactive BI/analytics on “big data”
• Data discovery
• Exploratory analytics
• Queryable operational data store
[Diagram: platform stack. Engines: Batch Processing (MapReduce, Hive & Pig, …), Interactive SQL (Impala), Interactive Search (Solr), Interactive Analytics (SAS, R, …). Resource Management. Storage: HDFS (text, RCFile, Parquet, Avro, …) and HBase (records). Integration. Metadata. Management | Support.]
Most Common Use Cases 
Operational Dashboards 
Example: Healthcare Insurance Company 
Goal: 
• Visualizations of current hospital spending 
and comparison to peers and historical data 
• Integrate 1000s of client hospital purchasing 
systems 
Key benefits of Impala: 
• Simplification via unification 
• Saved license costs compared with a traditional DBMS
• Enabled finer-grained detail in the source data
vs. planned summarized extracts
• 3 nodes of Impala outperformed a rack of 
the traditional RDBMS on their workload 
Data Discovery 
Example: Major Financial Institution 
Goal: 
• Fraud group looking at internal / external fraud 
• Captured internal systems and external 
application/website logs 
Key benefits of Impala: 
• Flexibility to have more data readily 
available without upfront modeling 
• Ability to use existing BI visualization tools 
• Better TCO
Previous Key Milestones and Features
• Impala 1.0
  • ~SQL-92 (minus correlated subqueries)
  • Native Hadoop file formats (Parquet, Avro, text, SequenceFile, …)
  • Enterprise readiness (authentication, ODBC/JDBC drivers, etc.)
  • Service-level resource isolation with other Hadoop frameworks
• Impala 1.1
  • Fine-grained, role-based authorization via Apache Sentry
  • Auditing (Impala 1.1.1 and CM 4.7+)
• Impala 1.2
  • Custom language extensibility (UDFs, UDAFs)
  • Cost-based join-order optimization
  • On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
• Impala 1.3 / CDH 5.0 (a version for CDH 4.x is also available)
  • Resource management
• Impala 1.4 / CDH 5.1 (a version for CDH 4.x is also available)
  • More SQL compatibility (DECIMAL, vendor-specific extensions, ORDER BY without LIMIT, etc.)
  • HDFS caching
  • Faster performance (selective queries and COMPUTE STATS in particular)
Impala 2.0 Key Updates
• Same great multi-user interactive performance
• Removed limits on SQL compatibility
  • SQL:2003 analytic/window functions
  • Subqueries in the WHERE clause, including EXISTS and IN
  • Additional data types (CHAR and VARCHAR)
  • GRANT/REVOKE statements via Sentry
  • Additional vendor-specific SQL extensions
• Removed limits on query size
  • Disk-based query processing
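The subquery forms named in the list above (IN and EXISTS in the WHERE clause) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine; it is not Impala, and Impala's planning and behavior may differ. The tables `hospitals` and `purchases`, their columns, and the rows are all hypothetical.

```python
import sqlite3

# Stand-in engine: SQLite in memory, NOT Impala. All table names, column
# names, and rows are hypothetical and exist only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (id INTEGER, region VARCHAR(10))")
conn.execute("CREATE TABLE purchases (hospital_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO hospitals VALUES (?, ?)",
                 [(1, "east"), (2, "west"), (3, "east")])
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [(1, 100), (1, 250), (2, 75)])

# Uncorrelated subquery in the WHERE clause using IN:
# purchases made by hospitals in the 'east' region.
in_rows = conn.execute("""
    SELECT hospital_id, amount
    FROM purchases
    WHERE hospital_id IN (SELECT id FROM hospitals WHERE region = 'east')
    ORDER BY amount
""").fetchall()

# Correlated subquery in the WHERE clause using EXISTS:
# hospitals that have at least one purchase on record.
exists_rows = conn.execute("""
    SELECT h.id
    FROM hospitals h
    WHERE EXISTS (SELECT 1 FROM purchases p WHERE p.hospital_id = h.id)
    ORDER BY h.id
""").fetchall()

print(in_rows)
print(exists_rows)
conn.close()
```

The EXISTS query is the correlated kind: the inner query refers to `h.id` from the outer query, which is the form the Impala 1.0 milestone lists as unsupported.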
September SQL-on-Hadoop Benchmark: Impala, Presto, Stinger, Spark SQL
• Benchmarks on:
  • Impala (1.4.0)
  • Presto (0.74)
  • Stinger (final) phase 3, a.k.a. Hive 0.13.0
  • Spark SQL (1.1)
• As always, our public benchmarks are:
  • Based on industry standards (TPC)
  • Repeatable (https://github.com/cloudera/impala-tpcds-kit)
  • Methodical, with multiple runs on the same hardware
  • Designed to help competing software put its best foot forward
    • SQL-92 join style for engines without CBO
    • JVM tuning for Presto
    • Run on optimal file formats for each
• Full details on our blog: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/
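The "multiple runs on the same hardware" point can be sketched as a tiny timing loop. This is a minimal illustration, not the harness used for the published results (that lives in the repository linked above); `time_runs`, `summarize`, and `placeholder_query` are hypothetical names, and the placeholder workload stands in for submitting a real query to an engine.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(run_query: Callable[[], object], runs: int = 3) -> List[float]:
    """Call run_query `runs` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce repeated timings to a few robust summary numbers."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def placeholder_query() -> int:
    # Hypothetical stand-in for submitting one benchmark query to an engine.
    return sum(range(10_000))


durations = time_runs(placeholder_query, runs=5)
summary = summarize(durations)
print(summary)
```

Reporting the median alongside the minimum and maximum makes it easier to spot a run skewed by a cold cache or a noisy neighbor.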
Impala’s Multi-User Over 10x Faster: 
Gap widening compared to May’s update
Faster = More Work in Less Time: 
Impala enables over 8.7x throughput
Performance Takeaways
• Impala’s advantage expands from 5x single-user to >10x with just 10 users
• The performance gap has widened since May
  • Impala’s single-user lead over Presto went from 5x to 7.5x
  • Impala’s single-user lead over Hive/Tez went from 5x to 9x
• Mid-term trends will further favor Impala’s design approach
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets (e.g. floating-point operations, math operations, encrypt/decrypt)
  • The Intel joint roadmap helps support these opportunities
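For readers unsure how "Nx faster" and "Nx throughput" figures like those above are derived, here is a minimal arithmetic sketch. The function names are hypothetical and the timings are invented placeholders chosen only to make the arithmetic easy to follow; they are not results from the benchmark.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is: ratio of elapsed times."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def throughput_per_hour(queries_completed: int, elapsed_seconds: float) -> float:
    """Queries completed per hour over a measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return queries_completed * 3600.0 / elapsed_seconds


# Placeholder timings (NOT benchmark results): a query taking 50 s on a
# baseline engine and 10 s on the candidate is a 5x speedup.
single_user = speedup(baseline_seconds=50.0, candidate_seconds=10.0)

# Under concurrency both engines slow down; if the baseline degrades more,
# the ratio widens (here 300 s vs. 30 s, a 10x speedup).
ten_users = speedup(baseline_seconds=300.0, candidate_seconds=30.0)

# Throughput: 120 queries finished in 30 minutes is 240 queries per hour.
qph = throughput_per_hour(queries_completed=120, elapsed_seconds=1800.0)

print(single_user, ten_users, qph)
```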
IBM Research Validation
• New VLDB academic paper comparing Impala and Hive (on both MR and Tez) for SQL-on-Hadoop
  • http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
• Impala is significantly more efficient than Hive/Tez or Hive/MR
  • “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez based runtime”
  • Correctly attributes Impala’s lead to its CPU efficiency, IO manager, and overall architecture, which resembles a shared-nothing parallel database
• Parquet is more efficient than ORC
  • “The Parquet format skips data more efficiently than ORC which tends to pre-fetch unnecessary data especially when a table contains a large number of columns”
• Note: the paper is single-user only. Multi-user would make the gap even wider
  • Our published results show the ~5x single-user Impala lead growing to ~10x with just 10 users: http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
  • Same CPU efficiency, IO manager, and overall architectural reasons
• Additional notes:
  • Impala 2.0 will have disk-based joins and aggregations
  • Impala 1.4 is significantly faster on selective joins than the Impala 1.2.2 used in the paper
Impala’s Analytic Database Leadership
1. More than 1 million downloads since GA
2. Majority adoption across Cloudera EDH customers
3. Certification across key application partners
4. De facto standard with multi-vendor support
5. Full Apache open source license
What’s Next?
• Usability
  • Nested data structures for greater data flexibility and expressiveness than traditional RDBMSs
  • Ability to run on data natively stored in Amazon S3
  • Advanced security with lineage tracking and query redaction from logs
  • Built-in abilities for data maintenance and updates
• Compatibility
  • Continued additions of commonly used vendor-specific built-ins
  • Continued joint development with BI partners
  • More advanced SQL:2010 set features
• Multi-User Performance
  • Focus on even better multi-user concurrency
  • Continued performance increases and leadership
It’s Not Just About SQL-on-Hadoop
The Platform for Big Data
• Single platform for processing & analytics
• Scales to thousands of servers
• No upfront schema
• 10% of the cost per TB
• Open source platform
[Diagram: the platform stack shown under Most Common Scenarios (engines, resource management, storage, integration, metadata, management and support).]
Try Impala Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  • Live online
  • VM
  • Installation
• Questions/comments?
  • Community: http://impala.io/community
  • Email: impala-user@cloudera.org
Questions?


Editor's Notes

  • #5 Our goal is to provide the best tools for a particular job:
    • Hive is the best for batch, and of course we want to make that experience better.
    • Impala is purpose-built for interactive BI on Hadoop: latency, concurrency, vendor ecosystem, and partner certification.
    • Spark SQL is built to support an advanced analyst’s direct interactions with data, where you’re mixing Spark and SQL.
  • #6
    • Multi-user performance: enables BI users and analysts to interact with Hadoop data at the speed of thought
    • Compatibility: provides familiar BI tools/applications and SQL interfaces
    • Usability: accessible to the broad range of business users, analysts, and partner applications
    • Flexibility: gives users access to more data and the ability to use SQL along with the rest of the Hadoop frameworks across all their data
    • Native in Hadoop: easier, integrated administration with unified resource management, metadata, security, and management across frameworks
  • #7
    • Multi-user interactive performance: 10x vs. alternatives in the latest benchmarks
    • Broad SQL compatibility: provides both ANSI SQL and vendor-specific extensions; compatibility with the leading BI partners
    • Usability: cost-based optimization allows more users and tools to run a broader range of queries
    • Flexibility: supports the common native Hadoop file formats; Parquet provides best-of-breed columnar performance across Hadoop frameworks
    • Native in Hadoop: unified with Hadoop’s resource management, metadata, security, and management