Impala 2.0 - The Best Analytic Database for Hadoop

Impala 2.0
The Leading Analytic Database for Hadoop
Justin Erickson | Director, Product Management

© 2014 Cloudera, Inc. All rights reserved.
Notification
• The information in this document is proprietary to Cloudera. No part of this document may be
reproduced, copied or transmitted in any form for any purpose without the express prior written
permission of Cloudera.
• This document is a preliminary version and not subject to your license agreement or any other
agreement with Cloudera. This document contains only intended strategies, developments and
functionalities of Cloudera products and is not intended to be binding upon Cloudera to any
particular course of business, product strategy and/or development. Please note that this
document is subject to change and may be changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not
warrant the accuracy or completeness of the information, text, graphics, links or other items
contained within this material. This document is provided without a warranty of any kind, either
express or implied, including but not limited to the implied warranties of merchantability, fitness
for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct,
special, indirect or consequential damages that may result from the use of these materials. The
limitation shall not apply in cases of gross negligence.

Agenda
• Impala Overview
• Milestones and 2.0 Features
• SQL-on-Hadoop Performance Update
• What’s Next

The Right SQL Engine for the Use Case
• BI and SQL analytics
• Batch processing
• Spark developers

Analytic Database for Hadoop Requires
• Multi-User Interactive Performance: Interaction at the speed of thought
• Compatibility: Familiar BI tools/SQL interfaces
• Usability: Accessible to a broad range of applications
• Flexibility: Use SQL along with other Hadoop frameworks across all data
• Native in Hadoop: Unified resource management, metadata, security, and management across frameworks

Impala’s Benefits
• Multi-User Interactive Performance ✔
  • 10x faster than alternatives in the latest benchmarks
  • Performance advantage increases with multi-user workloads
• Compatibility ✔
  • Provides both ANSI SQL and vendor-specific extensions
  • Compatibility with the leading BI partners
• Usability ✔
  • Cost-based optimization allows more users and tools to run a broader range of queries
• Flexibility ✔
  • Supports the common native Hadoop file formats
  • Parquet provides best-of-breed columnar performance across Hadoop frameworks
• Native in Hadoop ✔
  • Unified with Hadoop’s resource management, metadata, security, and management

Most Common Scenarios
Single Platform for Data Processing and Analytics
• Interactive BI/analytics on “big data”
• Data discovery
• Exploratory analytics
• Queryable operational data store
[Diagram: platform stack. Engines: Batch Processing (MapReduce, Hive & Pig, …), Interactive SQL (Impala), Interactive Search (Solr), Interactive Analytics (SAS, R, …). Layers: Resource Management; Storage: HDFS (Text, RCFile, Parquet, Avro, …) and HBase (records); Integration; Metadata; Management | Support.]

Most Common Use Cases
Operational Dashboards
Example: Healthcare Insurance Company
Goal:
• Visualizations of current hospital spending
and comparison to peers and historical data
• Integrate 1000s of client hospital purchasing
systems
Key benefits of Impala:
• Simplification via unification
• Saved license costs compared with a traditional DBMS
• Enabled finer-grain details in source data
vs. planned summarized extracts
• 3 nodes of Impala outperformed a rack of
the traditional RDBMS on their workload
Data Discovery
Example: Major Financial Institution
Goal:
• Fraud group looking at internal / external fraud
• Captured internal systems and external
application/website logs
Key benefits of Impala:
• Flexibility to have more data readily
available without upfront modeling
• Ability to use existing BI visualization tools
• Better TCO

Previous Key Milestones and Features
• Impala 1.0
  • ~SQL-92 (minus correlated sub-queries)
  • Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
  • Enterprise-readiness (authentication, ODBC/JDBC drivers, etc.)
  • Service-level resource isolation with other Hadoop frameworks
• Impala 1.1
  • Fine-grained, role-based authorization via Apache Sentry
  • Auditing (Impala 1.1.1 and CM 4.7+)
• Impala 1.2
  • Custom language extensibility (UDFs, UDAFs)
  • Cost-based join-order optimization
  • On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
• Impala 1.3 / CDH 5.0 (also has version for CDH 4.x)
  • Resource management
• Impala 1.4 / CDH 5.1 (also has version for CDH 4.x)
  • More SQL compatibility (DECIMAL, vendor-specific extensions, ORDER BY without LIMIT, etc.)
  • HDFS caching
  • Faster performance (selective queries and compute stats in particular)

Impala 2.0 Key Updates
• Same great multi-user interactive performance
• Removed limits on SQL compatibility
  • SQL:2003 analytic/window functions
  • Subqueries in WHERE clause, EXISTS, and IN
  • Additional data types (CHAR and VARCHAR)
  • GRANT/REVOKE statements via Sentry
  • Additional vendor-specific SQL extensions
• Removed limits on query size
  • Disk-based query processing
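As a rough illustration of the SQL features listed above (analytic/window functions and subqueries with IN and EXISTS in the WHERE clause), here is a minimal sketch. It is not Impala code: it uses Python's built-in sqlite3 module as a local stand-in so the queries can be run without a cluster (window functions need SQLite 3.25 or later). The tables, columns, and values are hypothetical, and the exact syntax Impala accepts should be checked against the Impala documentation.

```python
import sqlite3

# SQLite is only a stand-in engine here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region VARCHAR(10), rep VARCHAR(10), amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'ann', 300), ('east', 'bob', 200),
        ('west', 'cat', 500), ('west', 'dan', 100);
    CREATE TABLE targets (region VARCHAR(10), goal INTEGER);
    INSERT INTO targets VALUES ('east', 250), ('west', 700);
""")

# Analytic (window) function: rank reps within each region by amount.
ranked = conn.execute("""
    SELECT region, rep,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()

# Uncorrelated subquery with IN: reps in regions whose goal exceeds 500.
in_rows = conn.execute("""
    SELECT rep FROM sales
    WHERE region IN (SELECT region FROM targets WHERE goal > 500)
    ORDER BY rep
""").fetchall()

# Correlated subquery with EXISTS: regions where some single sale met the goal.
exists_rows = conn.execute("""
    SELECT t.region FROM targets t
    WHERE EXISTS (SELECT 1 FROM sales s
                  WHERE s.region = t.region AND s.amount >= t.goal)
    ORDER BY t.region
""").fetchall()

print(ranked)
print(in_rows)
print(exists_rows)
```

The three queries are independent of each other; each one exercises one of the SQL features named in the list.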

September SQL-on-Hadoop Benchmark:
Impala, Presto, Stinger, Spark SQL
• Benchmarks on:
  • Impala (1.4.0)
  • Presto (0.74)
  • Stinger final (phase 3), i.e. Hive 0.13.0
  • Spark SQL (1.1)
• As always, our public benchmarks are:
  • Based on industry standards (TPC)
  • Repeatable (https://github.com/cloudera/impala-tpcds-kit)
  • Methodical, with multiple runs on the same hardware
  • Designed to help competing software put its best foot forward
    • SQL-92 join style for engines without CBO
    • JVM tuning for Presto
    • Run on optimal file formats for each
• Full details on our blog: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/

Impala’s Multi-User Over 10x Faster:
Gap widening compared to May’s update

Faster = More Work in Less Time:
Impala enables over 8.7x throughput

Performance Takeaways
• Impala’s advantage expands from 5x single-user to >10x with just 10 users
• The performance gap has widened since May
  • Impala’s single-user lead over Presto grew from 5x to 7.5x
  • Impala’s single-user lead over Hive/Tez grew from 5x to 9x
• Mid-term trends will further favor Impala’s design approach
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
  • The Intel joint roadmap helps support these opportunities
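For readers unfamiliar with how "Nx faster" and "Nx throughput" figures are derived, the arithmetic can be sketched as follows. The function names and all timings below are hypothetical and chosen only to make the ratios easy to follow; they are not results from the benchmark.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def throughput_per_hour(queries_completed: int, elapsed_seconds: float) -> float:
    """Return completed queries per hour for a run of the given length."""
    return queries_completed * 3600.0 / elapsed_seconds


# Hypothetical single-user run: the baseline engine takes 50 s, the candidate 10 s.
single_user = speedup(50.0, 10.0)

# Hypothetical 10-user run: the baseline degrades to 300 s, the candidate to 25 s,
# so the gap widens even though both engines slow down under load.
multi_user = speedup(300.0, 25.0)

# Hypothetical throughput: 120 queries finished in a 30-minute window.
qph = throughput_per_hour(120, 1800.0)

print(single_user, multi_user, qph)
```

A lead that grows under concurrency means the candidate's response time degrades more slowly than the baseline's as users are added.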

IBM Research Validation
• New VLDB academic paper comparing Impala and Hive (on both MR and Tez) for SQL-on-Hadoop
  • http://www.vldb.org/pvldb/vol7/p1295-floratou.pdf
• Impala is significantly more efficient than Hive/Tez or Hive/MR
  • “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce or Tez based runtime”
  • Correctly attributes Impala’s lead to its CPU efficiency, IO manager, and overall architecture that resembles a shared-nothing parallel database
• Parquet is more efficient than ORC
  • “The Parquet format skips data more efficiently than ORC which tends to pre-fetch unnecessary data especially when a table contains a large number of columns”
• Note: The paper is single-user only. Multi-user would make the gap even wider
  • Our published results show the ~5x single-user Impala lead growing to ~10x with just 10 users; see our blog: http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
  • Same CPU efficiency, IO manager, and overall architectural reasons
• Additional notes:
  • Impala 2.0 will have disk-based joins and aggregations
  • Impala 1.4 is significantly faster on selective joins than the Impala 1.2.2 used in the paper

Impala’s Analytic Database Leadership
1. More than 1 million downloads since GA
2. Majority adoption across Cloudera EDH customers
3. Certification across key application partners (partner logos not reproduced)
4. De facto standard with multi-vendor support (vendor logos not reproduced)
5. Full Apache Open Source License

What’s Next?
• Usability
  • Nested data structures for greater data flexibility and expressiveness than traditional RDBMS systems
  • Ability to run on data natively stored in Amazon S3
  • Advanced security with lineage tracking and query redaction from logs
  • Built-in abilities for data maintenance and updates
• Compatibility
  • Continued additions of commonly used vendor-specific built-ins
  • Continued joint development with BI partners
  • More advanced SQL:2010 set features
• Multi-User Performance
  • Focus on even better multi-user concurrency
  • Continued performance increases and leadership

It’s Not Just About SQL-on-Hadoop
The Platform for Big Data
• Single platform for processing & analytics
• Scales to thousands of servers
• No upfront schema
• 10% of the cost per TB
• Open source platform
[Diagram: platform stack. Engines: Batch Processing (MapReduce, Hive & Pig, …), Interactive SQL (Impala), Interactive Search (Solr), Interactive Analytics (SAS, R, …). Layers: Resource Management; Storage: HDFS (Text, RCFile, Parquet, Avro, …) and HBase (records); Integration; Metadata; Management | Support.]

Try Impala Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
• Live online
• VM
• Installation
• Questions/comments?
• Community: http://impala.io/community
• Email: impala-user@cloudera.org
