Bay Area Impala User Group Meetup (Sept 16 2014)

Impala Product Update
Justin Erickson | Director, Product Management
September 2014
©2014 Cloudera, Inc. All Rights Reserved.

Agenda
• Impala releases
• Impala roadmap
• Perf update

Key Milestones and Features
• Impala 1.0
• ~SQL-92 (minus correlated sub-queries)
• Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
• Enterprise readiness (authentication, ODBC/JDBC drivers, etc.)
• Service-level resource isolation with other Hadoop frameworks
• Impala 1.1
• Fine-grained, role-based authorization via Apache Sentry
• Auditing (Impala 1.1.1 and CM 4.7+)
• Impala 1.2
• Custom language extensibility (UDFs, UDAFs)
• Cost-based join-order optimization
• On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
• Impala 1.3 / CDH 5.0 (also available for CDH 4.x)
• Resource management

Just Released
Impala 1.4 / CDH 5.1 (also available for CDH 4.x)
• Additional SQL:
• DECIMAL data type
• Additional built-in functions from EDW
• ORDER BY without LIMIT
• Continued performance gains:
• HDFS caching support (CDH 5 only)
• Faster selective joins
• Faster COMPUTE STATS

Impala near-term roadmap
Targeted for Impala 2.0 (fall 2014):
• Additional SQL:
• Analytic/window functions
• Subqueries in the WHERE clause
• Additional data types (VARCHAR, CHAR)
• Disk-based joins and aggregations
• GRANT/REVOKE
Considerations for Impala 2.x (priority and inclusion based on your feedback):
• Nested/complex types (next highest priority)
• Navigator Lineage
• Updates via MERGE
• Incremental stats
• Additional SQL functions (GROUPING, ROLLUP, CUBE, MINUS, INTERSECT built-ins, etc.)
• UDTFs
• Intra-node parallel joins and aggregations
• Even faster performance
• S3 integration

SQL-on-Hadoop benchmark:
Impala, Presto, Stinger, Spark SQL
• Upcoming benchmarks on latest versions of:
• Impala (1.4.0)
• Presto (0.74)
• Stinger final phase (phase 3), aka Hive 0.13.0
• Spark SQL (1.1)
• Published with smaller memory configuration (64 GB / node)
• Demonstrates that Impala's lead is independent of memory size
• Dropped Shark, given its retirement in favor of Hive-on-Spark
• As always, our public benchmarks are:
• Based on industry standards (TPC)
• Repeatable (https://github.com/cloudera/impala-tpcds-kit)
• Methodical, with multiple runs on the same hardware
• Set up to help competing software put its best foot forward:
• SQL-92 join style for engines without CBO
• JVM tuning for Presto
• Run on optimal file formats for each

Impala’s Multi-User over 10x faster:
Gap widening compared to May’s update

Faster = more work in less time:
Impala enables over 8.7x throughput

Performance Takeaways
• Impala’s advantage expands from 5x for a single user to >10x with just 10 users
• The performance gap has widened since May
• Single-user advantage over Presto went from 5x to 7.5x
• Single-user advantage over Hive/Tez went from 5x to 9x
• Mid-term trends will further favor Impala’s design approach
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
• CPU efficiency will increase in importance
• Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
• The Intel joint roadmap helps support these opportunities

Try It Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
• Live online
• VM
• Installation
• Questions/comments?
• Community: http://impala.io/community
• Email: impala-user@cloudera.org

Real Time Audience Dashboard
September 2014

Introduction
Tubular Labs
SAAS platform for online video audience development
(e.g. Big Data for YouTube videos)
David Koblas
VP Engineering, Tubular Labs

Overview
This presentation covers the work Tubular Labs has done to use Impala as one of the core components of our SAAS platform. We'll go through the pipeline for getting data into the system, how we've distributed responsibility across AWS instances, and other tips and tricks for getting real-time responses to our end-user queries over billions of data points.

User Story: Audience Also Watches
For any YouTube video, can we figure out who the audience is and what other videos and channels they are watching?
We also want the ability to slice the audience by demographic information.
…and have it all run interactively from a web SAAS platform.

Technology Options
• Pre-compute (e.g. Map/Reduce)
• MySQL or similar
• Data Warehouse
• Impala or Redshift
• Homebrew

Impala 0.7
Now we have a technology
…
Make it interactive
…
and make a bet on Cloudera

Now We Have A Technology
Time To Make It Fast
and Economical
Source: Tubular Labs

Pipeline
Loading
• Sqoop
- collect data from MySQL
• Hive
- preprocess data
Query
• Impala
- interactive display
• Python
- REST endpoint
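
As a rough illustration of the last stage, here is a minimal sketch of a Python REST endpoint sitting in front of Impala. It is not Tubular Labs' code: the `/also-watches` path, the `video_id` parameter, and the `run_query` callable (a stand-in for whatever Impala client the service uses) are all assumptions made for the example. A fake backend is wired in so the sketch runs without a cluster.

```python
import json
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults


def make_app(run_query):
    """Build a WSGI app around run_query(video_id) -> list of dicts.

    run_query is a hypothetical stand-in for an Impala client call.
    """
    def app(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        video_id = params.get("video_id", [None])[0]
        headers = [("Content-Type", "application/json")]
        if environ.get("PATH_INFO") != "/also-watches" or video_id is None:
            start_response("400 Bad Request", headers)
            return [json.dumps({"error": "expected /also-watches?video_id=ID"}).encode()]
        rows = run_query(video_id)
        start_response("200 OK", headers)
        return [json.dumps({"video_id": video_id, "results": rows}).encode()]
    return app


def call(app, path, query=""):
    """Invoke the WSGI app in-process and return (status, decoded JSON body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    environ["QUERY_STRING"] = query
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


def fake_query(video_id):
    """Fake backend returning a fixed row instead of querying Impala."""
    return [{"channel": "example-channel", "viewers": 42}]


app = make_app(fake_query)
```

To serve it locally, the same `app` object could be passed to `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Injecting `run_query` keeps the HTTP layer testable independently of the query engine.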

AWS EC2: Node types
• m1.xlarge
- 1.6TB of Instance Storage
- slow IO
• hi1.4xlarge
- 2TB of SSD
- expensive
Note: this would be an i2.4xlarge instance today

Managing costs
Problem
• hi1.4xlarge - expensive
• m1.xlarge - slow IO
Solution – HDFS rack replication for separation
• A copy of the data on each rack
• Hive creates tables on m1.xlarge instances
• Impala queries on hi1.4xlarge instances
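
One way this separation might be set up is with Hadoop rack awareness: a topology script (configured through `net.topology.script.file.name`) receives host names or IP addresses as arguments and prints one rack path per argument. The sketch below assumes a hypothetical host-naming convention (`hive-*` for m1.xlarge nodes, `impala-*` for hi1.4xlarge nodes) and hypothetical rack names; the slides do not say how the racks were actually assigned. Hadoop often passes IP addresses rather than names, so a real script would likely need a lookup table instead of prefix matching.

```python
import sys

# Hypothetical mapping from host-name prefix to rack; not from the slides.
RACK_BY_PREFIX = {
    "hive-": "/rack-m1-xlarge",      # cheaper nodes used for Hive preprocessing
    "impala-": "/rack-hi1-4xlarge",  # SSD nodes used for Impala queries
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host name, based on its prefix."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # Print one rack per host argument, in the same order as the arguments.
    for host in argv:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With two racks defined this way, HDFS's default block placement spreads replicas across more than one rack, which is consistent with keeping a copy of the data on each instance type.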

Interactive Performance
Problem
• Large tables take time to scan
• No indexes
• Need to deliver results in < 1 second
Solution – partitioning (duh!)
• Partitions are targeted to be between 100 and 200 MB
• The query log is your friend
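
The sizing rule above is simple arithmetic. As a sketch, the helper below estimates how many partitions a data set needs so that each lands near the middle of the 100 to 200 MB range. The 150 MB default and the function names are assumptions for illustration, and the calculation treats data as evenly distributed across partitions, which real data rarely is.

```python
import math

MB = 1024 * 1024


def partitions_needed(table_bytes, target_mb=150):
    """Number of equal partitions so each is roughly target_mb in size."""
    if table_bytes <= 0:
        return 0
    return max(1, math.ceil(table_bytes / (target_mb * MB)))


def in_target_range(partition_bytes, low_mb=100, high_mb=200):
    """True if a partition falls inside the 100-200 MB target window."""
    return low_mb * MB <= partition_bytes <= high_mb * MB


# Example: 10 TB of data split into equal partitions.
table_bytes = 10 * 1024 * 1024 * MB
count = partitions_needed(table_bytes)
avg_size = table_bytes / count
print(count, in_target_range(avg_size))
```

In practice the partition key decides the actual sizes, so a check like `in_target_range` is more useful when run against measured per-partition sizes than against an average.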

Summary
Impala can back your SAAS application
• We’re now running version 1.3
• We’re “spinning” 10 TB of data
• Delivering queries in < 2 seconds
We’re hiring – but you already knew that.
