Cloudera Impala
Real Time Query for HDFS and HBase
Alexander Alten-Lorenz, Cloudera, Inc.
Tuesday, July 2, 13
Beyond Batch
What is Impala
Capability
Architecture
Demo
Beyond Batch
For some things MapReduce is just too slow
Apache Hive:
MapReduce execution engine
High-latency, low throughput
High runtime overhead
Google realized this early on
Analysts wanted fast, interactive results
Dremel
Google paper (2010)
“scalable, interactive ad-hoc query system for
analysis of read-only nested data”
Columnar storage format
Distributed scalable aggregation
“capable of running aggregation queries over
trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
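To make the columnar storage point concrete, here is a minimal Python sketch, not Dremel's or Impala's actual format. The table, its column names, and the helper `to_columns` are made up for illustration. It shows why an aggregation over one column can skip the rest of the data when values are stored column by column.

```python
# Toy illustration of row-oriented vs. column-oriented layout.
# The table and its column names are invented for this example.
rows = [
    {"user": "a", "country": "DE", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "DE", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full to compute SUM(clicks).
total_from_rows = sum(record["clicks"] for record in rows)

# Column layout: only the "clicks" column is read.
total_from_columns = sum(columns["clicks"])
```

Both totals are the same; the difference is how much data has to be read to produce them.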
Impala: Goals
General-purpose SQL query engine for Hadoop
For analytical and transactional workloads
Support queries that take μs to hours
Run directly with Hadoop
Collocated daemons
Same file formats
Same storage managers (NN, metastore)
Impala: Goals
High performance
C++
runtime code generation (LLVM)
direct access to data (no MapReduce)
Retain user experience
easy for Hive users to migrate
100% open-source
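The idea behind runtime code generation can be illustrated with a small Python analogy. This is only a sketch of the general technique (specializing a query expression once instead of interpreting it for every row); it does not use LLVM and is not Impala's implementation. The predicate tree, the column names, and the helpers `interpret`, `to_source`, and `generate` are hypothetical.

```python
import operator

# Hypothetical predicate tree for: clicks > 4 AND country == "DE"
PREDICATE = ("and", (">", "clicks", 4), ("==", "country", "DE"))

OPS = {">": operator.gt, "==": operator.eq}


def interpret(node, row):
    """Walk the expression tree again for every row."""
    if node[0] == "and":
        return interpret(node[1], row) and interpret(node[2], row)
    op, column, literal = node
    return OPS[op](row[column], literal)


def to_source(node):
    """Translate the expression tree into Python source text."""
    if node[0] == "and":
        return f"({to_source(node[1])} and {to_source(node[2])})"
    op, column, literal = node
    return f"(row[{column!r}] {op} {literal!r})"


def generate(node):
    """Build the predicate once as a function specialized to this query."""
    return eval(f"lambda row: {to_source(node)}")


rows = [
    {"country": "DE", "clicks": 3},
    {"country": "DE", "clicks": 7},
    {"country": "US", "clicks": 9},
]

specialized = generate(PREDICATE)
interpreted_result = [r for r in rows if interpret(PREDICATE, r)]
generated_result = [r for r in rows if specialized(r)]
```

Both paths select the same rows; the generated function avoids the per-row tree walk and operator lookup, which is the kind of overhead that query-specific code generation is meant to remove.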
Impala: Capability
HiveQL (subset of SQL92)
select, project, join, union, subqueries,
aggregation, insert, order by (with limit)
DDL
Directly queries data in HDFS & HBase
Text files (compressed)
Sequence files (snappy/gzip)
Avro & Trevni
GA features
Impala: Capability
Familiar and unified platform
Uses Hive’s metastore
Submit queries via ODBC or the Beeswax Thrift API
Query is distributed to nodes with relevant data
Process-to-process data exchange
Kerberos authentication
No fault tolerance
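Distributing a query to the nodes that hold the relevant data can be sketched as a scatter-gather aggregation. The following Python example is a simplified, single-process model, not Impala code. The node names, the data placement, and the functions `partial_aggregate` and `run_query` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical placement of (country, clicks) records on three nodes.
node_data = {
    "node1": [("DE", 3), ("US", 5)],
    "node2": [("DE", 7)],
    "node3": [("US", 1), ("FR", 2)],
}


def partial_aggregate(records):
    """Work done locally on one node: sum clicks per country."""
    totals = Counter()
    for country, clicks in records:
        totals[country] += clicks
    return totals


def run_query(placement):
    """Coordinator: send the work to each node, then merge the partial results."""
    merged = Counter()
    for records in placement.values():
        merged.update(partial_aggregate(records))  # Counter.update adds counts
    return dict(merged)


result = run_query(node_data)
```

Only the small partial results travel to the coordinator, not the raw records. In this model a failed node would simply fail the query, which matches the "no fault tolerance" point above.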
Impala: Performance
Greater disk throughput
~100MB/sec/disk
I/O-bound workloads faster by 3-4x
Queries that require multiple MapReduce phases
in Hive are significantly faster in Impala (up to 45x)
Queries that run against in-memory cached data
see a significant speedup (up to 90x)
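As a back-of-the-envelope illustration of the ~100MB/sec/disk figure, the sketch below estimates the time for a full scan. The data volume, node count, and disks per node are assumed values chosen for the example, and the model assumes perfectly parallel reads with no CPU or network cost.

```python
def scan_seconds(data_mb, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Idealized full-scan time: data volume divided by aggregate disk throughput."""
    aggregate_throughput = nodes * disks_per_node * mb_per_sec_per_disk
    return data_mb / aggregate_throughput


# Assumed example: 1 TB (1,000,000 MB) on 10 nodes with 12 disks each.
# Aggregate throughput is 10 * 12 * 100 = 12,000 MB/sec.
estimate = scan_seconds(1_000_000, nodes=10, disks_per_node=12)
```

Under these assumptions the scan takes roughly 83 seconds, which gives a sense of what "I/O-bound" means for a query that has to read every byte.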
Impala: Architecture
impalad
runs on every node
handles client requests (ODBC, Thrift)
handles query planning & execution
statestored
provides name service
metadata distribution
used for finding data
Current limitations
1.0.1 (available since May 2013)
No SerDes
No User Defined Functions (UDFs)
impalad daemons read statestored metadata only at startup
Futures
DDL support (CREATE)
Rudimentary cost-based optimizer (CBO)
metadata distribution through statestored
Doug Cutting’s Trevni
Columnar storage format like Dremel’s
Impala + Trevni = Dremel superset
Demo
impala-user@cloudera.com
alexander@cloudera.com
@mapredit
mapredit.blogspot.com
Web: http://goo.gl/7sxdp

Cloudera Impala - HUG Karlsruhe, July 04, 2013