Cloudera Impala
Real Time Query for HDFS and HBase
Alexander Alten-Lorenz, Cloudera Inc.
Tuesday, July 2, 13

Cloudera Impala - HUG Karlsruhe, July 04, 2013


Low latency data processing with Impala

Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), JDBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
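Because both engines accept the same Hive SQL, one statement text could in principle be submitted to either of them. The following is a minimal Python sketch of that idea. It only assembles the query text and does not talk to a cluster. The helper `build_top_n_query`, the table `web_logs`, and the column `country` are hypothetical names chosen for illustration and are not part of any Impala or Hive API. The mandatory `LIMIT` mirrors the "order by (with limit)" note on the capability slide.

```python
def build_top_n_query(table, group_col, limit):
    """Return a HiveQL-style top-N aggregation as plain text.

    Illustrative helper only: it enforces the "ORDER BY needs LIMIT"
    restriction mentioned in the deck before building the statement.
    """
    if limit is None or limit <= 0:
        raise ValueError("ORDER BY needs a positive LIMIT in this sketch")
    return (
        f"SELECT {group_col}, COUNT(*) AS cnt "
        f"FROM {table} "
        f"GROUP BY {group_col} "
        f"ORDER BY cnt DESC "
        f"LIMIT {limit}"
    )


# Hypothetical table and column names.
query = build_top_n_query("web_logs", "country", 10)
print(query)
```

The resulting string would then be handed to whichever client is in use (ODBC, JDBC, or the Beeswax interface mentioned above); that step is left out here because it depends on the installation.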

Published in: Technology

Cloudera Impala - HUG Karlsruhe, July 04, 2013

  1. Cloudera Impala: Real Time Query for HDFS and HBase. Alexander Alten-Lorenz, Cloudera Inc. Tuesday, July 2, 13
  2. Beyond Batch; What is Impala; Capability; Architecture; Demo
  3. Beyond Batch
       • For some things MapReduce is just too slow
       • Apache Hive: MapReduce execution engine; high-latency, low throughput; high runtime overhead
       • Google realized this early on
       • Analysts wanted fast, interactive results
  4. Dremel
       • Google paper (2010)
       • “scalable, interactive ad-hoc query system for analysis of read-only nested data”
       • Columnar storage format
       • Distributed scalable aggregation
       • “capable of running aggregation queries over trillion-row tables in seconds”
       • http://research.google.com/pubs/pub36632.html
  5. Impala: Goals
       • General-purpose SQL query engine for Hadoop
       • For analytical and transactional workloads
       • Support queries that take μs to hours
       • Run directly with Hadoop: collocated daemons; same file formats; same storage managers (NN, metastore)
  6. Impala: Goals
       • High performance: C++; runtime code generation (LLVM); direct access to data (no MapReduce)
       • Retain user experience: easy for Hive users to migrate
       • 100% open-source
  7. Impala: Capability
       • HiveQL (subset of SQL92): select, project, join, union, subqueries, aggregation, insert, order by (with limit)
       • DDL
       • Directly queries data in HDFS & HBase: text files (compressed); sequence files (snappy/gzip); Avro & Trevni
       • GA features
  8. Impala: Capability
       • Familiar and unified platform
       • Uses Hive’s metastore
       • Submit queries via ODBC | Beeswax Thrift API
       • Query is distributed to nodes with relevant data
       • Process-to-process data exchange
       • Kerberos authentication
       • No fault tolerance
  9. Impala: Performance
       • Greater disk throughput: ~100 MB/sec/disk
       • I/O-bound workloads faster by 3-4x
       • Queries that require multiple MapReduce phases in Hive are significantly faster in Impala (up to 45x)
       • Queries that run against in-memory cached data see a significant speedup (up to 90x)
  10. Impala: Architecture
       • impalad: runs on every node; handles client requests (ODBC, Thrift); handles query planning & execution
       • statestored: provides name service; metadata distribution; used for finding data
  11. Impala: Architecture
  12. Impala: Architecture
  13. Impala: Architecture
  14. Impala: Architecture
  15. Current limitations
       • 1.0.1 (available since May 2013)
       • No SerDes
       • No user-defined functions (UDFs)
       • impalads only read statestored metadata at startup
  16. Futures
       • DDL support (CREATE)
       • Rudimentary cost-based optimizer (CBO)
       • Metadata distribution through statestored
       • Doug Cutting’s Trevni: columnar storage format like Dremel’s
       • Impala + Trevni = Dremel superset
  17. Demo
       • impala-user@cloudera.com
       • alexander@cloudera.com
       • @mapredit
       • mapredit.blogspot.com
       • Web: http://goo.gl/7sxdp
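The execution model described on the capability and architecture slides (a query is distributed to the nodes that hold the relevant data, and results move by process-to-process exchange) can be pictured as a scatter-gather aggregation. The following toy Python sketch shows that pattern only. It is not Impala's implementation, and the node names, sample rows, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical data placement: node name -> locally stored rows (country, bytes).
BLOCKS = {
    "node1": [("de", 10), ("us", 5)],
    "node2": [("de", 7), ("fr", 3)],
    "node3": [("us", 1)],
}


def local_aggregate(rows):
    """Partial SUM(bytes) GROUP BY country over one node's local rows."""
    partial = Counter()
    for country, nbytes in rows:
        partial[country] += nbytes
    return partial


def coordinate(blocks):
    """Run the partial aggregation per node, then merge the partial results."""
    total = Counter()
    for rows in blocks.values():
        # Counter.update adds the counts of the partial result to the total.
        total.update(local_aggregate(rows))
    return dict(total)


result = coordinate(BLOCKS)
print(result)
```

Each node only touches its own rows, and the coordinator only sees small per-group partial sums. That is the general reason a design like this avoids moving raw data across the network.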