Impala: Real-time Queries in Hadoop

Cloudera Impala

Justin Erickson | Product Manager

November 2012

Why Data Scientists Love Hadoop

•  Massive volumes of data

•  Data preparation & analytics in 1 environment

•  Highly flexible environment for creating & testing machine learning models

•  10% of the cost/TB under management

©2012 Cloudera, Inc. All Rights Reserved.

Hadoop Use Cases Moving to Real-Time

•  Already query Hadoop using Hive

•  Already load data into CDH every 90 mins or less

•  Already use HBase for real-time data access

Source: Cloudera customer survey August 2012

But Hadoop Isn't Fast Enough

•  Need faster queries on Hadoop data

•  Move data from Hadoop to RDBMS for interactive SQL

•  See value today in consolidating to a single platform

Source: Cloudera customer survey August 2012

Beyond Batch – The Next Stage for Hadoop

HADOOP TODAY IS TOO SLOW

•  MapReduce is batch

•  Simple queries can take minutes / tens of minutes

CURRENT DATA MANAGEMENT IS TOO COMPLEX

•  Optimized for rigid schemas & special purpose applications

•  Redundant data storage & processes

•  Very expensive systems: $20K-150K / TB

Cloudera Enterprise RTQ

Real-Time Query for Data Stored in Hadoop

Powered by Cloudera Impala.

•  Supports Hive SQL

•  4-30X faster than Hive over MapReduce

•  Supports multiple storage engines & file formats

•  Uses existing drivers, integrates with existing metastore, works with leading BI tools

•  Flexible, cost-effective, no lock-in

•  Deploy & operate with Cloudera Manager

Cloudera Now Powered by Impala

[Diagram: BEFORE IMPALA vs. WITH IMPALA, with layers labeled USER INTERFACE, BATCH PROCESSING, and REAL-TIME ACCESS]

•  Unified Storage:

–  Supports HDFS and HBase

–  Flexible file formats

•  Unified Metastore

•  Unified Security

•  Unified Client Interfaces:

–  ODBC, SQL syntax, Hue Beeswax

•  With Impala:

–  Real-time SQL queries

–  Native distributed query engine

–  Optimized for low latency

•  Provides:

–  Answers as fast as you can ask

–  The ability for everyone to ask questions of all data

–  Big data storage and analytics together

Impala beta features

Today (Cloudera Impala 0.1):

•  Nearly all of Hive's SQL, including insert, join, and subqueries

•  Query results 4-30X faster than Hive

•  Same open Hive metadata model => easy to create & change schema

•  Support for HDFS and HBase storage

•  HDFS file formats: TextFile, SequenceFile

•  HDFS compression: Snappy, GZIP, BZIP

•  ODBC driver and Hue Beeswax in common with Hive

•  Separate CLI from Hive

Next few months:

•  Support for Avro, RCFile & LZO compressed text

•  Additional OS support

•  Trevni columnar format

•  JDBC driver

•  DDL

•  Straggler handling

•  Increased join performance

Impala v0.1 SQL (HiveQL)

•  Select

–  Boolean, tinyint, smallint, int, bigint, float, double, timestamp, string

–  All, distinct

–  Subqueries (in from clause)

–  Where, group by, having

–  Order by (with limit initially)

–  Joins (left, right, full, outer), multi-table, subquery

–  Union all

–  Limit

–  External tables

–  Relational, arithmetic, logical operators

–  Math, collection, cast, date, conditional, string, timestamp built-ins (e.g. count, sum, cast, case, like, in, between, coalesce)

•  Insert into
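
To make the list above concrete, here is a minimal sketch of one query that combines several of the listed constructs: a subquery in the from clause, between, a join, group by, having, and order by with limit. The table and column names (page_views, users) are hypothetical. The query runs against Python's built-in sqlite3 module only so that the example is runnable anywhere. It is a stand-in, not Impala, and Impala v0.1 may differ in details.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine (assumption:
# this is NOT Impala; it only lets the query below run without a cluster).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (user_id INT, url TEXT, bytes INT);
CREATE TABLE users (user_id INT, country TEXT);
INSERT INTO page_views VALUES (1, '/a', 100), (1, '/b', 300), (2, '/a', 50), (3, '/c', 700);
INSERT INTO users VALUES (1, 'US'), (2, 'DE'), (3, 'US');
""")

# Hypothetical query using constructs from the slide: subquery in the FROM
# clause, BETWEEN, join, GROUP BY, HAVING, ORDER BY with LIMIT.
QUERY = """
SELECT u.country, COUNT(*) AS views, SUM(v.bytes) AS total_bytes
FROM (SELECT user_id, bytes FROM page_views WHERE bytes BETWEEN 50 AND 1000) v
JOIN users u ON v.user_id = u.user_id
GROUP BY u.country
HAVING SUM(v.bytes) > 0
ORDER BY total_bytes DESC
LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for country, views, total_bytes in rows:
    print(country, views, total_bytes)
```

The order by is paired with a limit here because the slide notes that order by initially requires one.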

Cloudera Impala Details

[Architecture diagram]

•  Clients: SQL App, ODBC (common Hive SQL and interface)

•  Shared services: Hive Metastore, YARN, HDFS NN, State Store (unified metadata and scheduler)

•  Each of three nodes: Query Planner, Query Coordinator, Query Exec Engine, running alongside HDFS DN and HBase

•  Annotations: Fully MPP Distributed; Local Direct Reads

Cloudera Impala Details

[Same architecture diagram; highlighted: Common Hive SQL and interface; SQL Request]

Cloudera Impala Details

[Same architecture diagram; highlighted: Unified metadata and scheduler]

Cloudera Impala Details

[Same architecture diagram; highlighted: Fully MPP Distributed]

Cloudera Impala Details

[Same architecture diagram; highlighted: Local Direct Reads]

Cloudera Impala Details

[Same architecture diagram; highlighted: SQL Results; In Memory Transfers]
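
The flow in these diagrams (a request is planned, fanned out to an exec engine on each node, run against local data, and merged in memory) can be illustrated with a toy scatter-gather sketch. Everything here is an assumption made for illustration: the function names exec_engine and coordinator, the sample rows, and the use of threads to stand in for separate machines. It is not Impala's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the rows stored locally on one node,
# as (country, bytes) pairs.
NODE_LOCAL_ROWS = [
    [("US", 100), ("DE", 50)],
    [("US", 300), ("FR", 20)],
    [("DE", 70), ("US", 700)],
]


def exec_engine(local_rows):
    """Runs on one node: aggregate only the rows that node holds locally."""
    partial = Counter()
    for country, nbytes in local_rows:
        partial[country] += nbytes
    return partial


def coordinator(nodes):
    """Fan the work out to every node, then merge the partial sums in memory."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(exec_engine, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return dict(total)


result = coordinator(NODE_LOCAL_ROWS)
print(result)
```

Each node returns only a small partial aggregate, so far less data crosses the network than if raw rows were shipped to a single machine.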

Impala and Hive

•  Shared with Hive:

–  Metadata (table definitions)

–  ODBC driver

–  Hue Beeswax

–  SQL syntax (HiveQL)

–  Flexible file formats

–  Machine pool

•  Improvements:

–  Purpose-built query engine direct on HDFS and HBase

–  No JVM and MapReduce

–  In-memory data transfers

–  Low-latency scheduler

–  Native distributed relational query engine

–  Trevni columnar format (after v0.1)

Advantages of Our Approach

•  No high-latency MapReduce batch processing

•  Local processing avoids network bottlenecks

•  No costly data format conversion overhead

•  All data immediately query-able

•  Single machine pool to scale

•  All machines available to both Impala and MapReduce

•  Single, open, and unified metadata and scheduler

[Diagram comparing alternative architectures, labeled MapReduce (Hive over MR), Remote Query (separate query nodes), and Side Storage (separate query engine), each shown against HDFS NN and DN nodes]

Google Dremel and Impala

•  What is Dremel:

–  Columnar storage for data with nested structures

–  Distributed scalable aggregation on top of that

•  Columnar storage in Hadoop: Trevni

–  New columnar format created by Doug Cutting

–  Stores data in appropriate native/binary types

–  Will also store nested structures similar to Dremel's ColumnIO

•  Distributed aggregation: Impala

•  Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)
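
As a rough illustration of why columnar storage in native types helps aggregation, the sketch below stores the same records column by column using Python's array module, so that a sum touches only the one column it needs. The sample records and variable names are hypothetical. This shows the general idea only. It is not the Trevni file format and it does not cover nested structures.

```python
from array import array

# Hypothetical records in row layout: (user_id, bytes, country).
rows = [(1, 100.0, "US"), (2, 50.5, "DE"), (3, 700.25, "US")]

# Column layout: one container per column, numeric columns in native binary types.
user_id = array("q", (r[0] for r in rows))  # signed 64-bit integers
nbytes = array("d", (r[1] for r in rows))   # 64-bit floats
country = [r[2] for r in rows]              # strings kept in a plain list

# An aggregate over one column reads only that column's values,
# not the other fields of every record.
total = sum(nbytes)
print(total)
```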

Benefits of Cloudera Impala

Real-Time Query for Data Stored in Hadoop

•  Get answers as fast as you can ask questions

•  Interactive analytics directly on source data

•  No jumping between data silos

•  Reduce duplicate storage with EDW

•  Reduce data movement for interactive analysis

•  Leverage existing tools and employee skills

•  Ask questions of all your data

•  No information loss from aggregation or conforming to relational schemas for analysis

•  Single metadata store from origination through analysis

•  No need to hunt through multiple data silos
