Parquet Update / UDFs in Impala
Nong Li, Software Engineer, Cloudera
Agenda
•  Parquet
•  File format description
•  Benchmark Results in Impala
•  Parquet 2.0
•  UDF/UDAs
Parquet
[Image-only slides: Parquet file format description]
Data Pages
•  Values are stored in data pages as a triple: Definition Level, Repetition Level and Value.
•  These are stored contiguously on disk => 1 seek to read a column regardless of nesting.
•  Data pages are stored with different encodings:
   •  Bit packing and Run Length Encoding (RLE)
   •  Dictionary for strings
      •  Extended to all types in Parquet 1.1
   •  Plain (little-endian encoding) for native types.
Parquet 2.0
•  Additional Encodings
   •  Group VarInt (for small ints)
   •  Improved string storage format
   •  Delta Encoding (for strings and ints)
•  Additional Metadata
   •  Sorted files
   •  Page/Column/File Statistics
•  Expected to further reduce on-disk size and allow for skipping values on the read path.
Hardware Setup
•  10 Nodes
•  16 Core Xeon
•  48 GB RAM
•  12 Disks
•  CDH 4.3
•  Impala 1.1
TPC-H lineitem table @ 1TB scale factor
[Bar chart: Size (GB), scale 0–800, for Text, Text w/ LZO, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy, and Seq w/ Gzip]
Query Times on TPC-H lineitem table
[Bar chart, scale 0–800: 1 Column, 3 Columns, 5 Columns, 16 (all) Columns, 5 Columns with 3 Clients, TPC-H Q1 (7 Columns), and Bytes Read Q1 (GB) — series: Text, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy]
Query Times on TPCDS Queries
[Bar chart: Seconds, scale 0–500, for Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, Q96 — series: Text, Seq w/ Snappy, RC w/ Snappy, Parquet w/ Snappy]
Average Times (Geometric Mean)
•  Text: 224 seconds
•  Seq Snappy: 257 seconds
•  RC Snappy: 150 seconds
•  Parquet: 61 seconds
Agenda
•  Parquet
•  File format description
•  Benchmark Results in Impala
•  What's Next
•  UDF/UDAs (Work in Progress)
Terminology
•  UDF: Tuple -> Scalar
   user-defined function
   •  E.g. Substring
•  UDA/UDAF: {Tuple} -> Scalar
   user-defined aggregate function
   •  E.g. Min
•  UDTF: {Tuple} -> {Tuple}
   user-defined table function
Impala 1.2
•  Support Hive UDFs (Java)
   •  Existing Hive jars will run without a recompile.
•  Add Impala (native) UDFs and UDAs.
   •  New interface designed to execute as efficiently as possible for Impala.
   •  Similar interface to Postgres UDFs/UDAs
•  UDF/UDA registered for the Impala service in the metadata catalog
   •  i.e. CREATE FUNCTION / CREATE AGGREGATE
Example UDF

// This UDF adds two ints and returns an int.
IntVal AddUdf(UdfContext* context,
              const IntVal& arg1,
              const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
DDL

The CREATE statement will need to specify the UDF/UDA signature, the location of the binary, and the symbol for the UDF function.

CREATE FUNCTION substring(string, int, int)
RETURNS string LOCATION "hdfs://path"
"com.me.Substring"

CREATE FUNCTION log(anytype) RETURNS anytype
LOCATION "hdfs://path2" "Log"
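To tie the DDL to the earlier AddUdf example, a hypothetical registration and call, following the LOCATION/symbol pattern shown above (the path, function name, and symbol are illustrative):

-- Register the native AddUdf from the Example UDF slide
-- (hypothetical library path and symbol).
CREATE FUNCTION add_udf(int, int) RETURNS int
LOCATION "hdfs://path/libudf.so" "AddUdf";

-- Call it like any built-in function.
SELECT add_udf(1, 2);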
  
UDFs
•  Support for variadic args
•  Support for polymorphic types
UDAs
•  UDA must implement typical state machine:
   •  Init()
   •  Update()
   •  Serialize()
   •  Merge()
   •  Finalize()
•  Data movement handled by Impala
UDA Example

// This is a sample of implementing the COUNT aggregate function.

void Init(UdfContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void Update(UdfContext* context, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void Merge(UdfContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal Finalize(UdfContext* context, const BigIntVal& val) {
  return val;
}
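The deck names CREATE AGGREGATE but does not show its syntax. A hypothetical registration for the COUNT-style UDA above, assuming the per-entry-point clause form Impala later shipped (path and names illustrative):

-- Hypothetical registration of the native UDA above; the exact
-- CREATE AGGREGATE syntax is not shown in the deck.
CREATE AGGREGATE FUNCTION my_count(int) RETURNS bigint
LOCATION "hdfs://path/libuda.so"
INIT_FN="Init" UPDATE_FN="Update" MERGE_FN="Merge" FINALIZE_FN="Finalize";

SELECT my_count(l_linenumber) FROM lineitem;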
  
Runtime Code Generation
•  Impala uses LLVM to, at runtime, generate code to run the query.
   •  Takes into account constants that are only known after query analysis.
   •  Greatly improves CPU efficiency
•  Native UDFs/UDAs can benefit from this as well.
   •  Instead of providing the UDF/UDA as a shared object, compile it (with Clang) with an additional flag to LLVM IR.
   •  The IR will be integrated with the query execution.
   •  No function call overhead for UDFs/UDAs
Limitations
•  Hive UDAs/UDTFs not supported
•  No UDTFs in native interface
•  Can't run out of process
   •  Native interface is designed to support this; will be able to run without a recompile
   •  We're planning to address this in Impala 1.3
Thanks!
•  We'd love your feedback for UDFs/UDAs
•  Questions?
  
Performance Considerations for Cloudera Impala
Henry Robinson
henry@cloudera.com / @henryr
Impala Meetup 2013-08-20
Agenda
● The basics: Performance Checklist
● Review: How does Impala execute queries?
● What makes queries fast (or slow)?
● How can I debug my queries?
Impala Performance Checklist
● Verify – Simple count(*) query on a relatively big table
and verify:
○ Data locality, block locality, and NO check-summing (“Testing Impala
Performance”)
○ Optimal IO throughput of HDFS scans (typically ~100 MB/s per disk)
● Stats – BOTH table and column stats, especially for:
○ Joining two large tables
○ Insert into as select through Impala
● Join table ordering – will be automatic in the Impala 2.0
wave. Until then:
○ Largest table first
○ Then most selective to least selective
● Monitor - monitor Impala queries to pinpoint slow
queries and drill into potential issues
○ CM 4.6 adds query monitoring
○ CM 5.0 will have the next big enhancements
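As a starting point for the Verify step, a minimal smoke test might be a full scan of a large table (table name is illustrative); locality and scan throughput are then checked against the query profile:

-- Full scan of a large table; compare runtime and bytes read
-- against the expected aggregate disk throughput of the cluster.
SELECT count(*) FROM big_table;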
Part 1: How does Impala
execute queries?
The basic idea
● Every Impala query runs across a cluster of
multiple nodes, with lots of available CPU
cores, memory and disk
● Best query speeds usually come when every
node in the cluster has something to do
● Impala solves two basic problems:
○ Figure out what every node should do (compilation)
○ Make them do it really quickly! (execution)
Query compilation
● a.k.a. ‘figuring out what every node should do’
● Impala compiles a SQL query into a plan describing
what to execute, and where
● A plan is shaped like a tree. Data flows up from the
leaves of the tree to the root.
● Each node in the tree is a query operator
● Impala chops this tree up into plan fragments
● Each node gets one or more plan fragments
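To see the compiled plan for yourself, EXPLAIN prints it; a minimal illustration (table and column names are made up, and the output depends on version and statistics):

-- Print the compiled plan: a tree of operators (scans, aggregations,
-- exchanges) broken into plan fragments assigned to nodes.
EXPLAIN SELECT store, sum(sales) FROM transactions GROUP BY store;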
Query execution
● Once started, each query operator can run
independently of any other operator
● Every operator can be doing something
at the same time
● This is the not-so-secret sauce for all
massively parallel query execution engines
Part 2: What makes
queries fast (or... slow)?
What determines performance?
● Data size
● Per-operator execution efficiency
● Available parallelism
● Available concurrency
● Hardware
● Schema design and file format
Data size
● More data means more work
● Not just the size of the disk-based data at plan leaves,
but size of internal data flowing in to any operator
● How can you help?
○ Partition your data
○ SELECT with LIMIT in subqueries
○ Push predicates down
○ Use correct JOIN order
■ Gather table statistics
○ Use the right file format
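A sketch combining several of these suggestions, assuming a hypothetical events table partitioned by dt: the partition filter prunes data, the status predicate is evaluated at the scan, and the LIMIT bounds the rows flowing out of the subquery:

-- Partition filter (dt) skips partitions; the status predicate is
-- applied at the scan; LIMIT caps the data flowing upward.
SELECT user_id, count(*) AS hits
FROM (
  SELECT user_id
  FROM events
  WHERE dt = '2013-08-20' AND status = 200
  LIMIT 1000000
) t
GROUP BY user_id;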
Table Ordering
● Tables are joined in the order listed in the
FROM clause
● Impala uses left-deep trees for nested joins
● “Largest” table should be listed first
○ largest = returning most rows before join filtering
○ In a star schema, this is often the fact table
● Then list tables in order of most selective
join filter to least selective
○ Filter the most rows as early as possible
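For example, a star-schema query would list the fact table first, then dimension tables from most to least selective join filter (table names hypothetical):

-- sales (the fact table, returning the most rows) comes first;
-- dim_date carries the most selective filter, so it precedes dim_store.
SELECT s.store_id, sum(s.amount)
FROM sales s
  JOIN dim_date d ON s.date_id = d.date_id
  JOIN dim_store st ON s.store_id = st.store_id
WHERE d.fiscal_year = 2013
GROUP BY s.store_id;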
Join Types
● Two types of join strategy are supported
○ Broadcast
○ Shuffle/Partitioned
● Broadcast
○ Each node receives a full copy of the right table
○ Per node memory usage = size of right table
● Shuffle
○ Both sides of the join are partitioned
○ Matching partitions sent to same node
○ Per node memory usage = 1/nodes x size of right table
● Without column statistics, all joins are broadcast
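Impala also accepts per-join strategy hints; a sketch assuming the [SHUFFLE] / [BROADCAST] hint syntax, useful when the right-hand table is too large to broadcast to every node:

-- Force a partitioned (shuffle) join instead of broadcasting
-- the large right-hand table.
SELECT count(*)
FROM big_fact f JOIN [SHUFFLE] big_dim d ON f.k = d.k;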
Per-operator execution efficiency
● Impala is fast, and getting faster
● LLVM-based improvements
● More efficient disk scanners
● More modern algorithms from the DB
literature
● How can you help?
○ Upgrade to the latest version
Available parallelism
● Parallelism: number of resources available to use at
once
● More hardware means more parallelism
● Impala will take advantage of more cores, disks and
memory where possible
● Easiest (but most expensive!) way to improve
performance of large class of queries
● You can scale up incrementally
Available concurrency
● Concurrency: how well can a query take advantage of
available parallelism?
● Impala will take care of this mostly for you
● But some operators naturally don’t parallelise well in
certain conditions
● For example: joining two huge tables together.
○ The hash-node operators have to wait for one side to be read
completely before reading much of the other side
● How you can help:
○ Read the profiles, look for obvious bottlenecks, rephrase if possible
Hardware
● Designed for modern hardware
○ Leverages SSE 4.2 (Intel Nehalem or newer)
○ LLVM Compiler Infrastructure
○ Runtime Code Generation
○ In-memory execution pipelines
● Today’s hardware
○ 2 x Xeon E5 6 core CPUs
○ 12 x 3 TB HDD
○ 128 GB RAM
● How you can help:
○ Use the supported platforms, with Cloudera’s
packages
Schema design
● PARTITION BY is an easy win
● In general, string is slower than fixed-width
types (particularly for aggregations etc)
● File formats are crucial
○ Experiment with Parquet for performance
○ Avoid text
Supported File Formats
● Various HDFS file formats
○ Text File (read/write)
○ Avro (read)
○ SequenceFile (read)
○ RCFile (read)
○ ParquetFile (read/write)
● Various compression codecs
○ Snappy (ParquetFile, RCFile, SequenceFile, Avro)
○ LZO (Text)
○ Bzip (ParquetFile, RCFile, SequenceFile, Avro)
○ Gzip (ParquetFile, RCFile, SequenceFile, Avro)
● HBase also supported
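Since Parquet is read/write from Impala, a common pattern is converting an existing text table with a CREATE TABLE plus INSERT ... SELECT (table and column names are hypothetical; PARQUETFILE is assumed to be the keyword in this Impala generation):

-- Create a Parquet copy of a text-format table and load it.
CREATE TABLE logs_parquet (ts STRING, status INT, url STRING)
STORED AS PARQUETFILE;

INSERT OVERWRITE TABLE logs_parquet
SELECT ts, status, url FROM logs_text;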
Partitioning Considerations
● Single largest performance feature
○ Skips unnecessary data
○ Requires queries contain partition keys as filters
● Choose a reasonable number of partitions
○ Lots of small files becomes an issue
○ Metadata overhead on NameNode
○ Metadata overhead for Hive Metastore
○ Impala caches this, but first load may take long
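Putting partitioning together, a minimal sketch (schema and values hypothetical): only the partition matching the filter is read.

-- The partition key 'dt' becomes a directory per value; queries
-- that filter on it skip every other partition.
CREATE TABLE clicks (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUETFILE;

INSERT OVERWRITE TABLE clicks PARTITION (dt = '2013-08-20')
SELECT user_id, url FROM clicks_staging WHERE dt = '2013-08-20';

SELECT count(*) FROM clicks WHERE dt = '2013-08-20';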
Part 3: Debugging queries
The Debug Pages
● Every impalad exports a lot of useful
information on http://<impalad>:25000 (by
default), including:
○ Last 25 queries
○ Active sessions
○ Known tables
○ Last 1MB of the log
○ System metrics
○ Query profiles
● Information-dense - not for the faint of heart!
Thanks! Questions?
Try It Out!
● Apache-licensed open source
○ Impala 1.1 released 7/24/2013
○ Impala 1.0 GA released 4/30/2013
● Questions/comments?
○ Download: cloudera.com/impala
○ Email: impala-user@cloudera.org
○ Join: groups.cloudera.org
○ MeetUp: meetup.com/Bay-Area-Impala-Users-Group/
