Graham Mossman - SQL and high performance computing on Hadoop

SQL and high performance computing on Hadoop
Graham Mossman, Senior Solution Engineer, EXASOL
© 2014 EXASOL AG

I Love My Lawnmower ...
... because it cuts my grass well

But ...
... it’s quite a struggle cutting my hedge

And ...
... it isn’t good at making apple sauce

And don’t even think about ...
... using it to cut hair

Hadoop today is …
• Still Open Source!
• Began with HDFS and Map/Reduce
• Now comprises a number of additional technologies
  • File systems (e.g. Tachyon)
  • Cluster Managers (e.g. YARN + Mesos)
  • Execution Engines (e.g. Tez, Spark etc.)
  • Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)

Hadoop With Everything
• Hadoop was invented to more easily distribute the Nutch and Lucene applications across a cluster of machines.
  • Map/Reduce – distributed processing
  • HDFS – distributed file system
• Began to be used for … just about everything.
  • But not all processing tasks are like indexing the Internet
• Hadoop started to attract criticism
  • But usually when it was being used for something it wasn’t designed for

Definitely NOT jobs for Hadoop
• Word processing
• Payroll system
• Anything on a single computer
• Anything with “small” data

Analytical Queries
• “GROUP BY” logic
  • i.e. not concerned with individual data items
• Analytical Functions
  • MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
• Table joins, nested sub-queries
Usually short-running, ad-hoc and submitted many at a time.
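As a rough illustration of the query shape described above, the sketch below runs one analytical query (a join, GROUP BY aggregates and a nested sub-query) through Python's built-in sqlite3 module. The `stores` and `sales` tables and their contents are hypothetical and exist only for this example; SQLite stands in for whatever SQL engine is actually in use.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales  (store_id INTEGER, amount REAL);
    INSERT INTO stores VALUES (1, 'north'), (2, 'north'), (3, 'south');
    INSERT INTO sales  VALUES (1, 10.0), (1, 30.0), (2, 20.0), (3, 5.0), (3, 15.0);
""")

# GROUP BY logic with aggregate functions over a table join. The nested
# sub-query keeps only regions whose total exceeds the average single sale.
query = """
    SELECT st.region,
           COUNT(*)       AS n_sales,
           MIN(sa.amount) AS smallest,
           MAX(sa.amount) AS largest,
           SUM(sa.amount) AS total
    FROM sales sa
    JOIN stores st ON st.store_id = sa.store_id
    GROUP BY st.region
    HAVING SUM(sa.amount) > (SELECT AVG(amount) FROM sales)
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result has one row per region rather than one per sale, which is the sense in which such queries are "not concerned with individual data items".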

Map/Reduce and HDFS: the wrong tools for Analytics?
• Queries tend to be short: fault tolerance is less important
  • If chance of failure in a 5 hour batch is 1 in 300
  • Chance of failure in a 5 second query is 1 in 1,000,000
• Queries tend to be short: start-up time is significant
  • A 20 second start-up time is NOT OK on a 5 second query
• A number of projects started to address these issues
  • e.g. “Hot containers” in Hive on Tez to reduce start-up time
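The two figures on this slide can be checked with a little arithmetic. The sketch below assumes failures arrive at a constant rate over time, and that the 20 seconds of start-up come on top of 5 seconds of query work; both are assumptions of this example rather than statements from the slide, and the function names are made up for illustration.

```python
import math

def failure_probability(p_long: float, long_seconds: float, short_seconds: float) -> float:
    """Scale a failure probability from a long job down to a short one.

    Assumes a constant failure rate over time (an assumption of this sketch).
    """
    rate = -math.log1p(-p_long) / long_seconds      # failures per second
    return -math.expm1(-rate * short_seconds)       # 1 - exp(-rate * t)

def startup_share(startup_seconds: float, query_seconds: float) -> float:
    """Fraction of total elapsed time that is spent on start-up."""
    return startup_seconds / (startup_seconds + query_seconds)

# 1-in-300 chance of failure over a 5 hour batch, rescaled to a 5 second query.
p_query = failure_probability(1 / 300, 5 * 3600, 5)
odds = 1 / p_query
share = startup_share(20, 5)

print(f"5 second query fails roughly 1 time in {odds:,.0f}")
print(f"start-up accounts for {share:.0%} of the elapsed time")
```

Under these assumptions the short query fails about once in a million runs, in line with the slide, while start-up takes up most of the elapsed time.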

Map/Reduce: the wrong language for Analytics?
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Stage 0: Map-Shuffle-Reduce
Mapper(row) {
  fields = row.split("\t")
  emit(fields[0], fields[1]);
}
Reducer(key, values) {
  sum = 0;
  for (value in values) {
    sum += value;
  }
  emit(key, sum);
}
Stage 1: Map-Shuffle
Mapper(row) {
  ...
  emit(page_views, page_name);
}
... shuffle
Stage 2: Local
data = open("stage1.out")
for (i in 0 to 10) {
  print(data.getNext())
}
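For readers who want to run the three stages, here is a minimal single-process sketch in Python. It only imitates the map, shuffle and reduce steps in memory; the sample rows are hypothetical, and nothing here uses a real Hadoop API.

```python
from collections import defaultdict

# Hypothetical tab-separated rows: page_name <TAB> page_views
rows = ["Hadoop\t5", "SQL\t7", "Hadoop\t3", "Hive\t1", "SQL\t2"]

def mapper(row):
    """Stage 0 map: split a row and emit (page_name, page_views)."""
    fields = row.split("\t")
    yield fields[0], int(fields[1])

def reducer(key, values):
    """Stage 0 reduce: sum all view counts for one page."""
    yield key, sum(values)

# Stage 0: map, shuffle (group values by key), reduce.
shuffled = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        shuffled[key].append(value)
totals = [pair for key, values in shuffled.items() for pair in reducer(key, values)]

# Stage 1: re-key on the view count and sort descending (the second shuffle).
by_views = sorted(((views, name) for name, views in totals), reverse=True)

# Stage 2: read the first ten results locally.
top10 = [(name, views) for views, name in by_views[:10]]
print(top10)
```

Even in this toy form, the grouping, sorting and limiting each have to be spelled out by hand.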

Equivalent in SQL
SELECT page_name,
       SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
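To try the query without a cluster, it can be run against an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for the real engine, and the sample rows in `wikistats` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
# Hypothetical sample rows; the real wikistats data is not part of the slides.
conn.executemany(
    "INSERT INTO wikistats VALUES (?, ?)",
    [("Hadoop", 5), ("SQL", 7), ("Hadoop", 3), ("Hive", 1), ("SQL", 2)],
)

# The query from the slide, unchanged apart from whitespace.
top_pages = conn.execute("""
    SELECT page_name,
           SUM(page_views) views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10
""").fetchall()
print(top_pages)
```

The engine decides how to group, sort and limit; the query only states the result that is wanted.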

The SQL language
• Portable
• Well-defined standards exist
• No detailed knowledge of the platform required
  • e.g. you don’t need to manage memory
• SQL is assumed by a lot of reporting tools
• Widely used and understood even by non-technical people

I’m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
  • “Which part of STRUCTURED Query Language don’t you understand …?!”
• All I’m saying is that it is an excellent language for analytical queries.
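Sessionising is awkward in plain SQL because whether a row starts a new session depends on the row before it. In a procedural language it is a short loop, as the sketch below suggests. The 30 minute inactivity gap, the `sessionise` function and the sample events are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity gap that ends a session; 30 minutes is a common
# convention, not a value taken from the slides.
SESSION_GAP = timedelta(minutes=30)

def sessionise(events):
    """Group (user, timestamp) events into per-user sessions.

    Returns {user: [[timestamps of session 1], [timestamps of session 2], ...]}.
    """
    per_user = defaultdict(list)
    for user, ts in events:
        per_user[user].append(ts)

    sessions = {}
    for user, stamps in per_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > SESSION_GAP:
                user_sessions.append([ts])      # gap too long: start a new session
            else:
                user_sessions[-1].append(ts)    # continue the current session
        sessions[user] = user_sessions
    return sessions

# Hypothetical weblog events.
t0 = datetime(2014, 1, 1, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=10)),
    ("alice", t0 + timedelta(hours=2)),
    ("bob", t0 + timedelta(minutes=5)),
]
result = sessionise(events)
print({user: len(s) for user, s in result.items()})
```

Here the third event for the first user falls outside the gap and so opens a second session.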

Hadoop could handle SQL (via Hive), but historically …
• High Latency
• Restricted SQL options
• All but simple table joins were difficult
• Little support for compression & indexing
• Merv Adrian (Gartner Research - 2014)
  • “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
• Result: EVERYTHING looked good compared to Hive

Everyone still likes to compare themselves to Hive

EXASOL being no exception!

Hive continues to be improved …
• Completed
  • Views (HIVE-1143)
  • Partitioned Views (HIVE-1941)
  • Storage Handlers (HIVE-705)
  • HBase Integration
  • HBase Bulk Load
  • Locking (HIVE-1293)
  • Indexes (HIVE-417)
  • Bitmap Indexes (HIVE-1803)
  • Filter Pushdown (HIVE-279)
  • Table-level Statistics (HIVE-1361)
  • Dynamic Partitions
  • Binary Data Type (HIVE-2380)
  • Decimal Precision and Scale Support
  • HCatalog
  • HiveServer2 (HIVE-2935)
  • Column Statistics in Hive (HIVE-1362)
  • List Bucketing (HIVE-3026)
  • Group By With Rollup (HIVE-2397)
  • Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  • Optimizing Skewed Joins (HIVE-3086)
  • Correlation Optimizer (HIVE-2206)
  • Hive on Tez (HIVE-4660)
  • Vectorized Query Execution (HIVE-4160)
• In Progress
  • Atomic Insert/Update/Delete (HIVE-5317)
  • Transaction Manager (HIVE-5843)
  • Cost Based Optimizer in Hive (HIVE-5775)
• Proposed
  • Spatial Queries
  • Theta Join (HIVE-556)
  • JDBC Storage Handler
  • MapJoin Optimization
  • Proposal to standardize and expand Authorization in Hive
  • Dependent Tables (HIVE-3466)
  • AccessServer
  • Type Qualifiers in Hive
  • MapJoin & Partition Pruning (HIVE-5119)
  • SQL Standard based secure authorization (HIVE-5837)
  • Updatable Views (HIVE-1143)
  • Hive on Spark (HIVE-7292)

The dream data architecture for analytics …
• Based on the SQL language
• but leverages Hadoop’s extreme scalability
• and Hadoop’s fault tolerance
• while not compromising on speed.
Could it please also have some maturity? And be easy to use?

The current reality
• SQL on SQL, which is arguably
  • Less scalable
  • Less fault tolerant
  • Less good with unstructured data
• SQL on Hadoop, which is arguably
  • Less mature
  • Less easy to use
  • Slower

Choices for SQL and Hadoop
• SQL AND HADOOP
  • A Connector
• HADOOP ON SQL
  • User Defined Functions
• SQL ON HADOOP
  • Something like Hive, but better

Option 1 – SQL AND HADOOP
Run SQL-on-SQL and Hadoop-on-Hadoop and use a connector to join the two systems
Pros
• Minimal impact (SQL and Hadoop worlds can function as before)
• Easier to implement
Cons
• Network!
• Challenge of optimising across two technologies

Option 2 – HADOOP ON SQL
• Bring Map/Reduce into the Parallel database
• For example using Java User Defined Functions
select my_java_map_function(words) a_word,
       count(*) word_count
from DOCUMENTS
group by 1
• Doesn’t benefit from Hadoop’s storage advantages
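The idea of calling user code from inside a SQL query can be tried on a small scale with SQLite's Python user-defined functions. This is a deliberately simplified stand-in, not the Java UDF mechanism of a parallel database: the hypothetical `first_word` function returns one value per row, whereas the map function on the slide would presumably emit one row per word.

```python
import sqlite3

def first_word(text):
    """Hypothetical scalar UDF: return the lower-cased first word of a string."""
    words = text.split()
    return words[0].lower() if words else None

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("first_word", 1, first_word)

conn.execute("CREATE TABLE documents (words TEXT)")
# Hypothetical sample documents.
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("Hadoop scales out",), ("SQL is portable",), ("hadoop stores files",)],
)

# The UDF runs inside the query; the engine handles grouping and counting.
counts = conn.execute("""
    SELECT first_word(words) AS a_word,
           COUNT(*)          AS word_count
    FROM documents
    GROUP BY first_word(words)
    ORDER BY a_word
""").fetchall()
print(counts)
```

The custom logic lives in the function, while the grouping and counting stay in SQL.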

Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
• Impala (Cloudera)
• Stinger (Hortonworks)
• Presto (Facebook)
• SparkSQL (UC Berkeley)
• HAWQ (Pivotal)
• BigSQL (IBM)
• Apache Phoenix (for HBase)
• Apache Tajo
• Apache Drill
• etc etc etc …
AND DON’T FORGET HIVE!

Four possible market outcomes…
• Hadoop and SQL databases are on a collision course – only one will survive
  • No sign of that so far
• They are complementary – both will survive
  • Probably - the challenge is how to make them work together
• They will merge and become one
  • Some indications this is already starting to happen
• Something even more amazing will come along and replace them both
  • Sometimes this happens – Spark?

My Personal Opinionated Opinion
Better to use a tool that has been made for the job
A purpose-built tool will always beat one made originally for another purpose.
