Spark SQL Deep Dive @ Melbourne Spark Meetup
1. Spark SQL Deep Dive
Michael Armbrust
Melbourne Spark Meetup – June 1st 2015
2. What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop, included in all major distros
Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
3. Spark Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory
or disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
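A minimal sketch of this model, assuming an existing SparkContext named sc (as in the shell):

// Transformations (filter, map) lazily build a lineage of RDDs;
// a lost cached partition can be rebuilt from that lineage.
val nums    = sc.parallelize(1 to 1000000)
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)
squares.cache()               // keep in memory across the cluster
println(squares.count())      // an action triggers the computation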
4. More than Map & Reduce
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
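A small sketch using a few of these beyond map and reduce; sc is an existing SparkContext and the data is made up:

val pets   = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
val owners = sc.parallelize(Seq(("cat", "alice"), ("dog", "bob")))

pets.reduceByKey(_ + _)        // ("cat", 3), ("dog", 1)
    .join(owners)              // ("cat", (3, "alice")), ("dog", (1, "bob"))
    .sortByKey()
    .collect()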
5. On-Disk Sort Record: Time to sort 100 TB

2013 Record: Hadoop – 72 minutes on 2100 machines
2014 Record: Spark – 23 minutes on 207 machines
Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org
6. Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE: Tencent (1 PB+ /day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1 PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1 PB)
MOST INTERESTING APP: Jeremy Freeman – Mapping the Brain at Scale (with lasers!)
LARGEST CLUSTER: Tencent (8000+ nodes)

Based on Reynold Xin’s personal knowledge
7. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()          // Action
messages.filter(_.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one block of lines (Block 1-3), builds and caches its partition of messages (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
14. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)

[Charts: # of commits per month and # of contributors, 2014-03 through 2015-06]
15. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
16. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
17. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
> Bindings in Python, Scala, and Java
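The JDBC route goes through Spark SQL's HiveServer2-compatible Thrift server. A minimal sketch of querying it from plain JDBC, assuming the server has been started (e.g. with sbin/start-thriftserver.sh) on its default port and the Hive JDBC driver is on the classpath:

import java.sql.DriverManager

object JdbcExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")   // register the Hive JDBC driver
    // Connect to the Thrift JDBC server on its default port.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM hiveTable")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()
  }
}

Any BI tool that speaks JDBC/ODBC connects the same way, just with its own driver configuration dialog.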
19. SQL: The whole story
Create and Run Spark Programs Faster:
> Write less code
> Read less data
> Let the optimizer do the hard work
20. DataFrame
noun – [dey-tuh-freym]
1. A distributed collection of rows
organized into named columns.
2. An abstraction for selecting, filtering,
aggregating and plotting structured
data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD
(cf. Spark < 1.3).
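A minimal sketch of the verbs in this definition (select, filter, aggregate), assuming a sqlContext and a small JSON file of people records; the path and column names are illustrative:

// Build a DataFrame from JSON, then select, filter and aggregate it.
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.filter(df("age") > 21)        // filtering
  .select("name", "age")         // selecting
  .groupBy("name")               // aggregating
  .count()
  .show()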
21. Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Diagram: built-in sources ({ JSON }, JDBC, …) and external sources, and more…]
22. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json")

df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData")
23. Write Less Code: Input & Output
(Same read/write example as above.)
read and write functions create new builders for doing I/O
24. Write Less Code: Input & Output
(Same read/write example as above.)
Builder methods are used to specify:
• Format
• Partitioning
• Handling of existing data
• and more
25. Write Less Code: Input & Output
(Same read/write example as above.)
load(…), save(…) or saveAsTable(…) finish the builder chain and perform the actual I/O
26. ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.git")
  .option("url", "https://github.com/apache/spark.git")
  .option("numPartitions", "100")
  .option("branches", "master,branch-1.3,branch-1.2")
  .load()
  .repartition(1)
  .write
  .format("json")
  .save("/home/michael/spark.json")
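format("com.databricks.spark.git") is resolved to a data source implementation. A rough sketch of what a custom source can look like under the Spark 1.3-era Data Source API; the package, class, and "rows" option below are made up for illustration and are not the real spark-git source:

package com.example.constant   // hypothetical package; used via .format("com.example.constant")

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Spark looks for a DefaultSource class in the package named by format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new ConstantRelation(parameters("rows").toInt)(sqlContext)
}

// A relation that produces n single-column rows; a real source would read external data.
class ConstantRelation(n: Int)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}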
27. Write Less Code: Powerful Operations
Common operations can be expressed concisely as calls to the DataFrame API (see the sketch below):
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
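A minimal sketch combining the four operations, assuming a sqlContext and two hypothetical tables (users, events) with the columns used below:

import org.apache.spark.sql.functions.avg

val users  = sqlContext.table("users")                    // hypothetical tables
val events = sqlContext.table("events")

users.join(events, users("id") === events("user_id"))     // joining different data sources
  .filter(users("age") > 21)                               // filtering
  .select(users("city"), events("duration"))               // selecting required columns
  .groupBy(users("city"))                                  // aggregation
  .agg(avg(events("duration")))
  .collect()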
28. Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
29. Write Less Code: Compute an Average

Using RDDs

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames

sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()

Using SQL

SELECT name, avg(age)
FROM people
GROUP BY name
30. Not Just Less Code: Faster Implementations

[Bar chart: time to aggregate 10 million int pairs (secs), 0-10 scale, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
31. Demo: Data Frames
Using Spark SQL to read, write, slice and dice your data using simple functions
32. Read Less Data
Spark SQL can help you read less data automatically (sketched below):
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)
• Skipping data using statistics (e.g., min, max)
• Pushing predicates into storage systems (e.g., JDBC)
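A rough sketch of the partitioning and predicate-pushdown points, assuming a DataFrame df with year, month and value columns (all hypothetical):

// Write partitioned Parquet: directories like /data/events/year=2014/month=2/...
df.write
  .format("parquet")
  .partitionBy("year", "month")
  .save("/data/events")

// A filter on the partition columns lets Spark SQL prune whole directories,
// and remaining predicates can be pushed into the Parquet reader.
sqlContext.read.format("parquet").load("/data/events")
  .filter("year = 2014 AND month = 2")
  .count()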
33. Optimization happens as late as possible, therefore Spark SQL can optimize across functions.
34.
def add_demographics(events):
    u = sqlCtx.table("users")                      # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)      # Join on user_id
        .withColumn("city", zipToCity(u.zip)))     # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table – the join is expensive; only the relevant users need to be joined]
[Physical Plan: join(scan(events), filter(scan(users)))]
35.
def add_demographics(events):
    u = sqlCtx.table("users")                      # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)      # Join on user_id
        .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table]
[Physical Plan: join(scan(events), filter(scan(users)))]
[Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))]
36. Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: Pipeline Model – df0 →tokenizer→ df1 →hashingTF→ df2 →lr→ df3, producing lr.model]
37. So how does it all work?
38. Plan Optimization & Execution

[Diagram: SQL AST or DataFrame → Unresolved Logical Plan →(Analysis, using the Catalog)→ Logical Plan →(Logical Optimization)→ Optimized Logical Plan →(Physical Planning)→ Physical Plans →(Cost Model)→ Selected Physical Plan →(Code Generation)→ RDDs]

DataFrames and SQL share the same optimization/execution pipeline
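One way to watch this pipeline for a concrete query is DataFrame.explain; a small sketch, with a hypothetical people table:

val df = sqlContext.table("people").where("id = 1").select("name")
df.explain(true)   // prints the parsed, analyzed and optimized logical plans plus the physical plan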
39. An example query

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People
40. Naïve Query Planning

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Physical Plan (direct translation):
  Project name
    Filter id = 1
      Project id,name
        TableScan People
41. Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Instead write simple rules:
• Each rule makes one change
• Run many rules together to fixed point (see the sketch below)

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Physical Plan:
  IndexLookup id = 1
    return: name
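A minimal sketch of "run many rules together to fixed point", with an illustrative Plan type parameter rather than Catalyst's actual classes:

// Apply every rule repeatedly until the plan stops changing (or we hit the iteration cap).
// Assumes Plan has a sensible equality (e.g. case classes).
def fixedPoint[Plan](plan: Plan, rules: Seq[Plan => Plan], maxIterations: Int = 100): Plan = {
  var current = plan
  var iteration = 0
  var changed = true
  while (changed && iteration < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current
    current = next
    iteration += 1
  }
  current
}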
42. Prior Work: Optimizer Generators

Volcano / Cascades:
• Create a custom language for expressing rules that rewrite trees of relational operators.
• Build a compiler that generates executable code for these rules.

Cons:
• Developers need to learn this custom language.
• Language might not be powerful enough.
43. TreeNode Library
Easily transformable trees of operators
• Standard collection functionality – foreach, map, collect, etc.
• transform function – recursive modification of tree fragments that match a pattern.
• Debugging support – pretty printing, splicing, etc.
44. Tree Transformations
Developers express tree transformations as a PartialFunction[TreeType, TreeType]:
1. If the function does apply to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children.
(A toy sketch of these semantics follows.)
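A toy sketch of exactly these three rules, independent of Catalyst (the Node class is made up for illustration):

// Minimal tree: a node with a name and children.
case class Node(name: String, children: Seq[Node] = Nil) {
  // Apply rule to this node if it is defined for it (rule 1),
  // otherwise keep the node unchanged (rule 2); then recurse into children (rule 3).
  def transform(rule: PartialFunction[Node, Node]): Node = {
    val applied = if (rule.isDefinedAt(this)) rule(this) else this
    applied.copy(children = applied.children.map(_.transform(rule)))
  }
}

// Example: rename every node called "filter".
val tree = Node("project", Seq(Node("filter", Seq(Node("scan")))))
val rewritten = tree.transform { case n @ Node("filter", _) => n.copy(name = "pushedFilter") }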
45. Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.

Original Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Filter Push-Down:
  Project name
    Project id,name
      Filter id = 1
        People
46. Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
47. Filter Push Down Transformation
(Same rule as above.)
Partial Function: the { case … } block passed to transform.
Tree: queryPlan, the plan being transformed.
48. Filter Push Down Transformation
(Same rule as above.)
Find Filter on Project: the pattern Filter(_, p @ Project(_, grandChild)) matches a Filter whose child is a Project.
49. Filter Push Down Transformation
(Same rule as above.)
Check that the filter can be evaluated without the result of the project: the guard if f.references subsetOf grandChild.output.
50. Filter Push Down Transformation
(Same rule as above.)
If so, switch the order: p.copy(child = f.copy(child = grandChild)).
51. Filter Push Down Transformation
(Same rule as above.)
Scala: Pattern Matching – the case clause destructures the operator tree.
52. Filter Push Down Transformation
(Same rule as above.)
Catalyst: Attribute Reference Tracking – f.references and grandChild.output.
53. Filter Push Down Transformation
(Same rule as above.)
Scala: Copy Constructors – p.copy(…) and f.copy(…) build the rearranged operators.
54. Optimizing with Rules

Original Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Filter Push-Down:
  Project name
    Project id,name
      Filter id = 1
        People

Combine Projection:
  Project name
    Filter id = 1
      People

Physical Plan:
  IndexLookup id = 1
    return: name
55. Future Work – Project Tungsten
Consider “abcd” – 4 bytes with UTF8 encoding

java.lang.String object internals:
 OFFSET  SIZE    TYPE  DESCRIPTION       VALUE
      0     4          (object header)   ...
      4     4          (object header)   ...
      8     4          (object header)   ...
     12     4  char[]  String.value      []
     16     4     int  String.hash       0
     20     4     int  String.hash32     0
Instance size: 24 bytes (reported by Instrumentation API)
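The 4-byte figure is easy to check directly; the rest of the overhead comes from the object layout in the table above, not from this snippet:

// "abcd" encodes to 4 bytes in UTF-8, yet the String object alone occupies
// 24 bytes, before counting the separate char[] value array it points to.
val utf8Length = "abcd".getBytes("UTF-8").length   // 4
println(utf8Length)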
56. Project Tungsten
Overcome JVM limitations:
• Memory Management and Binary Processing:
leveraging application semantics to manage
memory explicitly and eliminate the overhead of
JVM object model and garbage collection
• Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
• Code generation: using code generation to exploit
modern compilers and CPUs
57. Questions?
Learn more at: http://spark.apache.org/docs/latest/
Get Involved: https://github.com/apache/spark