Spark SQL Deep Dive @ Melbourne Spark Meetup
1. Spark SQL Deep Dive
Michael Armbrust
Melbourne Spark Meetup – June 1st 2015
2. What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop, included in all major distros
Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
3. Spark Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory
or disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
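A minimal sketch of this model, assuming an existing SparkContext named sc (as in the shell):

// Transformations (filter, map) lazily build a lineage of RDDs;
// a lost cached partition can be rebuilt from that lineage.
val nums    = sc.parallelize(1 to 1000000)
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)
squares.cache()               // keep in memory across the cluster
println(squares.count())      // an action triggers the computation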
4. More than Map & Reduce
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
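A small sketch using a few of these beyond map and reduce; sc is an existing SparkContext and the data is made up:

val pets   = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
val owners = sc.parallelize(Seq(("cat", "alice"), ("dog", "bob")))

pets.reduceByKey(_ + _)        // ("cat", 3), ("dog", 1)
    .join(owners)              // ("cat", (3, "alice")), ("dog", (1, "bob"))
    .sortByKey()
    .collect()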
5. On-Disk Sort Record: Time to sort 100 TB

2013 Record: Hadoop – 72 minutes on 2100 machines
2014 Record: Spark – 23 minutes on 207 machines
Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org
6. Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE: Tencent (1 PB+ /day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1 PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1 PB)
MOST INTERESTING APP: Jeremy Freeman – Mapping the Brain at Scale (with lasers!)
LARGEST CLUSTER: Tencent (8000+ nodes)

Based on Reynold Xin’s personal knowledge
7. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()          // Action
messages.filter(_.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one block of lines (Block 1-3), builds and caches its partition of messages (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
14. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)

[Charts: # of commits per month and # of contributors, 2014-03 through 2015-06]
15. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
16. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
17. About SQL
Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
> Bindings in Python, Scala, and Java
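The JDBC route goes through Spark SQL's HiveServer2-compatible Thrift server. A minimal sketch of querying it from plain JDBC, assuming the server has been started (e.g. with sbin/start-thriftserver.sh) on its default port and the Hive JDBC driver is on the classpath:

import java.sql.DriverManager

object JdbcExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")   // register the Hive JDBC driver
    // Connect to the Thrift JDBC server on its default port.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM hiveTable")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()
  }
}

Any BI tool that speaks JDBC/ODBC connects the same way, just with its own driver configuration dialog.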
19. SQL: The whole story
Create and Run Spark Programs Faster:
> Write less code
> Read less data
> Let the optimizer do the hard work
20. DataFrame
noun – [dey-tuh-freym]
1. A distributed collection of rows
organized into named columns.
2. An abstraction for selecting, filtering,
aggregating and plotting structured
data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD
(cf. Spark < 1.3).
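A minimal sketch of the verbs in this definition (select, filter, aggregate), assuming a sqlContext and a small JSON file of people records; the path and column names are illustrative:

// Build a DataFrame from JSON, then select, filter and aggregate it.
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.filter(df("age") > 21)        // filtering
  .select("name", "age")         // selecting
  .groupBy("name")               // aggregating
  .count()
  .show()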
21. Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Diagram: built-in sources ({ JSON }, JDBC, …) and external sources, and more…]
22. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json")

df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData")
23. Write Less Code: Input & Output
(Same read/write example as above.)
read and write functions create new builders for doing I/O
24. Write Less Code: Input & Output
(Same read/write example as above.)
Builder methods are used to specify:
• Format
• Partitioning
• Handling of existing data
• and more
25. Write Less Code: Input & Output
(Same read/write example as above.)
load(…), save(…) or saveAsTable(…) finish the builder chain and perform the actual I/O
26. ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.git")
  .option("url", "https://github.com/apache/spark.git")
  .option("numPartitions", "100")
  .option("branches", "master,branch-1.3,branch-1.2")
  .load()
  .repartition(1)
  .write
  .format("json")
  .save("/home/michael/spark.json")
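format("com.databricks.spark.git") is resolved to a data source implementation. A rough sketch of what a custom source can look like under the Spark 1.3-era Data Source API; the package, class, and "rows" option below are made up for illustration and are not the real spark-git source:

package com.example.constant   // hypothetical package; used via .format("com.example.constant")

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Spark looks for a DefaultSource class in the package named by format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new ConstantRelation(parameters("rows").toInt)(sqlContext)
}

// A relation that produces n single-column rows; a real source would read external data.
class ConstantRelation(n: Int)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", IntegerType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}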
27. Write Less Code: Powerful Operations
Common operations can be expressed concisely as calls to the DataFrame API (see the sketch below):
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
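A minimal sketch combining the four operations, assuming a sqlContext and two hypothetical tables (users, events) with the columns used below:

import org.apache.spark.sql.functions.avg

val users  = sqlContext.table("users")                    // hypothetical tables
val events = sqlContext.table("events")

users.join(events, users("id") === events("user_id"))     // joining different data sources
  .filter(users("age") > 21)                               // filtering
  .select(users("city"), events("duration"))               // selecting required columns
  .groupBy(users("city"))                                  // aggregation
  .agg(avg(events("duration")))
  .collect()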
28. Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
29. Write Less Code: Compute an Average

Using RDDs

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames

sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()

Using SQL

SELECT name, avg(age)
FROM people
GROUP BY name
30. Not Just Less Code: Faster Implementations

[Bar chart: time to aggregate 10 million int pairs (secs), 0-10 scale, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
31. Demo: Data Frames
Using Spark SQL to read, write, slice and dice your data using simple functions
32. Read Less Data
Spark SQL can help you read less data automatically (sketched below):
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)
• Skipping data using statistics (e.g., min, max)
• Pushing predicates into storage systems (e.g., JDBC)
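A rough sketch of the partitioning and predicate-pushdown points, assuming a DataFrame df with year, month and value columns (all hypothetical):

// Write partitioned Parquet: directories like /data/events/year=2014/month=2/...
df.write
  .format("parquet")
  .partitionBy("year", "month")
  .save("/data/events")

// A filter on the partition columns lets Spark SQL prune whole directories,
// and remaining predicates can be pushed into the Parquet reader.
sqlContext.read.format("parquet").load("/data/events")
  .filter("year = 2014 AND month = 2")
  .count()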
33. Optimization happens as late as possible, therefore Spark SQL can optimize across functions.
34.
def add_demographics(events):
    u = sqlCtx.table("users")                      # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)      # Join on user_id
        .withColumn("city", zipToCity(u.zip)))     # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table – the join is expensive; only the relevant users need to be joined]
[Physical Plan: join(scan(events), filter(scan(users)))]
35.
def add_demographics(events):
    u = sqlCtx.table("users")                      # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)      # Join on user_id
        .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical Plan: filter over a join of the events file with the users table]
[Physical Plan: join(scan(events), filter(scan(users)))]
[Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))]
36. Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: Pipeline Model – df0 →tokenizer→ df1 →hashingTF→ df2 →lr→ df3, producing lr.model]
37. So how does it all work?
38. Plan Optimization & Execution

[Diagram: SQL AST or DataFrame → Unresolved Logical Plan →(Analysis, using the Catalog)→ Logical Plan →(Logical Optimization)→ Optimized Logical Plan →(Physical Planning)→ Physical Plans →(Cost Model)→ Selected Physical Plan →(Code Generation)→ RDDs]

DataFrames and SQL share the same optimization/execution pipeline
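One way to watch this pipeline for a concrete query is DataFrame.explain; a small sketch, with a hypothetical people table:

val df = sqlContext.table("people").where("id = 1").select("name")
df.explain(true)   // prints the parsed, analyzed and optimized logical plans plus the physical plan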
39. An example query

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People
40. Naïve Query Planning

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Physical Plan (direct translation):
  Project name
    Filter id = 1
      Project id,name
        TableScan People
41. Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Instead write simple rules:
• Each rule makes one change
• Run many rules together to fixed point (see the sketch below)

Logical Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Physical Plan:
  IndexLookup id = 1
    return: name
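A minimal sketch of "run many rules together to fixed point", with an illustrative Plan type parameter rather than Catalyst's actual classes:

// Apply every rule repeatedly until the plan stops changing (or we hit the iteration cap).
// Assumes Plan has a sensible equality (e.g. case classes).
def fixedPoint[Plan](plan: Plan, rules: Seq[Plan => Plan], maxIterations: Int = 100): Plan = {
  var current = plan
  var iteration = 0
  var changed = true
  while (changed && iteration < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current
    current = next
    iteration += 1
  }
  current
}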
42. Prior Work: Optimizer Generators

Volcano / Cascades:
• Create a custom language for expressing rules that rewrite trees of relational operators.
• Build a compiler that generates executable code for these rules.

Cons:
• Developers need to learn this custom language.
• Language might not be powerful enough.
43. TreeNode Library
Easily transformable trees of operators
• Standard collection functionality – foreach, map, collect, etc.
• transform function – recursive modification of tree fragments that match a pattern.
• Debugging support – pretty printing, splicing, etc.
44. Tree Transformations
Developers express tree transformations as a PartialFunction[TreeType, TreeType]:
1. If the function does apply to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children.
(A toy sketch of these semantics follows.)
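A toy sketch of exactly these three rules, independent of Catalyst (the Node class is made up for illustration):

// Minimal tree: a node with a name and children.
case class Node(name: String, children: Seq[Node] = Nil) {
  // Apply rule to this node if it is defined for it (rule 1),
  // otherwise keep the node unchanged (rule 2); then recurse into children (rule 3).
  def transform(rule: PartialFunction[Node, Node]): Node = {
    val applied = if (rule.isDefinedAt(this)) rule(this) else this
    applied.copy(children = applied.children.map(_.transform(rule)))
  }
}

// Example: rename every node called "filter".
val tree = Node("project", Seq(Node("filter", Seq(Node("scan")))))
val rewritten = tree.transform { case n @ Node("filter", _) => n.copy(name = "pushedFilter") }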
45. Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.

Original Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Filter Push-Down:
  Project name
    Project id,name
      Filter id = 1
        People
46. Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
47. Filter Push Down Transformation
(Same rule as above.)
Partial Function: the { case … } block passed to transform.
Tree: queryPlan, the plan being transformed.
48. Filter Push Down Transformation
(Same rule as above.)
Find Filter on Project: the pattern Filter(_, p @ Project(_, grandChild)) matches a Filter whose child is a Project.
49. Filter Push Down Transformation
(Same rule as above.)
Check that the filter can be evaluated without the result of the project: the guard if f.references subsetOf grandChild.output.
50. Filter Push Down Transformation
(Same rule as above.)
If so, switch the order: p.copy(child = f.copy(child = grandChild)).
51. Filter Push Down Transformation
(Same rule as above.)
Scala: Pattern Matching – the case clause destructures the operator tree.
52. Filter Push Down Transformation
(Same rule as above.)
Catalyst: Attribute Reference Tracking – f.references and grandChild.output.
53. Filter Push Down Transformation
(Same rule as above.)
Scala: Copy Constructors – p.copy(…) and f.copy(…) build the rearranged operators.
54. Optimizing with Rules

Original Plan:
  Project name
    Filter id = 1
      Project id,name
        People

Filter Push-Down:
  Project name
    Project id,name
      Filter id = 1
        People

Combine Projection:
  Project name
    Filter id = 1
      People

Physical Plan:
  IndexLookup id = 1
    return: name
55. Future Work – Project Tungsten
Consider “abcd” – 4 bytes with UTF8 encoding

java.lang.String object internals:
 OFFSET  SIZE    TYPE  DESCRIPTION       VALUE
      0     4          (object header)   ...
      4     4          (object header)   ...
      8     4          (object header)   ...
     12     4  char[]  String.value      []
     16     4     int  String.hash       0
     20     4     int  String.hash32     0
Instance size: 24 bytes (reported by Instrumentation API)
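The 4-byte figure is easy to check directly; the rest of the overhead comes from the object layout in the table above, not from this snippet:

// "abcd" encodes to 4 bytes in UTF-8, yet the String object alone occupies
// 24 bytes, before counting the separate char[] value array it points to.
val utf8Length = "abcd".getBytes("UTF-8").length   // 4
println(utf8Length)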
56. Project Tungsten
Overcome JVM limitations:
• Memory Management and Binary Processing:
leveraging application semantics to manage
memory explicitly and eliminate the overhead of
JVM object model and garbage collection
• Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
• Code generation: using code generation to exploit
modern compilers and CPUs
57. Questions?
Learn more at: http://spark.apache.org/docs/latest/
Get Involved: https://github.com/apache/spark