In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.
1.
Beyond SQL:
Speeding up Spark with DataFrames
Michael Armbrust - @michaelarmbrust
March 2015 – Spark Summit East
2.
About Me and SQL
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
[Charts: # of Unique Contributors and # of Commits Per Month; Spark SQL graduated from Alpha in 1.3]
3.
About Me and SQL
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments

SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
4.
About Me and SQL
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
5.
About Me and SQL
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, and Java
6.
About Me and SQL
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, and Java
@michaelarmbrust
• Lead developer of Spark SQL @databricks
7.
The not-so-secret truth...
Spark SQL is not about SQL.
9.
The not-so-secret truth...
Spark SQL is about more than SQL.
10.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
11.
DataFrame
noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
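To make definition (1) concrete, here is a minimal sketch that builds a small DataFrame from local rows; the column names, the data, and the 1.3-era SQLContext variable sqlCtx used in the sketches that follow are illustrative assumptions.

# Minimal sketch: a DataFrame is a distributed collection of rows with
# named columns. The data below is made up for illustration.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-sketch")
sqlCtx = SQLContext(sc)

people = sqlCtx.createDataFrame(
    [("Alice", 34), ("Bob", 45)],   # rows
    ["name", "age"])                # named columns

people.printSchema()   # name: string, age: bigint
people.show()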
12.
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
[Diagram: built-in and external data source formats, including { JSON }, JDBC, and more…]
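As a rough illustration (not from the slides), the same DataFrame can be loaded from one format and saved to another with the 1.3-era load/save API; the paths, JDBC URL, and table name below are placeholders.

events = sqlCtx.load("/data/events.json", "json")   # built-in JSON source
events.save("/data/events_parquet", "parquet")      # write out as Parquet

# External sources such as JDBC plug into the same API via a source name
# and options; the connection details here are invented for the sketch.
users = sqlCtx.load(source="jdbc",
                    url="jdbc:postgresql://host/db",
                    dbtable="users")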
13.
Write Less Code: High-Level Operations
Common operations can be expressed concisely as calls to the DataFrame API, as sketched below:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
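A minimal sketch of these four operations, assuming hypothetical "users" (user_id, name, age, state) and "events" (user_id, amount) tables registered with sqlCtx:

from pyspark.sql.functions import avg

users = sqlCtx.table("users")
events = sqlCtx.table("events")

# Selecting required columns
names = users.select("name", "age")

# Joining different data sources
joined = events.join(users, events.user_id == users.user_id)

# Aggregation (count, sum, average, etc.)
per_state = joined.groupBy("state").agg(avg("amount"))
counts = joined.groupBy("state").count()

# Filtering
adults = users.where(users.age >= 18)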
14.
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
15.
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people")
.groupBy("name")
.agg("name",
avg("age"))
.collect()
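For reference, a self-contained version of the DataFrame snippet might look like the sketch below; the people.json file, its name/age columns, and the avg_age alias are assumptions, and sqlCtx is the SQLContext from the earlier sketches.

from pyspark.sql.functions import avg

people = sqlCtx.jsonFile("people.json")   # assumed columns: name, age
people.registerTempTable("people")        # so sqlCtx.table("people") resolves

result = (sqlCtx.table("people")
                .groupBy("name")
                .agg(avg("age").alias("avg_age"))
                .collect())
print(result)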
Full API Docs
• Python
• Scala
• Java
16.
Not Just Less Code: Faster Implementations
[Chart: Time to Aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
17.
Demo: Data Sources API
Using Spark SQL to read, write, and transform data in a variety of formats.
http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
18.
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically (a sketch follows this list):
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)¹
• Skipping data using statistics (e.g., min, max)²
• Pushing predicates into storage systems (e.g., JDBC)
¹ Only supported for Parquet and Hive; more support coming in Spark 1.4.
² Turned off by default in Spark 1.3.
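A minimal sketch of the idea, assuming a hypothetical events dataset stored as Parquet and partitioned by year and month (paths and column names are invented):

events = sqlCtx.load("/data/events", "parquet")

# Filters on partition and data columns are pushed toward the storage
# layer, so only matching partitions (and, where supported, data passing
# the min/max statistics) need to be read.
feb_2014 = (events
            .where((events.year == 2014) & (events.month == 2))
            .select("user_id", "timestamp"))

feb_2014.explain()   # inspect the physical plan that was selected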
19.
Plan Optimization & Execution
[Diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
DataFrames and SQL share the same optimization/execution pipeline
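One easy way to watch this pipeline at work is DataFrame.explain(); the table name and filter below are illustrative.

users = sqlCtx.table("users")
adults = users.where(users.age >= 18).select("name")

# extended=True prints the parsed, analyzed, and optimized logical plans
# in addition to the chosen physical plan.
adults.explain(True)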
20.
Optimization happens as late as possible, so Spark SQL can optimize even across functions.
21.
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)    # Join on user_id
            .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()
[Diagram: Logical Plan: a filter above a join of the events file and the users table; the join is expensive, so ideally we only join the relevant users. Physical Plan: join of scan (events) with a filter over scan (users).]
22.
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load partitioned Hive table
    return (events
            .join(u, events.user_id == u.user_id)    # Join on user_id
            .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York").select(events.timestamp).collect()

[Diagram: Logical Plan: a filter above a join of the events file and the users table. Physical Plan: join of scan (events) with a filter over scan (users). Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users).]
23.
Machine Learning Pipelines
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: Pipeline Model: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3 (the lr estimator becomes lr.model after fitting)]
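A hedged usage sketch of the fitted pipeline on new data; the test path and the "prediction" output column name are assumptions.

test_df = sqlCtx.load("/path/to/test/data")

# transform() runs tokenizer -> hashingTF -> lr.model over the DataFrame,
# appending the model's output columns (typically a "prediction" column).
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()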
24.
Create and Run Spark Programs Faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
Questions?