DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks

DataFrames: Spark’s new abstraction for data science
Reynold Xin @rxin
LA meetup – July 9, 2015

Reynold Xin
Spark committer
Databricks co-founder
Berkeley AMPLab PhD on leave

Spark’s Growth
Google Trends for “Apache Spark”

Google Trends for “dataframe”
Single-node tabular data structure, with API for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
• ad infinitum

Data frame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#> year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013 1 1 517 2 830 11 UA N14228
#> 2 2013 1 1 533 4 850 20 UA N24211
#> 3 2013 1 1 542 2 923 33 AA N619AA
#> 4 2013 1 1 544 -1 1004 -18 B6 N804JB
#> .. ... ... ... ... ... ... ... ... ...

Spark DataFrame
> head(filter(df, df$waiting < 50)) # an example in R
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Distributed data frame for Java, Python, R, Scala
APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn

Question: When does Reynold usually go to bed?

Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
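# Read: pick a format, set options, then load.
df = (sqlContext.read
    .format("json")
    .option("samplingRatio", "0.1")
    .load("/home/michael/data.json"))

# Write: pick a format, save mode and partitioning, then save.
(df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year")
    .saveAsTable("fasterData"))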

read and write functions create new builders for doing I/O

Builder methods specify:
• Format
• Partitioning
• Handling of existing data

load(…), save(…) or saveAsTable(…) finish the I/O specification
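The builder idea on these slides can be sketched in plain Python. This is only an illustration of the pattern, assuming nothing about how Spark implements it: `ToyWriter` and the dictionary it returns are hypothetical names, and no data is actually written.

```python
class ToyWriter:
    """Hypothetical stand-in for a write builder; not Spark's implementation."""

    def __init__(self, rows):
        self._rows = rows
        self._format = None
        self._mode = "error"        # assumed default for this toy only
        self._partition_cols = []

    def format(self, name):
        self._format = name
        return self                 # returning self is what allows chaining

    def mode(self, name):
        self._mode = name           # how to handle existing data
        return self

    def partitionBy(self, *cols):
        self._partition_cols = list(cols)
        return self

    def saveAsTable(self, table):
        # Terminal call: the specification is complete only at this point.
        return {
            "table": table,
            "format": self._format,
            "mode": self._mode,
            "partition_cols": self._partition_cols,
            "num_rows": len(self._rows),
        }


spec = (ToyWriter([{"year": 2015, "value": 1}])
        .format("parquet")
        .mode("append")
        .partitionBy("year")
        .saveAsTable("fasterData"))
```

Each builder method only records a setting; the terminal call is the single place where the collected settings are acted on.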

DataFrame can read and write a variety of formats.
[Figure: built-in and external data sources, including JSON, JDBC, and more…]
Find more sources at http://spark-packages.org/

Machine Learning Pipelines
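from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Split text into words, hash the words into feature vectors, then fit a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)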
[Diagram: Pipeline Model: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3; lr.model is produced by fitting lr]

[Figure: data size scale (KB, MB, GB, TB, PB), contrasting the range of existing single-node data frames with that of Spark DataFrame]

It is not Spark vs Python/R,
but Spark and Python/R.

Spark and Python/R
• Spark DF: scalability (multi-core, multi-machine)
• Python/R DF: wealth of libraries (viz, machine learning, stats)

Spark RDD Execution
• Java/Scala API → JVM execution
• Python API → Python execution
• Computation is expressed as opaque closures (user-defined functions)

Spark DataFrame Execution
DataFrame → Logical Plan → Catalyst optimizer → Physical Execution
Logical Plan: intermediate representation for computation

Spark DataFrame Execution
Python DF, Java/Scala DF, R DF → Logical Plan → Catalyst optimizer → Physical Execution
Logical Plan: intermediate representation for computation
Language DataFrames: simple wrappers to create logical plan

Benefit of Logical Plan: Simpler Frontend
• Python: ~2000 lines of code (built over a weekend)
• R: ~1000 lines of code
• i.e. much easier to add new language bindings (Julia, Clojure, …)
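A rough sketch of why such a frontend can stay small: each DataFrame method only wraps its input in a new plan node and never touches data. The classes below (`Scan`, `Filter`, `Join`, `ToyDataFrame`) are hypothetical, and predicates are kept as plain strings for brevity; this is not Catalyst's actual representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: str


class ToyDataFrame:
    """Thin wrapper: every method returns a new wrapper around a larger plan."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, predicate):
        return ToyDataFrame(Filter(self.plan, predicate))

    def join(self, other, condition):
        return ToyDataFrame(Join(self.plan, other.plan, condition))


users = ToyDataFrame(Scan("users"))
events = ToyDataFrame(Scan("events"))
filtered = (users.join(events, "users.id = events.uid")
                 .filter("events.date >= '2015-01-01'"))
```

In this toy, `filtered.plan` is a `Filter` over a `Join` over two `Scan` nodes. A binding in another language would only need to build the same tree and hand it to a shared optimizer and execution engine.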

Performance
[Bar chart: runtime for an example aggregation workload, axis 0 to 10; RDD: Java/Scala, Python]

Benefit of Logical Plan: Performance Parity Across Languages
[Bar chart: runtime for an example aggregation workload (secs), axis 0 to 10; RDD: Java/Scala, Python; DataFrame: Java/Scala, Python, R, SQL]

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

logical plan
filter
  join   (this join is expensive)
    scan (users)
    scan (events)

physical plan
join
  scan (users)
  filter
    scan (events)

More Than Naïve Scans
• Data Sources API can automatically prune columns and push filters to the source
– Parquet: skip irrelevant columns and blocks of data; turn string comparison into integer comparisons for dictionary encoded data
– JDBC: Rewrite queries to push predicates down
• The fastest way to process data is to skip it.
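The point about dictionary-encoded data can be illustrated with a toy column. This is a sketch of the general idea only, assuming a simplified in-memory layout rather than Parquet's actual file format; `filter_equals` is a hypothetical helper.

```python
# A dictionary-encoded column: each distinct string is stored once,
# and rows store small integer codes that index into the dictionary.
dictionary = ["AA", "B6", "UA"]
codes = [2, 2, 0, 1]  # decodes to UA, UA, AA, B6


def filter_equals(dictionary, codes, value):
    """Return the row indices where the column equals `value`."""
    try:
        target = dictionary.index(value)  # the only string comparisons happen here
    except ValueError:
        return []  # value is not in the dictionary, so the whole block is skipped
    # From here on, every per-row comparison is integer against integer.
    return [i for i, code in enumerate(codes) if code == target]


matches = filter_equals(dictionary, codes, "UA")
missing = filter_equals(dictionary, codes, "DL")
```

Here `matches` holds the two row positions encoded as `UA`, and `missing` is empty because `DL` never appears in the dictionary, so no row is examined at all.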

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

logical plan
filter
  join
    scan (users)
    scan (events)

optimized plan
join
  scan (users)
  filter
    scan (events)

optimized plan with intelligent data sources
join
  scan (users)
  filter scan (events)
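The rewrite from the logical plan to the optimized plan can be sketched as a single tree-rewriting rule. This is a toy under stated assumptions, not Catalyst's implementation: the node classes and `push_down` are hypothetical, the join is assumed to be an inner join, and each filter is assumed to reference exactly one table, named in its `table` field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    table: str       # the one table the predicate refers to (toy simplification)
    predicate: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: str


def tables(plan):
    """Collect the table names scanned anywhere under `plan`."""
    if isinstance(plan, Scan):
        return {plan.table}
    if isinstance(plan, Filter):
        return tables(plan.child)
    return tables(plan.left) | tables(plan.right)


def push_down(plan):
    """Move a filter below an inner join, onto the side that owns its table."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        if plan.table in tables(join.left):
            pushed = Filter(join.left, plan.table, plan.predicate)
            return Join(push_down(pushed), push_down(join.right), join.condition)
        if plan.table in tables(join.right):
            pushed = Filter(join.right, plan.table, plan.predicate)
            return Join(push_down(join.left), push_down(pushed), join.condition)
    if isinstance(plan, Filter):
        return Filter(push_down(plan.child), plan.table, plan.predicate)
    if isinstance(plan, Join):
        return Join(push_down(plan.left), push_down(plan.right), plan.condition)
    return plan


logical = Filter(
    Join(Scan("users"), Scan("events"), "users.id = events.uid"),
    "events",
    "date > '2015-01-01'",
)
optimized = push_down(logical)
```

After the rewrite the filter sits directly above the scan of `events`, so fewer rows reach the join. A real optimizer applies many such rules repeatedly until the plan stops changing; this sketch makes a single pass.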

From DataFrame to Tungsten
Python DF, Java/Scala DF, R DF → Logical Plan → Tungsten Execution
• Code generation
• Cache-efficient algorithms
• Binary processing

Initial Performance Results
[Line chart: runtime (seconds, axis 0 to 1200) vs. relative data set size (1x, 2x, 4x, 8x); series: Tungsten-off, Tungsten-on]

[Diagram: Python, SQL, R, Streaming, and Advanced Analytics sit on top of DataFrame, which runs on Tungsten Execution]
