A DataFrame Abstraction Layer for SparkR
Chris Freeman (Alteryx)
Agenda
• What is SparkR?
• History of DataFrames
• Why DataFrames?
• How do DataFrames work?
• Demo
• On the Roadmap
• Questions
2
What is SparkR?
• New R language API for Spark and SparkSQL
• Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
• Has its own shell, but can also be imported like a standard R package and used with RStudio (see the sketch below).
3
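For example, a minimal sketch of starting SparkR from R or RStudio, assuming the Spark 1.4-era entry points sparkR.init and sparkRSQL.init:
library(SparkR)
# Create a SparkContext, then a SQLContext for DataFrame work
sc <- sparkR.init(master = "local")
sqlCtx <- sparkRSQL.init(sc)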
What is SparkR?
• An opportunity to make Spark accessible to the
large community of R developers who already
have clear ideas about how to do analytics in R
• No need to learn a new programming paradigm
when working with Spark
4
History of DataFrames
• SparkR began as an R package that ported
Spark’s core functionality (RDDs) to the R
language.
• The next logical step was to add SparkSQL and
SchemaRDDs.
• Initial implementation of SQLContext and
SchemaRDDs working in SparkR
5
History of DataFrames
[Slides 6–9: screenshots]
Me:
Reynold:
Maybe this isn’t such a bad thing…
10
How can I use Spark to do something
simple?
"Michael, 29"
"Andy, 30"
"Justin, 19"
"Bob, 22"
"Chris, 28"
"Garth, 36"
"Tasha, 24"
"Mac, 30"
"Neil, 32"
11
Let’s say we wanted to do this with regular RDDs. What would that look like?
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD, function(line) {
  strsplit(line, ", ")
})
12
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD, function(line) {
  strsplit(line, ", ")
})
ageInt <- lapply(lines, function(line) {
  as.numeric(line[2])
})
13
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD, function(line) {
  strsplit(line, ", ")
})
ageInt <- lapply(lines, function(line) {
  as.numeric(line[2])
})
sum <- reduce(ageInt, function(x, y) { x + y })
14
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD, function(line) {
  strsplit(line, ", ")
})
ageInt <- lapply(lines, function(line) {
  as.numeric(line[2])
})
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)
15
There’s got to be a better way.
17
What I’d hoped to see
{"name":"Michael", "age":29}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":22}
{"name":"Chris", "age":28}
{"name":"Garth", "age":36}
{"name":"Tasha", "age":24}
{"name":"Mac", "age":30}
{"name":"Neil", "age":32}
18
What I’d hoped to see
df <- read.df(sqlCtx, "people.json", "json")
19
What I’d hoped to see
df <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))
20
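Note that avg is still a distributed DataFrame at this point; a quick sketch of pulling the single value back into the local R session using the standard SparkR actions:
# Materialize the one-row result locally
head(avg)      # or collect(avg)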
Why DataFrames?
• Uses the distributed, parallel capabilities offered
by RDDs, but imposes a schema on the data
• More structure == Easier access and
manipulation
• Natural extension of existing R conventions since
DataFrames are already the standard
21
Why DataFrames?
• Super awesome distributed, in-memory
collections
22
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
23
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
• ????
24
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
• ????
• Profit
25
DataFrames in SparkR
• Multiple Components:
– A set of native S4 classes and methods that live inside a standard R package (sketched after this list)
– A SparkR backend that passes data
structures and method calls to the JVM
– A set of “helper” methods written in Scala
26
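A minimal sketch of that pattern (illustrative names, not the actual SparkR source): an S4 class holds a reference to the JVM-side DataFrame, and each R method forwards the call through the backend bridge. callJMethod below stands in for SparkR's internal JVM bridge.
# Illustrative only -- class, slot, and method names are assumptions
setClass("SparkDataFrame", representation(sdf = "ANY", env = "environment"))

setGeneric("numRows", function(x) standardGeneric("numRows"))
setMethod("numRows", signature(x = "SparkDataFrame"), function(x) {
  # Forward the call to the JVM-side DataFrame object
  callJMethod(x@sdf, "count")
})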
Why does the structure matter?
• Native R classes allow us to extend the existing
DataFrame API by adding R-like syntax and
interactions
• Handoff to the JVM gives us full access to Spark’s DAG capabilities and Catalyst optimizations, e.g. constant folding, predicate pushdown, and code generation (see the explain() sketch below).
27
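One way to peek at those optimizations, assuming SparkR's explain() action is available (as in Spark 1.4):
# Print the logical and physical plans Catalyst produces for the query
explain(select(df, avg(df$age)), extended = TRUE)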
SparkR DataFrame Features
• Column access using ‘$’ or ‘[ ]’ just like in R
• dplyr-like DataFrame manipulation (examples below):
– filter
– groupBy
– summarize
– mutate
• Access to external R packages that extend R
syntax
28
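A few hedged examples of that syntax, assuming df is the people DataFrame loaded earlier:
adults <- df[df$age > 21, ]                    # '[ ]' indexing, like base R
adults <- filter(df, df$age > 21)              # equivalent filter() call
byAge  <- groupBy(df, df$age)
counts <- summarize(byAge, n = count(df$age))  # rows per age
older  <- mutate(df, ageNextYear = df$age + 1) # add a derived column
head(counts)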
Demo Time!
29
On the Roadmap
• Spark 1.4: SparkR becomes an official API
– Primarily focused on SparkSQL/DataFrame
implementation
• Spark 1.5: Extend SparkR to include machine learning
capabilities (e.g. sparkML)
• For more information, be sure to check out “SparkR: The
Past, Present, and Future” at 4:30 on the Data Science
track.
30
Integration with Alteryx
• Drag-and-drop GUI for data analysis
• Spark functionality built directly into existing
tools using SparkR
• Interact with a remote Spark cluster from your
desktop via Alteryx Designer
• Combine local and in-database data sources in
one workflow.
31
Developer Community
• SparkR originated at the UC Berkeley AMPLab, with additional contributions from Alteryx, Intel, Databricks, and others.
• Working on integration with Spark Packages
– Easily extend Spark with new functionality and
distribute via the Spark Package repository
32
Questions?
Slides, Demo, and Data available on GitHub at:
https://github.com/cafreeman/SparkR_DataFrame_Demo
@15lettermax
cafreeman
33