SparkR
- Play Spark Using R
Gil Chen
@HadoopCon 2016
Demo: http://goo.gl/VF77ad
about me
• R, Python & MATLAB User
• Taiwan R User Group
• Taiwan Spark User Group
• Co-founder
• Data Scientist @
HadoopCon 2015
Outline
• Introduction to SparkR
• Demo
• Starting to use SparkR
• DataFrames: dplyr style, SQL style
• RDD vs. DataFrames
• MLlib: GLM, K-means
• Use Case
• Median: approxQuantile()
• ID Match: dplyr style, SQL style, SparkR function
• SparkR + Shiny
• The Future of SparkR
Introduction to SparkR
Spark Origin
• Apache Spark is an open source cluster computing
framework
• Originally developed at the University of California,
Berkeley's AMPLab
• The first two contributors of SparkR: Shivaram Venkataraman & Zongheng Yang
https://amplab.cs.berkeley.edu/
Spark History
https://en.wikipedia.org/wiki/Apache_Spark
(Figure: Spark release timeline, marking when PySpark, SparkR, and DataFrames were introduced.)
Key Advantages of Spark & R
• Spark: Fast! Flexible, Scalable
• R: Statistical! Interactive, Packages
https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
ggplot2
(Figure: Google search results for ggplot2 charts.)
ggplot2 is a plotting system for R, based on the grammar of graphics.
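A minimal ggplot2 sketch (illustrative, not from the deck; the mpg dataset ships with ggplot2):
library(ggplot2)
# a scatter plot built from grammar-of-graphics layers: data, aesthetics, geometry
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(title = "Engine displacement vs. highway mpg")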
Shiny
http://shiny.rstudio.com/gallery/
and more impressive dashboards…
A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required
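A minimal Shiny app sketch (illustrative, not from the deck):
library(shiny)
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)
server <- function(input, output) {
  # re-renders automatically whenever the slider changes
  output$hist <- renderPlot(hist(rnorm(input$n)))
}
shinyApp(ui, server)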
Performance
https://amplab.cs.berkeley.edu/announcing-sparkr-r-on-spark/
The runtime performance of running group-by
aggregation on 10 million integer pairs on a single
machine in R, Python and Scala.
(using the same dataset as https://goo.gl/iMLXnh)
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
RDD (Resilient Distributed Dataset)
https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD
Internally, each RDD is characterized
by five main properties:
1. A list of partitions
2. A function for computing each split
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
https://docs.cloud.databricks.com/docs/latest/courses
RDD dependencies
• Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, so the task can be executed locally with no shuffle (e.g. map, flatMap, filter, sample).
• Wide dependency: multiple child partitions may depend on one partition of the parent RDD, so data must be shuffled unless the parents are hash-partitioned (e.g. sortByKey, reduceByKey, groupByKey, join). A sketch follows below.
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
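A minimal sketch of the two dependency types using SparkR's internal RDD API (illustrative; assumes an existing SparkContext sc):
rdd <- SparkR:::parallelize(sc, 1:100, 4L)
# narrow: filter is applied within each partition, no shuffle needed
even <- SparkR:::filterRDD(rdd, function(x) x %% 2 == 0)
# map each value to a (key, value) pair
pairs <- SparkR:::lapply(even, function(x) list(x %% 10, x))
# wide: reduceByKey must shuffle so that equal keys meet in one partition
sums <- SparkR:::reduceByKey(pairs, "+", 2L)
SparkR:::collect(sums)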
Job Scheduling
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
(Figure: DAG of stages; black boxes mark partitions that are already in memory.)
RDD Operations
• Transformations (lazy evaluation: nothing executes until an action is called; see the sketch below)
narrow dep.: map(), flatMap(), filter(), mapPartitions(), sample(), union() …
wide dep.: intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup() …
• Actions: reduce(), collect(), count(), first(), take(num), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByValue(), countByKey(), foreach() …
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
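A minimal illustration of lazy evaluation (assumes a SparkContext sc and the word-count file used later in the deck):
rdd <- SparkR:::textFile(sc, "data_word_count.txt")            # no job runs yet
nonEmpty <- SparkR:::filterRDD(rdd, function(x) nchar(x) > 0)  # still lazy
SparkR:::count(nonEmpty)  # action: only now does the whole lineage execute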
RDD Example
(Figure: a chain of RDDs linked by transformations, ending in an action that returns a value to R.)
rdd <- SparkR:::textFile(sc, "data_word_count.txt")
words <- SparkR:::flatMap(rdd, function(line) strsplit(line, " ")[[1]])
wordCount <- SparkR:::lapply(words, function(word) list(word, 1))
counts <- SparkR:::reduceByKey(wordCount, "+", 1L)
op <- SparkR:::collect(counts)
RDD & DataFrames
(Figure: before v1.6 the R shell worked with RDDs, roughly an array with no schema; since v2.0 the primary abstraction is the SparkDataFrame, a data.frame plus schema, with general, transformation, and action methods.)
DataFrames are Faster!
http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/
Beyond SQL: Speeding up Spark with DataFrames
http://www.slideshare.net/databricks/spark-sqlsse2015public
Spark Stack
https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html
(Figure: the Spark stack, bottom-up: Storage, Cluster Manager, Processing Engine, Access & Interfaces.)
How does SparkR work?
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
Upgrading From SparkR 1.6 to 2.0
• data type naming: DataFrame (before 1.6.2) → SparkDataFrame (since 2.0.0)
• read csv: package from Databricks → built-in
• function (like approxQuantile): not available → available
• ML function: glm only → more (or use sparklyr)
• SQLContext / HiveContext: sparkRSQL.init(sc) → merged into sparkR.session()
• execute message: very detailed → simple
• launch on EC2: API available → removed
https://spark.apache.org/docs/latest/sparkr.html
Demo
http://goo.gl/VF77ad
Easy Setting
1. Download
2. Decompress and Give a Path
3. Set Path and Launch SparkR in R
Documents
• If you have to use RDD, refer to the AMPLab GitHub: http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/ and use ":::", e.g. SparkR:::textFile, SparkR:::lapply
• Otherwise, refer to the official SparkR documents: https://spark.apache.org/docs/2.0.0/api/R/index.html
Starting to Use SparkR (v1.6.2)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize SparkContext (sc)
sc <- sparkR.init(appName = "Demo_SparkR")
# Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
Starting to Use SparkR (v2.0.0)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize SparkSession (sparkR.session() replaces sparkR.init() since 2.0.0)
sc <- sparkR.session(appName = "Demo_SparkR")
# Initialize SQLContext (no longer needed since 2.0.0)
# sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
DataFrames
# Load the flights CSV file using read.df
sdf <- read.df(sqlContext,"data_flights.csv",
"com.databricks.spark.csv", header = "true")
# Filter flights from JFK
jfk_flights <- filter(sdf, sdf$origin == "JFK")
# Group and aggregate flights to each destination
dest_flights <- summarize(
groupBy(jfk_flights, jfk_flights$dest),
count = n(jfk_flights$dest))
# Running SQL Queries
registerTempTable(sdf, "tempTable")
training <- sql(sqlContext,
"SELECT dest, count(dest) as cnt FROM tempTable
WHERE dest = 'JFK' GROUP BY dest")
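Both dest_flights and training are SparkDataFrames; nothing moves to the local R session until they are materialized, for example:
head(dest_flights)             # first rows as a local data.frame
local_res <- collect(training) # pull the full SQL result into R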
Word Count
# read data into RDD
rdd <- SparkR:::textFile(sc, "data_word_count.txt")
# split word
words <- SparkR:::flatMap(rdd, function(line) {
strsplit(line, " ")[[1]]
})
# map: give 1 for each word
wordCount <- SparkR:::lapply(words, function(word) {
list(word, 1)
})
# reduce: sum the 1s by key (word)
counts <- SparkR:::reduceByKey(wordCount, "+", 2)
# convert RDD to list
op <- SparkR:::collect(counts)
RDD vs. DataFrames
# DataFrames version (%>% comes from the magrittr package)
flights_SDF <- read.df(sqlContext, "data_flights.csv",
source = "com.databricks.spark.csv", header = "true")
SDF_op <- flights_SDF %>%
group_by(flights_SDF$hour) %>%
summarize(sum(flights_SDF$dep_delay)) %>%
collect()
# RDD version: parse the csv by hand and aggregate by key
flights_RDD <- SparkR:::textFile(sc, "data_flights.csv")
RDD_op <- flights_RDD %>%
SparkR:::filterRDD(function(x) { x >= 1 }) %>%
SparkR:::lapply(function(x) {
y1 <- as.numeric(unlist(strsplit(x, ","))[2])
y2 <- as.numeric(unlist(strsplit(x, ","))[6])
return(list(y1, y2)) }) %>%
SparkR:::reduceByKey(function(x, y) x + y, 1L) %>%
SparkR:::collect()
SparkR on MLlib
SparkR supports a subset of the available R formula
operators for model fitting, including ~, ., :, + and -,
e.g. y ~ x1 + x2
Generalized Linear Model, GLM
# read data and cache (assumes a predefined `schema` object)
flights_SDF <- read.df("data_flights.csv", source = "csv",
header = "true", schema) %>% cache()
# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")
# split train/test dataset
train <- sample(flights_SDF_2, withReplacement = FALSE,
fraction = 0.5, seed = 42)
test <- except(flights_SDF_2, train)
# build the model
gaussianGLM <- spark.glm(train, arr_delay ~ dep_delay + dist,
family = "gaussian")
summary(gaussianGLM)
# prediction
preds <- predict(gaussianGLM, test)
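A possible follow-up, not shown in the deck: collect the predictions and compute an RMSE on the held-out set (assumes arr_delay is numeric in the schema):
local_preds <- collect(select(preds, "arr_delay", "prediction"))
rmse <- sqrt(mean((local_preds$arr_delay - local_preds$prediction)^2,
                  na.rm = TRUE))
print(rmse)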
K-means
# read data and cache (assumes the same `schema` as the GLM slide)
flights_SDF <- read.df("data_flights.csv", source = "csv",
header = "true", schema) %>% cache()
# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")
# clustering
kmeansModel <- spark.kmeans(flights_SDF_2, ~ arr_delay + dep_delay +
dist + flight + dest + cancelled + time, k = 15)
summary(kmeansModel)
cluster_op <- fitted(kmeansModel)
# clustering result
kmeansPredictions <- predict(kmeansModel, flights_SDF_2)
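To see how the rows spread across the k = 15 clusters, one option (not in the deck) is to group by the prediction column:
cluster_sizes <- count(groupBy(kmeansPredictions, "prediction"))
showDF(cluster_sizes)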
Use Case
Median (approxQuantile)
gdf <- seq(1,10,1) %>% data.frame()
colnames(gdf) <- "seq"
sdf <- createDataFrame(gdf)
median_val <- approxQuantile(sdf, "seq", 0.5, 0) %>% print()
Calculating the median with a plain SQL query is much more complicated:
http://www.1keydata.com/tw/sql/sql-median.html
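Note that approxQuantile() also accepts a vector of probabilities and a relative-error budget; a nonzero error lets Spark trade exactness for speed on large data:
# quartiles with a 1% relative-error budget (sketch)
quartiles <- approxQuantile(sdf, "seq", c(0.25, 0.5, 0.75), 0.01)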
ID Match
##### method 1 : like dplyr + pipeline
join_id_m1 <- join(sdf_1, sdf_2,
sdf_1$id1 == sdf_2$id2, "inner") %>%
select("id2") %>%
collect()
##### method 2 : sql query
createOrReplaceTempView(sdf_1, "table1")
createOrReplaceTempView(sdf_2, "table2")
qry_str <- "SELECT table2.id2 FROM table1
JOIN table2 ON table1.id1 = table2.id2"
join_id_m2 <- sql(qry_str)
##### method 3 : SparkR function
join_id_m3 <- intersect(sdf_1, sdf_2) %>%
collect()
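A hypothetical setup so the three methods are runnable; note that intersect() matches columns positionally, so both inputs should be single-column tables of the same type:
sdf_1 <- createDataFrame(data.frame(id1 = c(1, 2, 3, 4)))
sdf_2 <- createDataFrame(data.frame(id2 = c(3, 4, 5)))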
Play Pokemon Go Data
with SparkR !!
Application on SparkR
Compute Engine + Web Framework + Interactive Maps
Where is the Dragonite nest?
Port 8080: cluster monitor (capacity of each worker)
Port 4040: jobs monitor (status of each worker, advanced performance details);
cache(SparkDataFrame) means a long run time the first time, fast runs afterwards
Some Tricks
• Customize the Spark config at launch (see the sketch below)
• cache()
• Some code can’t run in RStudio; try the terminal instead
• Third-party packages, like the Databricks csv-reader package
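For the first trick, a minimal sketch using the 2.0 API (the sparkConfig argument takes standard Spark properties; the values here are arbitrary examples):
sparkR.session(appName = "Demo_SparkR",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.executor.memory = "2g"))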
The Future of SparkR
• More MLlib APIs
• Advanced user-defined functions
• The sparklyr package from RStudio
Reference
• SparkR: Scaling R Programs with Spark. Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016, June 2016. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
• SparkR: Interactive R Programs at Scale. Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco. https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
• Apache Spark Official Research: http://spark.apache.org/research.html
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Apache Spark Official Documentation: http://spark.apache.org/docs/latest/api/scala/
• AMPLab UC Berkeley, SparkR Project: https://github.com/amplab-extras/SparkR-pkg
• Databricks Official Blog: https://databricks.com/blog/category/engineering/spark
• R-bloggers: Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio: https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
RStudio in Amazon EC2
Join Us
• Fansboard
• Web Designer (PHP & JavaScript)
• Editor w/ Facebook & Instagram
• Vpon - Data Scientist
• Taiwan Spark User Group
• Taiwan R User Group
Thanks for your attention
& Taiwan Spark User Group
& Vpon Data Team