Data Science in Spark
with sparklyr
Cheat Sheet
Intro
sparklyr is an R interface for
Apache Spark™. It provides a
complete dplyr backend and
the option to query directly
using Spark SQL statements. With sparklyr, you
can orchestrate distributed machine learning
using either Spark's MLlib or H2O Sparkling
Water.
www.rstudio.com
sparklyr
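As a minimal sketch of the two query paths the intro describes, the same filter can be written with dplyr verbs or as a direct Spark SQL statement (assumes a local Spark installation; the table name is illustrative):

```r
library(sparklyr); library(dplyr)

sc <- spark_connect(master = "local")        # local Spark instance
tbl_iris <- copy_to(sc, iris, "spark_iris")  # copy data to Spark

# dplyr backend: verbs are translated into Spark SQL
tbl_iris %>% filter(Species == "setosa") %>% head(5)

# Direct Spark SQL via DBI
DBI::dbGetQuery(sc, "SELECT * FROM spark_iris WHERE Species = 'setosa' LIMIT 5")

spark_disconnect(sc)
```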
Data Science Toolchain with Spark + sparklyr
Import
• Export an R DataFrame
• Read a file
• Read an existing Hive table
Tidy
• dplyr verbs
• Direct Spark SQL (DBI)
• SDF functions (Scala API)
Transform
• Transformer functions
Model
• Spark MLlib
• H2O Extension
Visualize
• Collect data into R for plotting
Communicate
• Collect data into R
• Share plots, documents, and apps
R for Data Science, Grolemund & Wickham
Getting started
On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one
of the existing nodes, preferably an edge node
2. Locate the path to the cluster's Spark home
directory; it is normally "/usr/lib/spark"
3. Open a connection:
spark_connect(master = "yarn-client",
  version = "1.6.2",
  spark_home = [Cluster's Spark path])
On a Mesos Managed Cluster
1. Install RStudio Server or Pro on one of the
existing nodes
2. Locate the path to the cluster's Spark directory
3. Open a connection:
spark_connect(master = "[mesos URL]",
  version = "1.6.2",
  spark_home = [Cluster's Spark path])
RStudio Integrates with sparklyr
Starting with version 1.044, RStudio Desktop,
Server and Pro include integrated support for
the sparklyr package. You can create and
manage connections to Spark clusters and local
Spark instances from inside the IDE. The
Connections pane lets you disconnect, open the
Spark UI, view Spark & Hive tables, open the
connection log, and preview 1K rows of a table.
Using sparklyr
A brief example of a data analysis using
Apache Spark, R and sparklyr in local mode.

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                 # Install Spark locally
sc <- spark_connect(master = "local")  # Connect to local version

import_iris <- copy_to(sc, iris, "spark_iris",   # Copy data to Spark memory
  overwrite = TRUE)

partition_iris <- sdf_partition(                 # Partition data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                     # Create Hive metadata for each partition
  c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%  # Create reference to Spark table
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                      # Spark ML Decision Tree model
  ml_decision_tree(response = "Species",
    features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")

pred_iris <- sdf_predict(model_iris, test_iris) %>%
  collect()

pred_iris %>%                                    # Bring data back into R memory for plotting
  inner_join(data.frame(prediction = 0:2,
    lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)
Tuning Spark
Example Configuration
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
  config = config, version = "2.0.1")

Important Tuning Parameters (with defaults)
• spark.yarn.am.cores
• spark.yarn.am.memory: 512m
• spark.executor.heartbeatInterval: 10s
• spark.network.timeout: 120s
• spark.executor.memory: 1g
• spark.executor.cores: 1
• spark.executor.extraJavaOptions
• spark.executor.instances
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory
Cluster Deployment Options
• Managed cluster: the driver node submits work to the
worker nodes through a cluster manager (YARN or Mesos).
• Stand-alone cluster: the driver node connects directly to
the worker nodes using Spark's stand-alone cluster manager.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more at spark.rstudio.com • package version 0.5 • Updated: 12/21/16
On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on
one of the existing nodes or a server in the
same LAN
2. Install a local version of Spark:
spark_install(version = "2.0.1")
3. Open a connection:
spark_connect(master = "spark://host:port",
  version = "2.0.1",
  spark_home = spark_home_dir())
Using Livy (Experimental)
1. The Livy REST application should be running
on the cluster
2. Connect to the cluster:
sc <- spark_connect(master = "http://host:port",
  method = "livy")
Local Mode
Easy setup; no cluster required
1. Install a local version of Spark:
spark_install("2.0.1")
2. Open a connection:
sc <- spark_connect(master = "local")
Visualize & Communicate
Download data to R memory:
dplyr::collect(x)
  r_table <- collect(my_table)
  plot(Petal_Width ~ Petal_Length, data = r_table)
sdf_read_column(x, column)
Returns the contents of a single column to R

From a table in Hive:
tbl_cache(sc, name, force = TRUE)
Loads the table into memory
  my_var <- tbl_cache(sc, name = "hive_iris")
dplyr::tbl(sc, ...)
Creates a reference to the table
without loading it into memory
  my_var <- dplyr::tbl(sc, name = "hive_iris")
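As a short sketch of pulling a single column back to R (assumes a Spark table reference `my_table` that has a Petal_Length column, such as the iris table used elsewhere on this sheet):

```r
petal_lengths <- sdf_read_column(my_table, "Petal_Length")  # returns an R vector
summary(petal_lengths)
```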
Import
Copy a DataFrame into Spark:
sdf_copy_to(sc, x, name, memory, repartition,
  overwrite)
  sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(conn, name, value)
  DBI::dbWriteTable(sc, "spark_iris", iris)
Model (MLlib)
ml_als_factorization(x, rating.column = "rating", user.column =
"user", item.column = "item", rank = 10L, regularization.parameter =
0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth
= 5L, type = c("auto", "regression", "classification"), ml.options =
ml_options())
ml_generalized_linear_regression(x, response, features,
intercept = TRUE, family = gaussian(link = "identity"), iter.max =
100L, ml.options = ml_options())
Same options for: ml_gradient_boosted_trees
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha =
(50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE,
alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max
= 100, seed = sample(.Machine$integer.max, 1), ml.options =
ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options =
ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options =
ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L,
max.depth = 5L, num.trees = 20L, type = c("auto", "regression",
"classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept =
TRUE,censor = "censor", iter.max = 100L, ml.options =
ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label,
score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl,
metric = "f1")
ml_tree_feature_importance(sc, model)
ml_decision_tree(my_table, response = "Species",
  features = c("Petal_Length", "Petal_Width"))
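A hedged sketch of evaluating a fitted classifier with the helpers above (assumes an open connection `sc`, a fitted `ml_decision_tree` model `model_iris`, and an uncollected `sdf_predict` result `pred_iris` like those in the local-mode example; the label and prediction column names are illustrative):

```r
# Multiclass F1 on the scored Spark table (not collected to R)
ml_classification_eval(pred_iris, label = "label",
  predicted_lbl = "prediction", metric = "f1")

# Relative importance of each feature in the tree model
ml_tree_feature_importance(sc, model_iris)
```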
Wrangle
Spark SQL via dplyr verbs
Translates into Spark SQL statements
my_table <- my_var %>%
filter(Species=="setosa") %>%
sample_n(10)
my_table <- DBI::dbGetQuery(sc,
  "SELECT * FROM iris LIMIT 10")
Direct Spark SQL commands
DBI::dbGetQuery(conn, statement)
Scala API via SDF functions
sdf_mutate(.data)
Works like dplyr mutate function
sdf_partition(x, ..., weights = NULL, seed
= sample (.Machine$integer.max, 1))
sdf_partition(x, training = 0.5, test = 0.5)
sdf_register(x, name = NULL)
Gives a Spark DataFrame a table name
sdf_sample(x, fraction = 1, replacement =
TRUE, seed = NULL)
sdf_sort(x, columns)
Sorts by >=1 columns in ascending order
sdf_with_unique_id(x, id = "id")
Add unique ID column
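A short sketch combining several of the SDF functions above (assumes a Spark table reference `my_table`, for example from `copy_to(sc, iris, "spark_iris")`; the result name is illustrative):

```r
sampled <- my_table %>%
  sdf_sample(fraction = 0.5, replacement = FALSE, seed = 100) %>%  # down-sample rows
  sdf_sort(c("Species", "Petal_Length")) %>%                       # ascending sort
  sdf_with_unique_id(id = "row_id")                                # add a unique ID column

sdf_register(sampled, "spark_iris_sample")  # give the result a table name
```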
ML Transformers
ft_binarizer(threshold = 0.5)
  Assigned values based on threshold
ft_bucketizer(splits)
  Numeric column to discretized column
ft_discrete_cosine_transform(inverse = FALSE)
  Time domain to frequency domain
ft_elementwise_product(scaling.col)
  Element-wise product between 2 columns
ft_index_to_string()
  Index labels back to labels as strings
ft_one_hot_encoder()
  Continuous to binary vectors
ft_quantile_discretizer(n.buckets = 5L)
  Continuous to binned categorical values
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)
  Column of labels into a column of label indices
ft_vector_assembler()
  Combine vectors into a single row-vector

Arguments that apply to all functions:
x, input.col = NULL, output.col = NULL

ft_binarizer(my_table, input.col = "Petal_Length",
  output.col = "petal_large", threshold = 1.2)
Reading & Writing from Apache Spark
• From the file system into Spark: spark_read_<fmt>
• From Spark to the file system: spark_write_<fmt>
• From R into Spark: sdf_copy_to, dplyr::copy_to,
DBI::dbWriteTable
• From Spark into R (download a Spark DataFrame to
an R DataFrame): dplyr::collect, sdf_collect,
sdf_read_column
• Reference or cache a Spark table: dplyr::tbl, tbl_cache
Extensions
Create an R package that calls the full Spark API and
provides interfaces to Spark packages.

Core Types
spark_connection(): Connection between R and the
Spark shell process
spark_jobj(): Instance of a remote Spark object
spark_dataframe(): Instance of a remote Spark
DataFrame object

Call Spark from R
invoke(): Call a method on a Java object
invoke_new(): Create a new object by invoking a
constructor
invoke_static(): Call a static method on an object
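As a hedged sketch of calling the Spark API directly through the invoke family (assumes an open connection `sc` and a Spark table reference `my_table`; `count`, `BigInteger`, and `Math.abs` are standard JVM methods used for illustration):

```r
sdf <- spark_dataframe(my_table)  # underlying remote DataFrame object
invoke(sdf, "count")              # call DataFrame.count() on the JVM side

# Create a JVM object and call a static method
obj <- invoke_new(sc, "java.math.BigInteger", "1000000")
invoke_static(sc, "java.lang.Math", "abs", -42L)
```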
Save from Spark to File System
Arguments that apply to all functions: x, path
CSV: spark_write_csv(header = TRUE,
  delimiter = ",", quote = "\"", escape = "\\",
  charset = "UTF-8", null_value = NULL)
JSON: spark_write_json(mode = NULL)
PARQUET: spark_write_parquet(mode = NULL)
Import into Spark from a File
Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0,
memory = TRUE, overwrite = TRUE
CSV: spark_read_csv(header = TRUE,
  columns = NULL, infer_schema = TRUE,
  delimiter = ",", quote = "\"", escape = "\\",
  charset = "UTF-8", null_value = NULL)
JSON: spark_read_json()
PARQUET: spark_read_parquet()
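A minimal round-trip sketch of the read/write functions (assumes an open connection `sc` and a writable path; the file path and table names are illustrative):

```r
# Write the iris table out as Parquet, then read it back
tbl_iris <- sdf_copy_to(sc, iris, "spark_iris", overwrite = TRUE)
spark_write_parquet(tbl_iris, path = "/tmp/iris_parquet")

iris_back <- spark_read_parquet(sc, name = "iris_back",
  path = "/tmp/iris_parquet")
dplyr::count(iris_back)  # still a Spark table; collect() to bring it into R
```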
sdf_predict(object, newdata)
Spark DataFrame with predicted values
Machine Learning Extensions
ml_create_dummy_variables()
ml_model()ml_prepare_dataframe()
ml_prepare_response_features_intercept()
ml_options()
ML Transformers

More Related Content

What's hot

Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
Tim Essam
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
Laura Hughes
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
Rsquared Academy
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
Rsquared Academy
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
Rsquared Academy
 
Hive function-cheat-sheet
Hive function-cheat-sheetHive function-cheat-sheet
Hive function-cheat-sheet
Dr. Volkan OBAN
 
Language R
Language RLanguage R
Language R
Girish Khanzode
 
R programming language
R programming languageR programming language
R programming language
Alberto Minetti
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysis
Tim Essam
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
Sakthi Dasans
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
Alberto Labarga
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?
Lucas Witold Adamus
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
Rsquared Academy
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In R
Rsquared Academy
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
Sankhya_Analytics
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
Julian Hyde
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in REshwar Sai
 
Data handling in r
Data handling in rData handling in r
Data handling in r
Abhik Seal
 

What's hot (20)

Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
Hive function-cheat-sheet
Hive function-cheat-sheetHive function-cheat-sheet
Hive function-cheat-sheet
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
Language R
Language RLanguage R
Language R
 
R programming language
R programming languageR programming language
R programming language
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysis
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In R
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in R
 
Data handling in r
Data handling in rData handling in r
Data handling in r
 

Similar to Sparklyr

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using spark
Ran Silberman
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
Snehal Nagmote
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
de:code 2017
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
WalmirCouto3
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
felixcss
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
Thành Nguyễn
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 

Similar to Sparklyr (20)

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using spark
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 

More from Dieudonne Nahigombeye

Rstudio ide-cheatsheet
Rstudio ide-cheatsheetRstudio ide-cheatsheet
Rstudio ide-cheatsheet
Dieudonne Nahigombeye
 
Reg ex cheatsheet
Reg ex cheatsheetReg ex cheatsheet
Reg ex cheatsheet
Dieudonne Nahigombeye
 
How big-is-your-graph
How big-is-your-graphHow big-is-your-graph
How big-is-your-graph
Dieudonne Nahigombeye
 
Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1
Dieudonne Nahigombeye
 
Eurostat cheatsheet
Eurostat cheatsheetEurostat cheatsheet
Eurostat cheatsheet
Dieudonne Nahigombeye
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
Dieudonne Nahigombeye
 
Base r
Base rBase r
Advanced r
Advanced rAdvanced r

More from Dieudonne Nahigombeye (8)

Rstudio ide-cheatsheet
Rstudio ide-cheatsheetRstudio ide-cheatsheet
Rstudio ide-cheatsheet
 
Reg ex cheatsheet
Reg ex cheatsheetReg ex cheatsheet
Reg ex cheatsheet
 
How big-is-your-graph
How big-is-your-graphHow big-is-your-graph
How big-is-your-graph
 
Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1
 
Eurostat cheatsheet
Eurostat cheatsheetEurostat cheatsheet
Eurostat cheatsheet
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Base r
Base rBase r
Base r
 
Advanced r
Advanced rAdvanced r
Advanced r
 

Recently uploaded

一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 

Recently uploaded (20)

一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 

sparklyr

Data Science in Spark with sparklyr Cheat Sheet

Intro

sparklyr is an R interface for Apache Spark™. It provides a complete dplyr back end and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.
www.rstudio.com

Data Science Toolchain with Spark + sparklyr

Import
• Export an R DataFrame
• Read a file
• Read an existing Hive table

Tidy
• dplyr verbs
• Direct Spark SQL (DBI)
• SDF functions (Scala API)

Transform
• Transformer functions

Model
• Spark MLlib
• H2O Extension

Visualize
• Collect data into R for plotting

Communicate
• Collect data into R
• Share plots, documents, and apps

(Toolchain adapted from R for Data Science, Grolemund & Wickham.)

Getting Started

On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark home directory; it is normally "/usr/lib/spark".
3. Open a connection:
   spark_connect(master = "yarn-client", version = "1.6.2",
                 spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
   spark_connect(master = "[mesos URL]", version = "1.6.2",
                 spark_home = [Cluster's Spark path])

RStudio Integrates with sparklyr
Starting with version 1.044, RStudio Desktop, Server, and Pro include integrated support for the sparklyr package. From inside the IDE you can create and manage connections to Spark clusters and local Spark instances, open the connection log, open the Spark UI, preview 1K rows of Spark and Hive tables, and disconnect.
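The YARN steps above can be sketched as a short session. The spark_home path and Spark version here are illustrative assumptions; substitute your cluster's actual values.

```r
library(sparklyr)

# Hypothetical values -- replace with the Spark home and version
# actually installed on your cluster.
sc <- spark_connect(
  master     = "yarn-client",
  version    = "1.6.2",
  spark_home = "/usr/lib/spark"  # the typical default noted above
)

# ... work with the cluster ...

spark_disconnect(sc)
```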
Using sparklyr
A brief example of a data analysis using Apache Spark, R, and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

# Install Spark locally and connect to the local version
spark_install("2.0.1")
sc <- spark_connect(master = "local")

# Copy data to Spark memory
import_iris <- copy_to(sc, iris, "spark_iris", overwrite = TRUE)

# Partition data
partition_iris <- sdf_partition(import_iris, training = 0.5, testing = 0.5)

# Create Hive metadata for each partition
sdf_register(partition_iris, c("spark_iris_training", "spark_iris_test"))

# Create a reference to the Spark table
tidy_iris <- tbl(sc, "spark_iris_training") %>%
  select(Species, Petal_Length, Petal_Width)

# Spark ML decision tree model
model_iris <- tidy_iris %>%
  ml_decision_tree(response = "Species",
                   features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")

# Bring data back into R memory for plotting
pred_iris <- sdf_predict(model_iris, test_iris) %>%
  collect

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)

Cluster Deployment Options
• Managed cluster: driver node and worker nodes coordinated by YARN or Mesos
• Stand-alone cluster: driver node and worker nodes coordinated by Spark's own cluster manager

Tuning Spark

Example Configuration
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (with defaults)
• spark.yarn.am.cores
• spark.yarn.am.memory               512m
• spark.network.timeout              120s
• spark.executor.heartbeatInterval   10s
• spark.executor.memory              1g
• spark.executor.cores               1
• spark.executor.extraJavaOptions
• spark.executor.instances
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory

RStudio® is a trademark of RStudio, Inc.
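The same spark_config() mechanism also covers the sparklyr.shell.* parameters in the list above. A minimal sketch follows; the 2G/4G values are illustrative, not recommendations.

```r
library(sparklyr)

config <- spark_config()

# Executor settings (values are illustrative)
config$spark.executor.cores  <- 2
config$spark.executor.memory <- "4G"

# sparklyr shell parameters need backticks because of the dashes
config$`sparklyr.shell.driver-memory`   <- "2G"
config$`sparklyr.shell.executor-memory` <- "2G"

sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")
```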
• CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • package version 0.5 • Updated: 12/21/16

On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, or on a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1",
                 spark_home = spark_home_dir())

Using Livy (Experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
   sc <- spark_connect(master = "http://host:port", method = "livy")

Local Mode
Easy setup; no cluster required.
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection:
   sc <- spark_connect(master = "local")
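Local mode is enough to try the whole workflow end to end. A minimal round-trip sketch, assuming a local Spark install; the table name spark_mtcars is arbitrary:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a built-in dataset into Spark
mtcars_tbl <- copy_to(sc, mtcars, "spark_mtcars", overwrite = TRUE)

# Filter and summarise in Spark, then collect the result back to R
mtcars_tbl %>%
  filter(cyl == 6) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

spark_disconnect(sc)
```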
Visualize & Communicate

Download data to R memory:
dplyr::collect(x)
  r_table <- collect(my_table)
  plot(Petal_Width ~ Petal_Length, data = r_table)
sdf_read_column(x, column)
  Returns the contents of a single column to R.

Import

Copy a DataFrame into Spark:
sdf_copy_to(sc, x, name, memory, repartition, overwrite)
  sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(conn, name, value)
  DBI::dbWriteTable(sc, "spark_iris", iris)

From a table in Hive:
tbl_cache(sc, name, force = TRUE)
  Loads the table into memory.
  my_var <- tbl_cache(sc, name = "hive_iris")
dplyr::tbl(sc, ...)
  Creates a reference to the table without loading it into memory.
  my_var <- dplyr::tbl(sc, name = "hive_iris")

Wrangle

Spark SQL via dplyr verbs
  Translates into Spark SQL statements.
  my_table <- my_var %>%
    filter(Species == "setosa") %>%
    sample_n(10)

Direct Spark SQL commands
  DBI::dbGetQuery(conn, statement)
  my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")

Scala API via SDF functions
  sdf_mutate(.data)
    Works like the dplyr mutate function.
  sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
    sdf_partition(x, training = 0.5, test = 0.5)
  sdf_register(x, name = NULL)
    Gives a Spark DataFrame a table name.
  sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
  sdf_sort(x, columns)
    Sorts by one or more columns in ascending order.
  sdf_with_unique_id(x, id = "id")
    Adds a unique ID column.

ML Transformers
Arguments that apply to all functions: x, input.col = NULL, output.col = NULL
  ft_binarizer(threshold = 0.5)
    Assigns binary values based on a threshold.
    ft_binarizer(my_table, input.col = "Petal_Length",
                 output.col = "petal_large", threshold = 1.2)
  ft_bucketizer(splits)
    Continuous to binned categorical values.
  ft_discrete_cosine_transform(inverse = FALSE)
    Time domain to frequency domain.
  ft_elementwise_product(scaling.col)
    Element-wise product between two columns.
  ft_index_to_string()
    Index labels back to labels as strings.
  ft_one_hot_encoder()
    Continuous to binary vectors.
  ft_quantile_discretizer(n.buckets = 5L)
    Numeric column to discretized column.
  ft_sql_transformer(sql)
  ft_string_indexer(params = NULL)
    Column of labels into a column of label indices.
  ft_vector_assembler()
    Combines vectors into a single row-vector.

Model (MLlib)

ml_decision_tree(my_table, response = "Species",
                 features = c("Petal_Length", "Petal_Width"))

ml_als_factorization(x, rating.column = "rating", user.column = "user",
  item.column = "item", rank = 10L, regularization.parameter = 0.1,
  iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L,
  type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
ml_generalized_linear_regression(x, response, features, intercept = TRUE,
  family = gaussian(link = "identity"), iter.max = 100L,
  ml.options = ml_options())
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
  compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features),
  alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0,
  lambda = 0, iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max = 100,
  seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L,
  num.trees = 20L, type = c("auto", "regression", "classification"),
  ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE,
  censor = "censor", iter.max = 100L, ml.options = ml_options())

Model evaluation and utilities:
ml_binary_classification_eval(predicted_tbl_spark, label, score,
  metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl,
  metric = "f1")
ml_tree_feature_importance(sc, model)

Reading & Writing from Apache Spark
• Into Spark from the file system: spark_read_<fmt>
• From R into Spark: sdf_copy_to, DBI::dbWriteTable, tbl_cache, dplyr::tbl
• From Spark into R (download a Spark DataFrame to an R DataFrame): dplyr::collect, sdf_read_column
• From Spark to the file system: spark_write_<fmt>

Extensions
Create an R package that calls the full Spark API and provides interfaces to Spark packages.
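As a sketch of how an ML Transformer slots into a dplyr chain, here is ft_binarizer used in local mode; the column and table names follow the iris example used elsewhere on this sheet, and a working local Spark install is assumed.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "spark_iris", overwrite = TRUE)

# Binarize Petal_Length at the 1.2 threshold, then count each class
iris_tbl %>%
  ft_binarizer(input.col = "Petal_Length",
               output.col = "petal_large",
               threshold = 1.2) %>%
  group_by(petal_large) %>%
  tally() %>%
  collect()

spark_disconnect(sc)
```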
Core Types
spark_connection()
  Connection between R and the Spark shell process.
spark_jobj()
  Instance of a remote Spark object.
spark_dataframe()
  Instance of a remote Spark DataFrame object.

Call Spark from R
invoke()
  Call a method on a Java object.
invoke_new()
  Create a new object by invoking a constructor.
invoke_static()
  Call a static method on an object.

Machine Learning Extensions
ml_create_dummy_variables()
ml_model()
ml_prepare_dataframe()
ml_prepare_response_features_intercept()
ml_options()

Import into Spark from a File
Arguments that apply to all functions:
  sc, name, path, options = list(), repartition = 0, memory = TRUE,
  overwrite = TRUE
CSV:     spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE,
           delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8",
           null_value = NULL)
JSON:    spark_read_json()
PARQUET: spark_read_parquet()

Save from Spark to the File System
Arguments that apply to all functions: x, path
CSV:     spark_write_csv(header = TRUE, delimiter = ",", quote = "\"",
           escape = "\\", charset = "UTF-8", null_value = NULL)
JSON:    spark_write_json(mode = NULL)
PARQUET: spark_write_parquet(mode = NULL)

sdf_predict(object, newdata)
  Spark DataFrame with predicted values.
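A sketch of the lower-level invoke API: spark_dataframe() retrieves the remote Spark DataFrame object underlying a tbl, and invoke() calls one of its Java methods directly. Local mode and a working Spark install are assumed.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "spark_iris", overwrite = TRUE)

# spark_dataframe() returns the remote object (a spark_jobj);
# invoke() then calls the Java count() method on it
n <- iris_tbl %>%
  spark_dataframe() %>%
  invoke("count")

n  # should equal nrow(iris)

spark_disconnect(sc)
```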