This document provides an overview of using sparklyr to perform data science tasks with Apache Spark. It shows how sparklyr lets users import data, transform it with dplyr verbs and Spark SQL, build and evaluate models with Spark MLlib and H2O, and visualize results. It also covers connecting to different Spark cluster configurations and tuning Spark settings, and it closes with an example workflow that preprocesses, models, and visualizes the iris dataset using sparklyr functions.
Data Science in Spark with sparklyr Cheat Sheet

Intro

sparklyr is an R interface for Apache Spark™. It provides a complete dplyr back end and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.
Data Science Toolchain with Spark + sparklyr

• Import
  • Export an R DataFrame
  • Read a file
  • Read an existing Hive table
• Tidy
  • dplyr verbs
  • Direct Spark SQL (DBI)
  • SDF functions (Scala API)
• Understand
  • Transform: transformer functions
  • Model: Spark MLlib or the H2O extension
  • Visualize: collect data into R for plotting
• Communicate
  • Collect data into R
  • Share plots, documents, and apps

(Toolchain adapted from R for Data Science, Grolemund & Wickham.)
Getting started

On a YARN Managed Cluster

1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark Home directory; it is normally "/usr/lib/spark".
3. Open a connection:

spark_connect(master = "yarn-client", version = "1.6.2",
              spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster

1. Install RStudio Server or RStudio Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:

spark_connect(master = "[mesos URL]", version = "1.6.2",
              spark_home = [Cluster's Spark path])
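For example, a minimal YARN connection might look like the sketch below; the Spark home path and version are placeholders, so substitute your cluster's values:

library(sparklyr)
sc <- spark_connect(master = "yarn-client",
                    version = "1.6.2",
                    spark_home = "/usr/lib/spark")  # hypothetical path; use your cluster's Spark Home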
RStudio Integrates with sparklyr

Starting with version 1.044, RStudio Desktop, Server, and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE, and from a connection you can disconnect, open the Spark UI, browse Spark & Hive tables, open the connection log, and preview 1K rows of a table.
Using sparklyr

A brief example of a data analysis using Apache Spark, R, and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                               # Install Spark locally
sc <- spark_connect(master = "local")                # Connect to local version

import_iris <- copy_to(sc, iris, "spark_iris",       # Copy data to Spark memory
                       overwrite = TRUE)

partition_iris <- sdf_partition(                     # Partition data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                         # Create a Hive metadata entry for each partition
             c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%      # Create reference to the Spark table
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                          # Spark ML decision tree model
  ml_decision_tree(response = "Species",
                   features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")

pred_iris <- sdf_predict(model_iris, test_iris) %>%  # Bring data back into R memory for plotting
  collect()

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)
Tuning Spark

Example Configuration

config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (with defaults)

• spark.yarn.am.cores
• spark.yarn.am.memory               512m
• spark.executor.heartbeatInterval   10s
• spark.network.timeout              120s
• spark.executor.memory              1g
• spark.executor.cores               1
• spark.executor.extraJavaOptions
• spark.executor.instances
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory
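As a sketch, the sparklyr.shell options are set the same way through spark_config(); the "8G" values here are illustrative, not recommendations:

config <- spark_config()
config[["sparklyr.shell.driver-memory"]] <- "8G"    # memory for the driver process
config[["sparklyr.shell.executor-memory"]] <- "8G"  # memory per executor
sc <- spark_connect(master = "yarn-client", config = config)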
Cluster Deployment Options

• Managed Cluster: the driver node submits work through a cluster manager (YARN or Mesos) to the worker nodes.
• Stand Alone Cluster: the driver node connects to the worker nodes through Spark's stand-alone cluster manager.
On a Spark Standalone Cluster

1. Install RStudio Server or RStudio Pro on one of the existing nodes or on a server in the same LAN.
2. Install a local version of Spark:

spark_install(version = "2.0.1")

3. Open a connection:

spark_connect(master = "spark://host:port", version = "2.0.1",
              spark_home = spark_home_dir())

Using Livy (Experimental)

1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:

sc <- spark_connect(master = "http://host:port", method = "livy")

Disconnect

spark_disconnect(sc)
Local Mode

Easy setup; no cluster required.

1. Install a local version of Spark:

spark_install("2.0.1")

2. Open a connection:

sc <- spark_connect(master = "local")
Visualize & Communicate

Download data to R memory:

dplyr::collect(x)
Downloads a Spark DataFrame to an R DataFrame.
r_table <- collect(my_table)
plot(Petal_Width ~ Petal_Length, data = r_table)

sdf_read_column(x, column)
Returns the contents of a single column to R.

Import

Copy a DataFrame into Spark:

sdf_copy_to(sc, x, name, memory, repartition, overwrite)
sdf_copy_to(sc, iris, "spark_iris")

DBI::dbWriteTable(conn, name, value)
DBI::dbWriteTable(sc, "spark_iris", iris)

From a table in Hive:

tbl_cache(sc, name, force = TRUE)
Loads the table into memory.
my_var <- tbl_cache(sc, name = "hive_iris")

dplyr::tbl(sc, ...)
Creates a reference to the table without loading it into memory.
my_var <- dplyr::tbl(sc, name = "hive_iris")
Model (MLlib)

• ml_als_factorization(x, rating.column = "rating", user.column = "user", item.column = "item", rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
• ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L, type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
• ml_generalized_linear_regression(x, response, features, intercept = TRUE, family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
• ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x), compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
• ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
• ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
• ml_multilayer_perceptron(x, response, features, layers, iter.max = 100, seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
• ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
• ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
• ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
• ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L, type = c("auto", "regression", "classification"), ml.options = ml_options())
• ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor", iter.max = 100L, ml.options = ml_options())
• ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
• ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
• ml_tree_feature_importance(sc, model)

e.g. ml_decision_tree(my_table, response = "Species",
                      features = c("Petal_Length", "Petal_Width"))
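A minimal sketch of fitting and evaluating one of these models on the partitions registered in the example above (the choice of random forest and its settings are illustrative):

train <- tbl(sc, "spark_iris_training")
test <- tbl(sc, "spark_iris_test")

fit <- ml_random_forest(train, response = "Species",
                        features = c("Petal_Length", "Petal_Width"),
                        num.trees = 20L)            # fit on the training partition

pred <- sdf_predict(fit, test)                      # score the held-out partition
ml_tree_feature_importance(sc, fit)                 # inspect feature importances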
Wrangle

Spark SQL via dplyr verbs

Translates into Spark SQL statements:

my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

Direct Spark SQL commands

DBI::dbGetQuery(conn, statement)

my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")

Scala API via SDF functions

• sdf_mutate(.data): works like the dplyr mutate function
• sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
  e.g. sdf_partition(x, training = 0.5, test = 0.5)
• sdf_register(x, name = NULL): gives a Spark DataFrame a table name
• sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
• sdf_sort(x, columns): sorts by one or more columns in ascending order
• sdf_with_unique_id(x, id = "id"): adds a unique ID column
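A small sketch combining these SDF helpers (assuming the spark_iris table registered in the example above):

tbl(sc, "spark_iris") %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%  # keep a 10% sample
  sdf_sort(c("Petal_Length", "Petal_Width")) %>%                  # ascending sort
  sdf_with_unique_id(id = "row_id")                               # add a unique ID column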
ML Transformers

• ft_binarizer(threshold = 0.5): assigned values based on threshold
• ft_bucketizer(splits): continuous to binned categorical values
• ft_discrete_cosine_transform(inverse = FALSE): time domain to frequency domain
• ft_elementwise_product(scaling.col): element-wise product between 2 columns
• ft_index_to_string(): index labels back to labels as strings
• ft_one_hot_encoder(): continuous to binary vectors
• ft_quantile_discretizer(n.buckets = 5L): numeric column to discretized column
• ft_sql_transformer(sql)
• ft_string_indexer(params = NULL): column of labels into a column of label indices
• ft_vector_assembler(): combines vectors into a single row-vector

Arguments that apply to all functions: x, input.col = NULL, output.col = NULL

e.g. ft_binarizer(my_table, input.col = "Petal_Length",
                  output.col = "petal_large", threshold = 1.2)
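A sketch of chaining transformers with pipes (assuming the spark_iris table from the example; the bucket splits and output column names are illustrative):

tbl(sc, "spark_iris") %>%
  ft_bucketizer(input.col = "Petal_Length", output.col = "petal_bin",
                splits = c(0, 2, 4, Inf)) %>%       # bin continuous lengths
  ft_string_indexer(input.col = "Species",
                    output.col = "species_idx")     # labels to label indices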
Reading & Writing from Apache Spark

• From R into Spark: sdf_copy_to, dplyr::copy_to, DBI::dbWriteTable
• From the file system into Spark: spark_read_<fmt>
• From Spark back into R: dplyr::collect, sdf_collect, sdf_read_column (downloads a Spark DataFrame to an R DataFrame)
• From Spark to the file system: spark_write_<fmt>
• From a Hive table into Spark memory: tbl_cache, dplyr::tbl
Extensions

Create an R package that calls the full Spark API and provides interfaces to Spark packages.

Core Types

• spark_connection(): connection between R and the Spark shell process
• spark_jobj(): instance of a remote Spark object
• spark_dataframe(): instance of a remote Spark DataFrame object

Call Spark from R

• invoke(): call a method on a Java object
• invoke_new(): create a new object by invoking a constructor
• invoke_static(): call a static method on an object
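A minimal sketch of the invoke API (assumes an open connection sc and the spark_iris table from the example):

tbl(sc, "spark_iris") %>%
  spark_dataframe() %>%       # get the underlying remote Spark DataFrame object
  invoke("count")             # call its count() method on the JVM

# Call a static method on a JVM class
invoke_static(sc, "java.lang.Math", "hypot", 3, 4)  # returns 5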
Save from Spark to the File System

• CSV: spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
• JSON: spark_write_json(mode = NULL)
• PARQUET: spark_write_parquet(mode = NULL)

Arguments that apply to all functions: x, path
Import into Spark from a File

• CSV: spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
• JSON: spark_read_json()
• PARQUET: spark_read_parquet()

Arguments that apply to all functions: sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE
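For example, a hypothetical round trip through the file system (the file paths and table name are placeholders):

iris_csv <- spark_read_csv(sc, name = "iris_csv",
                           path = "data/iris.csv",          # hypothetical input file
                           header = TRUE, infer_schema = TRUE)
spark_write_parquet(iris_csv, path = "data/iris_parquet")   # hypothetical output dir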
sdf_predict(object, newdata)
Returns a Spark DataFrame with predicted values.
Machine Learning Extensions

• ml_create_dummy_variables()
• ml_model()
• ml_prepare_dataframe()
• ml_prepare_response_features_intercept()
• ml_options()

RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • package version 0.5 • Updated: 12/21/16