SparkR: Enabling Interactive Data Science at Scale on Hadoop


SparkR enables interactive data science at scale on Hadoop by providing an R interface to Apache Spark. Key points:
- SparkR lets users manipulate distributed datasets (RDDs) with familiar R-style operations such as lapply, groupByKey, and reduceByKey.
- The local R shell drives a Java Spark context over JNI, and Spark executors on the workers launch R processes, so R functions process large datasets in parallel.
- Examples such as word count and logistic regression show how R code for data science can scale to big data on Spark.

SparkR: Enabling Interactive
Data Science at Scale on Hadoop
Shivaram Venkataraman
Zongheng Yang

Fast
Scalable
Flexible
Statistics
Plots
Packages

Storage: HDFS / HBase / Cassandra
Cluster Manager: Mesos / YARN
Data Processing: Spark, SparkR

Outline
SparkR API
Live Demo
Design Details

RDD
Parallel Collection
Transformations
map
filter
groupBy
…
Actions
count
collect
saveAsTextFile
…

Q: How can I use a loop to [...insert task here...]?
A: Don’t. Use one of the apply functions.
From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
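To make the advice concrete, here is a minimal base-R sketch comparing an explicit loop with the apply family. The variable names are illustrative only and nothing here depends on Spark.

```r
# Square each element, first with an explicit loop, then with apply-style calls.
xs <- 1:5

squares_loop <- numeric(length(xs))
for (i in seq_along(xs)) {
  squares_loop[i] <- xs[i]^2
}

# lapply returns a list; sapply simplifies the result to a vector.
squares_list <- lapply(xs, function(x) x^2)
squares_vec <- sapply(xs, function(x) x^2)
```

Because the function passed to lapply touches one element at a time, the same call shape can in principle be applied to each element of a distributed collection.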

R + RDD = RRDD
lapply
lapplyPartition
groupByKey
reduceByKey
sampleRDD
collect
cache
…
broadcast
includePackage
textFile
parallelize
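The roles of a few of these operations can be imitated locally. The sketch below is a base-R stand-in, not the SparkR API: a plain list of partitions plays the part of an RDD, and make_partitions is a hypothetical helper introduced only for illustration.

```r
# Local stand-in for a partitioned collection: a list of partitions,
# each of which is a list of elements. make_partitions is hypothetical.
make_partitions <- function(xs, num_partitions) {
  ids <- rep_len(seq_len(num_partitions), length(xs))
  lapply(seq_len(num_partitions), function(p) as.list(xs[ids == p]))
}

partitions <- make_partitions(1:10, 2)

# Element-wise transformation (the role lapply plays on an RDD).
doubled <- lapply(partitions, function(part) lapply(part, function(x) x * 2L))

# Partition-wise transformation (the role lapplyPartition plays):
# the function sees a whole partition and returns one value for it.
partition_sums <- lapply(partitions, function(part) Reduce(`+`, part))

# Gathering everything back into one local vector (the role collect plays).
collected <- unlist(doubled)
```

Working on whole partitions is useful when a per-element call would be too fine-grained, for example when each partition can be turned into a matrix first.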

Example: Word Count
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
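For readers without a cluster, the same steps can be traced on an in-memory character vector. This is a local base-R analog under the assumption that the final step sums the counts per word; the input sentences are made up, and tapply stands in for the distributed reduce.

```r
# Local, in-memory analog of the word count; no Spark cluster is involved.
lines <- c("to be or not to be", "to see or not to see")

# flatMap step: split every line into words and flatten the result.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# Map step: pair each word with a count of 1.
word_pairs <- lapply(words, function(word) list(word, 1L))

# Reduce-by-key step: group the counts by word and sum each group.
keys <- vapply(word_pairs, function(p) p[[1]], character(1))
values <- vapply(word_pairs, function(p) p[[2]], integer(1))
counts <- tapply(values, keys, sum)
```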

Minimize $\| Ax - b \|_2$
$x = (A^T A)^{-1} A^T b$
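A small base-R sketch of this least-squares solution follows. The data are synthetic and the four row blocks are an assumption made for illustration; the point is that $A^T A$ and $A^T b$ are sums over row blocks, so each block can be processed independently and the partial results added.

```r
set.seed(42)
n <- 100
A <- cbind(1, rnorm(n), rnorm(n))          # synthetic design matrix
x_true <- c(2, -1, 0.5)
b <- as.vector(A %*% x_true) + rnorm(n, sd = 0.01)

# Split the rows into four blocks, standing in for partitions.
blocks <- lapply(1:4, function(p) which(rep_len(1:4, n) == p))

# Each block contributes t(A_k) %*% A_k and t(A_k) %*% b_k; the sums give
# t(A) %*% A and t(A) %*% b for the full data.
AtA <- Reduce(`+`, lapply(blocks, function(idx) crossprod(A[idx, , drop = FALSE])))
Atb <- Reduce(`+`, lapply(blocks, function(idx) crossprod(A[idx, , drop = FALSE], b[idx])))

# Solve (A^T A) x = A^T b rather than forming the inverse explicitly.
x_hat <- as.vector(solve(AtA, Atb))
```

Solving the linear system directly is generally preferred over computing the inverse, although both follow the formula above.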

Dataflow
[Diagram: on the local machine, the R Spark Context talks to the Java Spark Context over JNI; the Java Spark Context connects to two Workers]

Dataflow
[Diagram: locally, the R Spark Context talks to the Java Spark Context over JNI; on each Worker, a Spark Executor execs an R process]

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

Pipelined RDD
words <- flatMap(lines, …)
wordCount <- lapply(words, …)
[Diagram: Spark Executor → exec → R → Spark Executor → exec → R]

Pipelined RDD
[Diagram: Spark Executor → exec → R → Spark Executor → exec → R, compared with Spark Executor → exec → R R → Spark Executor]
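One reading of the pipelined RDD slides is that consecutive R transformations are chained so a partition passes through a single R process once instead of being handed back to the executor between steps. The base-R sketch below illustrates only that idea; compose, split_words, and pair_with_one are hypothetical names, not SparkR functions.

```r
# Two per-partition steps, mirroring flatMap and lapply in the word count example.
split_words <- function(part) {
  unlist(lapply(part, function(line) strsplit(line, " ")[[1]]))
}
pair_with_one <- function(part) {
  lapply(part, function(word) list(word, 1L))
}

# compose is a hypothetical helper: it chains functions so that they run
# back to back on the same partition.
compose <- function(...) {
  fs <- list(...)
  function(part) Reduce(function(acc, f) f(acc), fs, part)
}

pipelined <- compose(split_words, pair_with_one)

partition <- list("a b", "c")
unpipelined_result <- pair_with_one(split_words(partition))  # two separate steps
pipelined_result <- pipelined(partition)                     # one chained call
```

Both paths produce the same pairs; the difference in a real deployment would be how many times data crosses the boundary between the executor and R.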

Alpha developer release
One-line install!

devtools::install_github("amplab-extras/SparkR-pkg", subdir="pkg")

SparkR Implementation
Very similar to PySpark
Spark is easy to extend
292 lines of Scala code
1694 lines of R code
549 lines of test code in R

EC2 setup scripts
All Spark examples
MNIST demo
Hadoop2, Maven build
Also on github

In the Roadmap
High level DataFrame API
Integrating Spark’s MLlib from R
Reading from SequenceFiles, HBase

SparkR
Combine scalability & utility
RDD → distributed lists
Serialize closure
Re-use R packages

SparkR
https://github.com/amplab-extras/SparkR-pkg
Shivaram Venkataraman shivaram@cs.berkeley.edu
Spark User mailing list user@spark.apache.org

Example: Logistic Regression
pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n = D, min = -1, max = 1)

# Logistic gradient
# Assumes each partition is a numeric matrix with the label in the first
# column and the D features in the remaining columns.
gradient <- function(partition) {
  Y <- partition[, 1]; X <- partition[, -1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
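The gradient function can be exercised locally with plain gradient descent. Everything in this sketch is an assumption made for illustration: the synthetic data, the labels in {-1, +1} stored in the first column, the two row blocks standing in for partitions, the step size of 0.01, and the 50 iterations. The weights are passed as an argument here rather than captured from the enclosing scope.

```r
set.seed(1)
n <- 200
D <- 2
X <- matrix(rnorm(n * D), ncol = D)
true_w <- c(2, -3)
Y <- ifelse(as.vector(X %*% true_w) > 0, 1, -1)   # labels in {-1, +1}
points <- cbind(Y, X)                             # label first, then features

# Two row blocks stand in for partitions of the data.
partitions <- list(points[1:100, ], points[101:200, ])

# Gradient of the logistic loss for one partition.
logistic_gradient <- function(partition, weights) {
  Y <- partition[, 1]
  X <- partition[, -1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

# Logistic loss over the full data set, used only to check progress.
loss <- function(weights) sum(log(1 + exp(-Y * as.vector(X %*% weights))))

weights <- c(0, 0)
initial_loss <- loss(weights)
for (i in 1:50) {
  # Compute one gradient per partition, then add them up.
  grads <- lapply(partitions, logistic_gradient, weights = weights)
  weights <- weights - 0.01 * as.vector(Reduce(`+`, grads))
}
final_loss <- loss(weights)
```

Since the full gradient is the sum of per-partition gradients, only a vector of length D has to travel back from each partition on every iteration.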

How does it work?
R Shell
rJava
Spark Context
Spark Executor Spark Executor
RScript RScript
Data: RDD[Array[Byte]]
Func: Array[Byte]
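The byte-array view of data and functions can be illustrated with base R's serialize and unserialize. This is a sketch of the general idea, assuming nothing about how SparkR itself encodes closures; make_scaler is a hypothetical helper.

```r
# A closure that captures a variable from its defining environment.
make_scaler <- function(factor) {
  force(factor)               # evaluate the argument before it is captured
  function(x) x * factor
}
scale_by_three <- make_scaler(3)

# serialize(..., connection = NULL) returns a raw vector, i.e. an array of bytes.
func_bytes <- serialize(scale_by_three, connection = NULL)
data_bytes <- serialize(list(1, 2, 3), connection = NULL)

# The receiving side restores both and applies the function to the data.
restored_func <- unserialize(func_bytes)
restored_data <- unserialize(data_bytes)
result <- lapply(restored_data, restored_func)
```

The captured variable travels with the function, which is why a worker-side R process can run the closure without access to the original session.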
