SparkR tutorial at PyData Carolinas - Processing large datasets in R has traditionally been limited by the amount of memory in the local system. To overcome this native R limitation, several cluster computing alternatives have recently emerged, including Apache Spark. In this session, we discussed the architecture of Spark and introduced the SparkR library. We worked through examples of the SparkDataFrame API in Spark 2.0 to explore data, perform interactive analysis, and run machine learning algorithms. Slides include a link to a notebook with executable code.
1. Scalable Data Science with Spark and R
Zeydy Ortiz, Ph.D. & Rob Montalvo
DataCrunch Lab, LLC
Twitter: @DCrunchLab
PyData Carolinas 2016
2. Tutorial Objectives
Understand basic concepts of Spark
Learn how to explore data in SparkR
Learn how to perform interactive analysis in SparkR
Learn to run machine learning algorithms in SparkR
3. What is Spark?
A distributed computing framework
Provides a programming abstraction and a parallel runtime to hide the complexities of fault tolerance and slow machines
Partitions big data across multiple machines and stores the data in those machines' memory
5. Why Spark?
Spark is fast...
Massively parallel
Minimizes I/O bottlenecks by storing data in memory
Spark is partitioning-aware to avoid network-intensive shuffles
7. Spark Concepts - Driver and Executors
A Spark program consists of two programs:
Driver program
runs on one machine ("main")
Executor program
runs either on cluster nodes or in local threads on the same machine
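As a minimal sketch of this split: in SparkR 2.0 the R process that creates the Spark session is the driver. The `master` and `appName` values below are illustrative; `local[*]` runs executors as threads on the same machine, while on a cluster you would point `master` at the cluster manager instead.

```r
library(SparkR)

# The R process running this script becomes the driver ("main").
# "local[*]" launches executors as local threads, one per available core.
sparkR.session(master = "local[*]", appName = "SparkR-tutorial")

# ... distributed analysis runs here, work is shipped to the executors ...

# Stop the session and release executor resources when done.
sparkR.session.stop()
```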
8. Spark Concepts – Resilient Distributed Dataset (RDD)
“The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
9. Spark Concepts - SparkDataFrame
"A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood."
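For example, a local R data frame can be promoted to a distributed SparkDataFrame. A sketch assuming an active Spark session, using R's built-in `faithful` dataset (the same Old Faithful data explored later in the tutorial):

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation is available

# Convert R's built-in faithful data frame into a distributed SparkDataFrame
df <- createDataFrame(faithful)

# Like a relational table, it carries a schema:
# two columns, eruptions and waiting, both of type double
printSchema(df)
```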
10. Spark Concepts – SparkDataFrame Properties
Immutable; once created, cannot be changed
Distributed across all Executors
Can be created from many sources (HDFS, text files, JSON, Parquet, Hive, ...)
Can be cached in memory for later reuse (an optimization done by you)
Must have a schema (columns, each with a name and type)
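Two of these properties can be sketched in a few lines — creating a SparkDataFrame from an external source and caching it for reuse. The JSON file path below is hypothetical:

```r
library(SparkR)
sparkR.session()

# Create a SparkDataFrame from a JSON source (path is hypothetical)
people <- read.df("/data/people.json", source = "json")

# Cache it in executor memory so later actions reuse it
# instead of re-reading the file
cache(people)

# The schema (column names and types) was inferred from the JSON structure
printSchema(people)
```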
11. Spark Concepts - Operations
Transformations
Are lazily evaluated (part of an execution plan); are not immediately executed
Are executed only when an action is invoked
Create a new SparkDataFrame from an existing one
Actions
The mechanism to get results out of Spark
Trigger the execution of "the execution plan"
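The distinction shows up clearly in a short sketch (assuming an active session; the `70`-second threshold is an arbitrary example). Nothing runs on the cluster until an action such as `count` or `head` is invoked:

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(faithful)

# Transformations: each returns a new SparkDataFrame and only extends
# the execution plan -- no work is executed yet
long_waits <- filter(df, df$waiting > 70)
eruptions  <- select(long_waits, "eruptions")

# Actions: trigger execution of the plan and return results to the driver
count(long_waits)   # number of rows matching the filter
head(eruptions)     # first few rows of the selected column
```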
12. Setting up for the Tutorial
Prerequisites:
Databricks Community Edition account
https://databricks.com/ce
1. Log into your Databricks account
https://community.cloud.databricks.com/
2. Import tutorial notebook
http://bit.ly/DCL-SparkR
13. Import Notebook
1. In a separate window (or tab), point your browser to http://bit.ly/DCL-SparkR
2. Copy to the clipboard the URL that the bit.ly link resolves to.
3. On the Databricks UI, click on Workspace. The Workspace view comes up.
4. Click on the right-most dropdown arrow (next to your User ID).
5. Select Import. The Import Item window comes up.
6. Select the URL radio button, and paste the URL obtained in step 2 into the field.
7. Click Import.
15. Exploring Old Faithful Geyser Data
Old Faithful, named by members of the 1870 Washburn Expedition, was once called “Eternity’s Timepiece” because of the regularity of its eruptions. Despite the myth, this geyser has never erupted at exact hourly intervals, nor is it the largest or most regular geyser in Yellowstone.
Questions to explore:
• Historically, how long has the wait between eruptions been?
• How long does an eruption usually last?
• What is the most common wait time between eruptions?
• How long do eruptions last for the most common wait time?
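As a hedged sketch of how these questions could be approached in SparkR (assuming R's built-in `faithful` data has been loaded into a SparkDataFrame; the wait value `78` in the last step is a placeholder, not the actual answer):

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(faithful)

# Q1 & Q2: summary statistics for wait times and eruption durations
showDF(summary(select(df, "waiting", "eruptions")))

# Q3: most common wait time -- count eruptions per waiting value,
# then sort by frequency
waits <- count(groupBy(df, "waiting"))
head(arrange(waits, desc(waits$count)))

# Q4: eruption durations for a given wait time
# (substitute the most common value found above; 78 is a placeholder)
head(filter(df, df$waiting == 78))
```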
17. Summary
Explained basic concepts of Spark
Learned how to explore data in SparkR
Learned how to perform interactive analysis in SparkR
Learned to run machine learning algorithms in SparkR
18. Thank You!
Zeydy Ortiz, Ph.D. & Rob Montalvo
zortiz @ datacrunchlab.com
rmontalvo @ datacrunchlab.com
Twitter: @DCrunchLab