SparkR tutorial at PyData Carolinas - Processing large datasets in R has traditionally been limited by the amount of memory in the local system. To overcome this native R limitation, several cluster computing alternatives have recently emerged, including Apache Spark. In this session, we discussed the architecture of Spark and introduced the SparkR library. We worked through examples of the SparkDataFrame API in Spark 2.0 to explore data, perform interactive analysis, and run machine learning algorithms. Slides include a link to a notebook with executable code.
1. Scalable Data Science with Spark and R
Zeydy Ortiz, Ph.D. & Rob Montalvo
DataCrunch Lab, LLC
Twitter: @DCrunchLab
PyData Carolinas 2016
2. Tutorial Objectives
Understand basic concepts of Spark
Learn how to explore data in SparkR
Learn how to perform interactive analysis in SparkR
Learn to run machine learning algorithms in SparkR
3. What is Spark?
A distributed computing framework
Provides a programming abstraction and a parallel runtime to hide the complexities of fault tolerance and slow machines
Partitions big data across multiple machines and stores the data in those machines' memory
5. Why Spark?
Spark is fast...
Massively parallel
Minimizes I/O bottlenecks by storing data in memory
Spark is partitioning-aware to avoid network-intensive shuffles
7. Spark Concepts - Driver and Executors
A Spark program consists of two programs:
Driver program
runs on one machine ("main")
Executor program
runs either on cluster nodes or in local threads on the same machine
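As a minimal sketch of this split: in SparkR 2.0 the R process that creates the Spark session is the driver. The `master` and `appName` values below are illustrative; `local[*]` runs executors as threads on the same machine, while on a cluster you would point `master` at the cluster manager instead.

```r
library(SparkR)

# The R process running this script becomes the driver ("main").
# "local[*]" launches executors as local threads, one per available core.
sparkR.session(master = "local[*]", appName = "SparkR-tutorial")

# ... distributed analysis runs here, work is shipped to the executors ...

# Stop the session and release executor resources when done.
sparkR.session.stop()
```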
8. Spark Concepts – Resilient Distributed Dataset (RDD)
“The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
9. Spark Concepts - SparkDataFrame
"A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood."
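For example, a local R data frame can be promoted to a distributed SparkDataFrame. A sketch assuming an active Spark session, using R's built-in `faithful` dataset (the same Old Faithful data explored later in the tutorial):

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation is available

# Convert R's built-in faithful data frame into a distributed SparkDataFrame
df <- createDataFrame(faithful)

# Like a relational table, it carries a schema:
# two columns, eruptions and waiting, both of type double
printSchema(df)
```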
10. Spark Concepts – SparkDataFrame Properties
Immutable; once created, cannot be changed
Distributed across all Executors
Can be created from many sources (HDFS, text files, JSON, Parquet, Hive, ...)
Can be cached in memory for later reuse (an optimization done by you)
Must have a schema (columns, each with a name and type)
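Two of these properties can be sketched in a few lines — creating a SparkDataFrame from an external source and caching it for reuse. The JSON file path below is hypothetical:

```r
library(SparkR)
sparkR.session()

# Create a SparkDataFrame from a JSON source (path is hypothetical)
people <- read.df("/data/people.json", source = "json")

# Cache it in executor memory so later actions reuse it
# instead of re-reading the file
cache(people)

# The schema (column names and types) was inferred from the JSON structure
printSchema(people)
```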
11. Spark Concepts - Operations
Transformations
Are lazily evaluated (part of an execution plan); are not immediately executed
Are executed only when an action is invoked
Create a new SparkDataFrame from an existing one
Actions
The mechanism to get results out of Spark
Trigger the execution of "the execution plan"
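The distinction shows up clearly in a short sketch (assuming an active session; the `70`-second threshold is an arbitrary example). Nothing runs on the cluster until an action such as `count` or `head` is invoked:

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(faithful)

# Transformations: each returns a new SparkDataFrame and only extends
# the execution plan -- no work is executed yet
long_waits <- filter(df, df$waiting > 70)
eruptions  <- select(long_waits, "eruptions")

# Actions: trigger execution of the plan and return results to the driver
count(long_waits)   # number of rows matching the filter
head(eruptions)     # first few rows of the selected column
```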
12. Setting up for the Tutorial
Prerequisites:
Databricks Community Edition account
https://databricks.com/ce
1. Log into your Databricks account
https://community.cloud.databricks.com/
2. Import tutorial notebook
http://bit.ly/DCL-SparkR
13. Import Notebook
1. In a separate window (or tab), point your browser to http://bit.ly/DCL-SparkR
2. Copy to the clipboard the URL that the bit.ly link resolves to.
3. On the Databricks UI, click on Workspace. The Workspace view comes up.
4. Click on the right-most dropdown arrow (next to your User ID).
5. Select Import. The Import Item window comes up.
6. Select the URL radio button, and paste the URL obtained in step 2 into the field.
7. Click Import.
15. Exploring Old Faithful Geyser Data
Old Faithful, named by members of the 1870 Washburn Expedition, was once called “Eternity’s Timepiece” because of the regularity of its eruptions. Despite the myth, this geyser has never erupted at exact hourly intervals, nor is it the largest or most regular geyser in Yellowstone.
Questions to explore:
• Historically, how long has the wait between eruptions been?
• How long does an eruption usually last?
• What is the most common wait time between eruptions?
• How long do eruptions last for the most common wait time?
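As a hedged sketch of how these questions could be approached in SparkR (assuming R's built-in `faithful` data has been loaded into a SparkDataFrame; the wait value `78` in the last step is a placeholder, not the actual answer):

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(faithful)

# Q1 & Q2: summary statistics for wait times and eruption durations
showDF(summary(select(df, "waiting", "eruptions")))

# Q3: most common wait time -- count eruptions per waiting value,
# then sort by frequency
waits <- count(groupBy(df, "waiting"))
head(arrange(waits, desc(waits$count)))

# Q4: eruption durations for a given wait time
# (substitute the most common value found above; 78 is a placeholder)
head(filter(df, df$waiting == 78))
```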
17. Summary
Explained basic concepts of Spark
Learned how to explore data in SparkR
Learned how to perform interactive analysis in SparkR
Learned to run machine learning algorithms in SparkR
18. Thank You!
Zeydy Ortiz, Ph.D. & Rob Montalvo
zortiz @ datacrunchlab.com
rmontalvo @ datacrunchlab.com
Twitter: @DCrunchLab