useR tutorial part 1: Introduction to SparkR

Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki

Big Data & R
DataFrames
Visualization
Libraries
Data+

Big Data & R: Challenges
Data access: HDFS, Hive
Capacity: Single machine memory
Parallelism: Single thread

Apache Spark
Engine for large-scale data processing
Fast, Easy to Use
Runs Everywhere
EC2, clusters, laptop etc.

Speed
Scalable
Flexible
Statistics
Visualization
DataFrames
SparkR

Big Data & R: Patterns
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

1. Big Data, Small Learning
Data
Cleaning
Filtering
Aggregation
Collect
Subset
DataFrames
Visualization
Libraries

1. Big Data, Small Learning
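# Load the data as a distributed SparkR DataFrame
songs <- read.df("songs.json", "json")
# Filter on the cluster
newSongs <- filter(songs, songs$year > 2000)
# collect() brings the filtered rows back as a local R data.frame
ggplot(collect(newSongs))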
Data
Cleaning
Filtering
Aggregation
Collect
Subset
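
The pattern on this slide can be sketched end to end as follows. This is a minimal sketch, assuming Spark 2.x or later with a local Spark installation; the temporary JSON file and its four rows are made up here as a stand-in for songs.json, and the per-year count is just one possible aggregation.

```r
library(SparkR)

sparkR.session()  # assumes a local Spark installation; starts or reuses a session

# Hypothetical stand-in for the slide's songs.json: one JSON object per line.
path <- tempfile(fileext = ".json")
writeLines(c(
  '{"title": "a", "year": 1998}',
  '{"title": "b", "year": 2005}',
  '{"title": "c", "year": 2005}',
  '{"title": "d", "year": 2012}'
), path)

songs <- read.df(path, "json")

# Filtering and aggregation run on the cluster.
newSongs <- filter(songs, songs$year > 2000)
perYear <- summarize(groupBy(newSongs, newSongs$year), count = n(newSongs$year))

# collect() returns a small local data.frame for plotting or local analysis.
perYearLocal <- collect(perYear)
perYearLocal <- perYearLocal[order(perYearLocal$year), ]
print(perYearLocal)
```

The collected data.frame is small, so it can be handed to any local library, for example ggplot(perYearLocal, aes(year, count)) with a bar geom.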

2. Partition Aggregate
Data
Params
Best Model
Parameter Tuning

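library(MASS)  # provides lm.ridge
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
train <- function(prm) {
  # Pass the penalty by name: the third positional argument of lm.ridge is `subset`
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
lapply(params, train)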
2. Partition Aggregate
Data
Params
Best Model

3. Large Scale Machine Learning
Data Featurize Learning Model

3. Large Scale Machine Learning
Data Featurize Learning Model
training <- read.csv("t.csv")
# Fit on the data frame loaded above (the slide passed an undefined `data`)
model <- glm(delay ~ Distance + Dest,
             family = "gaussian",
             data = training)
summary(model)

Big Data & R
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
SparkR: Unified approach

SparkR DataFrames
people <- read.df("people.json", "json")
# Column function avg() is evaluated by Spark, not in local R
avgAge <- select(people, avg(people$age))
head(avgAge)
Number of data sources
Column Functions, SQL
Support for R UDFs
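
The three features listed above can be sketched together. This is a minimal sketch, assuming Spark 2.x or later; the three-row data.frame is made-up toy data standing in for people.json, and the ageNextYear column is a hypothetical example of what an R UDF might compute.

```r
library(SparkR)

sparkR.session()  # assumes a local Spark installation

# Hypothetical toy data standing in for people.json.
people <- createDataFrame(data.frame(
  name = c("Ann", "Bob", "Cy"),
  age = c(31, 45, 26),
  stringsAsFactors = FALSE
))

# Column functions: the average is computed by Spark.
avgAge <- collect(select(people, avg(people$age)))

# SQL: register the DataFrame as a temporary view and query it.
createOrReplaceTempView(people, "people")
adults <- collect(sql("SELECT name FROM people WHERE age > 30 ORDER BY name"))

# R UDF: dapply runs an R function on each partition.
# The output schema has to be declared up front.
schema <- structType(
  structField("name", "string"),
  structField("age", "double"),
  structField("ageNextYear", "double")
)
withNext <- dapply(people, function(part) {
  cbind(part, ageNextYear = part$age + 1)
}, schema)
result <- collect(withNext)
```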

Large Scale Machine Learning
Integration with MLlib
Key Features
R-like formulas
Model statistics
model <- glm(
a ~ b + c,
data = df)
summary(model)
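
A runnable variant of the snippet above, as a minimal sketch assuming Spark 2.x or later. The built-in iris data set is used here only as a stand-in; it is not the tutorial's data.

```r
library(SparkR)

sparkR.session()  # assumes a local Spark installation

# createDataFrame replaces "." in column names with "_" and warns about it.
df <- suppressWarnings(createDataFrame(iris))

# R-like formula; the fit itself is carried out by MLlib.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")
summary(model)

# predict() returns a SparkDataFrame with an added "prediction" column.
preds <- predict(model, df)
head(select(preds, "Sepal_Length", "prediction"))
```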

Partition Aggregate
spark.lapply: Simple, parallel API
Ex: Parameter tuning, Model Averaging
Include existing R packages
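
Parameter tuning with spark.lapply might look like the sketch below. It assumes Spark 2.x or later and that the MASS package is installed on the workers; mtcars is a stand-in data set, and choosing the penalty by lowest GCV score is one possible aggregation step, not necessarily the one used in the tutorial.

```r
library(SparkR)

sparkR.session()  # assumes a local Spark installation

# Candidate ridge penalties, as on the earlier parameter-tuning slide.
params <- c(1e-3, 1e-1, 1e2)

# Runs on the workers in plain R, using an existing R package (MASS).
train <- function(lambda) {
  fit <- MASS::lm.ridge(mpg ~ wt + hp, data = mtcars, lambda = lambda)
  list(lambda = lambda, gcv = unname(fit$GCV))
}

# Partition: one task per parameter value; results return as a local list.
fits <- spark.lapply(params, train)

# Aggregate on the driver: keep the penalty with the lowest GCV score.
gcv <- sapply(fits, function(f) f$gcv)
best <- fits[[which.min(gcv)]]$lambda
print(best)
```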

SparkR Status
Open source -- Part of Apache Spark
> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx etc.
Contributions welcome!

Tutorial Outline
Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate etc.
• Visualization: Integration with ggplot
Part 2: Advanced Analytics (After the break)

Tutorial Setup
Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
Notebooks can be exported/imported
Examples and tutorials in R/Python/Scala
Free online service for learning Apache Spark

Tutorial Setup
Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
Sign up at http://databricks.com/ce

Tutorial Setup
Fill out our survey at
tiny.cc/sparkr-user-survey

SparkR
Big data processing from R
DataFrames for ETL, data exploration
Support for advanced analytics

Tutorial Next Steps
Sign up at http://databricks.com/ce
Part 1: tiny.cc/sparkr-tutorial-part1
Fill out our survey at tiny.cc/sparkr-user-survey
