Introduction to SparkR
Nitay Alon
Data Scientist, IBM
• Target audience: data scientists, R users, and Big Data developers.
• Goals: Introduction to SparkR.
• Natural transition for R users.
Today's talk:
• What is SparkR?
• Simple EDA (exploratory data analysis)
• Plotting with SparkR
• Simple regression with SparkR
• Installing and configuring SparkR.
What is SparkR?
• SparkR is an R package that lets us run R on a Spark environment. The current version (and the one I'm showing you today) is 1.6.1.
• SparkR uses a data frame implementation to aggregate, select, and summarize data.
• A word on lazy evaluation.
Why Spark?
• https://spark.apache.org/docs/latest/sparkr.html
• Speed.
• Easy to use (both installing and working with it).
• Integration with Hadoop: Spark lets you work with HDFS.
• In-memory processing.
• Replacing MapReduce.
Why R?
• R vs. Python.
• Fast-growing language.
• Support for (almost) any data-science-related task.
• Thousands of packages.
• Visualization.
Why SparkR?
• SparkR sits on top of Spark.
• It exposes the RDD API of Spark as distributed lists in R.
• The architecture, from top to bottom:
  • SparkR: data analysis
  • Spark: processing
  • Mesos/YARN: management
  • HBase/HDFS/Cassandra: database
Loading the data:
• From HDFS, through SparkR, into a SparkR DF.
• From a local CSV file into a SparkR DF.
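A minimal sketch of both loading paths, assuming the SparkR 1.6.x API and the external spark-csv package for CSV support. The package coordinates, file paths, and namenode address are hypothetical placeholders, not values from the talk.

```r
library(SparkR)

# Assumption: spark-csv is pulled in as an external package; the exact
# coordinates and version below are illustrative.
sc <- sparkR.init(master = "local[*]", appName = "loading-data",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
sqlContext <- sparkRSQL.init(sc)

# Local CSV -> SparkR DF (hypothetical file path).
loans <- read.df(sqlContext, "file:///tmp/loans.csv",
                 source = "com.databricks.spark.csv",
                 header = "true", inferSchema = "true")

# HDFS -> SparkR DF (hypothetical namenode address and path).
loans_hdfs <- read.df(sqlContext, "hdfs://namenode:8020/data/loans.csv",
                      source = "com.databricks.spark.csv",
                      header = "true", inferSchema = "true")
```

In both cases the result is a distributed SparkR DF, not a local R data.frame.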
First look at the data: the same functions as in R
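As a sketch of what "the same functions as in R" can look like, the block below assumes SparkR 1.6.x and uses a small made-up data set (the column names `age` and `loan_amount` are hypothetical) in place of the loan data.

```r
library(SparkR)

sc <- sparkR.init(master = "local[*]", appName = "first-look")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical toy data standing in for the loan data set.
local_loans <- data.frame(age = c(25, 32, 47, 51, 38, 29),
                          loan_amount = c(1000, 5000, 5000, 10000, 1000, 10000))
loans <- createDataFrame(sqlContext, local_loans)

head(loans)                  # first rows, returned as a local data.frame
printSchema(loans)           # column names and types
n_rows <- nrow(loans)        # row count (triggers a Spark job)
col_names <- columns(loans)  # column names as a character vector
showDF(describe(loans))      # summary statistics for the numeric columns
```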
Sub selection
Suppose we want the average age of the borrowers for each loan amount:
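One way such an aggregation might be written, assuming SparkR 1.6.x. The data and the column names `age` and `loan_amount` are hypothetical stand-ins for the loan data set.

```r
library(SparkR)

sc <- sparkR.init(master = "local[*]", appName = "avg-age-by-loan")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical toy data standing in for the loan data set.
local_loans <- data.frame(age = c(25, 32, 47, 51, 38, 29),
                          loan_amount = c(1000, 5000, 5000, 10000, 1000, 10000))
loans <- createDataFrame(sqlContext, local_loans)

# Group by loan amount, then average the age within each group.
avg_age_by_loan <- agg(groupBy(loans, loans$loan_amount),
                       mean_age = avg(loans$age))

# Nothing is computed until an action such as collect() is called.
result <- collect(arrange(avg_age_by_loan, avg_age_by_loan$loan_amount))
print(result)
```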
Main methods for subset selection:
1) select: returns a subset consisting of the given columns.
2) filter: returns all the rows that satisfy a condition on a column.
3) selectExpr: modifies a column (or adds one) using a SQL expression.
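The three methods can be sketched as follows, assuming SparkR 1.6.x. The data set, the column names, and the derived column `loan_thousands` are hypothetical.

```r
library(SparkR)

sc <- sparkR.init(master = "local[*]", appName = "sub-selection")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical toy data standing in for the loan data set.
local_loans <- data.frame(age = c(25, 32, 47, 51, 38, 29),
                          loan_amount = c(1000, 5000, 5000, 10000, 1000, 10000))
loans <- createDataFrame(sqlContext, local_loans)

# 1) select: keep only the named columns.
ages <- select(loans, "age")

# 2) filter: keep only the rows matching a column condition.
older <- filter(loans, loans$age > 40)

# 3) selectExpr: derive a column with a SQL expression.
scaled <- selectExpr(loans, "age", "loan_amount / 1000 as loan_thousands")

n_older <- nrow(older)
scaled_cols <- columns(scaled)
head(scaled)
```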
Plotting with SparkR:
There are 2 ways to plot data in SparkR:
1) Using ggplot2: arguably the most common and popular visualization package in R. It won't work directly on a SparkR DF.
2) Using ggplot2.SparkR: a custom package for plotting in the SparkR environment.
Plotting with SparkR: Using ggplot2
[Diagram: client environment and cluster environment]
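A sketch of the ggplot2 route, assuming SparkR 1.6.x and an installed ggplot2: the data is aggregated on the cluster, collected to the client as a local data.frame, and only then plotted. The data and column names are hypothetical.

```r
library(SparkR)
library(ggplot2)

sc <- sparkR.init(master = "local[*]", appName = "ggplot2-collect")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical toy data standing in for the loan data set.
local_loans <- data.frame(age = c(25, 32, 47, 51, 38, 29),
                          loan_amount = c(1000, 5000, 5000, 10000, 1000, 10000))
loans <- createDataFrame(sqlContext, local_loans)

# Cluster side: reduce the data to one row per age.
# Client side: collect() brings the small result back as a data.frame.
age_counts <- collect(count(groupBy(loans, "age")))

# ggplot2 now works, because age_counts is an ordinary local data.frame.
p <- ggplot(age_counts, aes(x = age, y = count)) +
  geom_bar(stat = "identity")
print(p)
```

Aggregating before collecting keeps the amount of data moved to the client small, which matters because the client is a single node.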
Plotting with SparkR: Using ggplot2.SparkR
[Diagram: client environment and cluster environment]
Plotting with SparkR: ggplot2.SparkR
3) Plotting a simple histogram of the ages:
[Figure: histogram, x-axis: Age, y-axis: count]
Plotting with SparkR: ggplot2.SparkR
4) Supported plots:
• Bar
• Box
• Histogram
• Frequency
• Stat_sum
• Bin2d (heatmaps)
5) Supported methods:
• Positions: stack, fill, dodge.
• Facets: wrap, grid, null.
• Scales: x_log, y_log.
Comparing ggplot2 and ggplot2.SparkR
Criteria | ggplot2 | ggplot2.SparkR
Plotting options | Full | Partial
Plotting in a Big Data environment | Requires collect, select, groupBy, and other functions | Easy to use, as if the data frame were stored locally
Scaling | Limited by the size of the single node | Grows linearly with the number of nodes
Machine learning with SparkR
• Spark MLlib:
• Holds many ML algorithms.
• Optimized for big data performance.
• Available as one of Spark's built-in libraries.
• GLM library for R.
• R glm function.
• Logistic regression model:
• $\pi = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$
• We will use the data on the clients to build a prediction model.
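A minimal sketch of fitting such a model with the SparkR `glm` function, assuming SparkR 1.6.x. The data set and the column names `age`, `loan_amount`, and `defaulted` are hypothetical, and the 0/1 response is an assumption about how the client data is coded.

```r
library(SparkR)

sc <- sparkR.init(master = "local[*]", appName = "glm-sketch")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical client data: defaulted is a 0/1 response.
local_loans <- data.frame(
  age         = c(25, 32, 47, 51, 38, 29, 45, 30),
  loan_amount = c(1000, 5000, 5000, 10000, 1000, 10000, 10000, 1000),
  defaulted   = c(0, 0, 1, 1, 0, 1, 0, 1))
loans <- createDataFrame(sqlContext, local_loans)

# Logistic regression: the binomial family corresponds to the model above.
model <- glm(defaulted ~ age + loan_amount, data = loans, family = "binomial")
summary(model)

# predict() returns a SparkR DF with an added "prediction" column.
predictions <- predict(model, loans)
scored <- collect(select(predictions, "defaulted", "prediction"))
print(scored)
```

The formula interface mirrors base R's `glm`, but the fit runs on Spark and the result of `predict` stays distributed until it is collected.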
• Supported families: binomial, Gaussian.
• Supported link functions: ongoing work (JIRA).
• Other GLMs?
• Other platforms: PySpark.
What's next?
• SparkR 1.6+
• Incorporating MLlib with R.
• More algorithms.
• More visualization.
• More power for you to enhance your analytics.
IBM's Analytics Team
Questions?
Starting SparkR on RStudio:
• Installing SparkR:
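A sketch of a typical RStudio setup for SparkR 1.6.x, assuming Spark has already been downloaded and unpacked. The `SPARK_HOME` path is a hypothetical placeholder, and `faithful` is a data set that ships with R.

```r
# Assumption: Spark 1.6.1 is unpacked at this (hypothetical) location.
Sys.setenv(SPARK_HOME = "/opt/spark-1.6.1-bin-hadoop2.6")

# The SparkR package ships inside the Spark distribution, under R/lib.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Start a local Spark context and the SQL context used for data frames.
sc <- sparkR.init(master = "local[*]", appName = "SparkR-intro")
sqlContext <- sparkRSQL.init(sc)

# Smoke test: turn a built-in R data.frame into a SparkR DF.
df <- createDataFrame(sqlContext, faithful)
head(df)

# Shut the Spark context down when finished.
sparkR.stop()
```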
