Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
SparkR
The Past, Present and Future
Shivaram Venkataraman
Big Data & R
DataFrames
Visualization
Libraries
Data+
Big Data & R
Big Data
Small Learning
Partition
Aggregate
Large Scale
Machine Learning
1. Big Data, Small Learning
Data
Cleaning
Filtering
Aggregation
Collect
Subset
DataFrames
Visualization
Libraries
2(a). Partition Aggregate
Data
Collect
Subset
Best
Model
Params
Parameter Tuning
2(b). Partition Aggregate
Data Combine
Models
Model Averaging
3. Large Scale Machine Learning
Data Featurize Learning Model
Big Data & R
Big Data
Small Learning
Partition
Aggregate
Large Scale
Machine Learning
SparkR:
Unified approach
Outline
Project History
Current Release
SparkR Future
Speed
Scalable
Flexible
Statistics
Visualization
DataFrames
RDD
Parallel Collection
Transformations
map
filter
groupBy
…
Actions
count
collect
saveAsTextFile
…
R + RDD =
RRDD
lapply
lapplyPartition
groupByKey
collect
cache
…
broadcast
includePackage
textFile
Example: Word Count
library(SparkR)	
  
lines	
  <-­‐	
  textFile(sc,	
  “hdfs://my_text_file”)	
  
words	
  <-­‐	
  flatM...
Initial Prototype
Standalone R package
Install from github
Open Source
Development
1. Architecture
2. Usability
Architecture
Local Worker
Worker
Architecture
Local Worker
Worker
R
Architecture
Local Worker
Worker
R
Spark
Context
Java
Spark
Context
R-JVM
bridge
Architecture
Local Worker
Worker
R
Spark
Context
Java
Spark
Context
R-JVM
bridge
Spark
Executor
exec
R
Spark
Executor
exec...
Architecture
Local Worker
Worker
R
Spark
Context
Java
Spark
Context
R-JVM
bridge
Spark
Executor
exec
R
Spark
Executor
exec...
R-JVM Bridge
Layer to call JVM
methods directly from R
Automatic argument
serialization
result	
  <-­‐	
  	
  
	
  	
  cal...
R-JVM Bridge
Use sockets for
communication
Supported across
platforms, languages
R
JVM
Netty
Server
Usability
Need for Data Inputs
Read in CSV, JSON, JDBC etc.
High-level API for data manipulation
SparkR DataFrames
people	
  <-­‐	
  read.df(	
  
	
  	
  	
  	
  “people.json”,	
  
	
  	
  	
  	
  “json”)	
  
	
  
avgAg...
SparkR DataFrames
Scala Optimizations
Released in Spark 1.4 !
0
 1
 2
 3
SparkR DataFrame
Scala DataFrame
Python DataFrame...
SparkR Future
Big Data & R
Big Data
Small Learning
Partition
Aggregate
Large Scale
Machine Learning
Big Data, Small Learning
SparkR DataFrames: Read input, aggregation
Collect results, apply machine learning
Upcoming featu...
Partition Aggregate
Upcoming feature:
Simple, parallel API for SparkR
Ex: Parameter tuning, Model Averaging
Integrated wit...
Large Scale Machine Learning
Integration with MLLib
Support for GLM, KMeans etc.
	
  
	
  
model	
  <-­‐	
  glm(	
  
	
  a...
Large Scale Machine Learning
Key Features
DataFrame inputs
R-like formulas
Model statistics
	
  
	
  
model	
  <-­‐	
  glm...
Extensibility
Existing data sources
R package support on
spark-packages.org
Example packages
	
  
	
  
	
  
./bin/sparkR	
...
Developer Community
>20 contributors including
AMPLab, Databricks, Alteryx, Intel
R and Scala contributions welcome !
SparkR
Big data processing from R
DataFrames in Spark 1.4
Future: Large Scale ML & more
Upcoming SlideShare
Loading in …5
×

SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui Sun, UC Berkeley AMPLAB and Intel)

6,779 views

Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui Sun, UC Berkeley AMPLAB and Intel)

  1. 1. SparkR The Past, Present and Future Shivaram Venkataraman
  2. 2. Big Data & R DataFrames Visualization Libraries Data+
  3. 3. Big Data & R Big Data Small Learning Partition Aggregate Large Scale Machine Learning
  4. 4. 1. Big Data, Small Learning Data Cleaning Filtering Aggregation Collect Subset DataFrames Visualization Libraries
  5. 5. 2(a). Partition Aggregate Data Collect Subset Best Model Params Parameter Tuning
  6. 6. 2(b). Partition Aggregate Data Combine Models Model Averaging
  7. 7. 3. Large Scale Machine Learning Data Featurize Learning Model
  8. 8. Big Data & R Big Data Small Learning Partition Aggregate Large Scale Machine Learning SparkR: Unified approach
  9. 9. Outline Project History Current Release SparkR Future
  10. 10. Speed Scalable Flexible Statistics Visualization DataFrames
  11. 11. RDD Parallel Collection Transformations map filter groupBy … Actions count collect saveAsTextFile …
  12. 12. R + RDD = RRDD lapply lapplyPartition groupByKey collect cache … broadcast includePackage textFile
  13. 13. Example: Word Count library(SparkR)   lines  <-­‐  textFile(sc,  “hdfs://my_text_file”)   words  <-­‐  flatMap(lines,                                    function(line)  {                                        strsplit(line,  "  ")[[1]]                                    })   wordCount  <-­‐  lapply(words,              function(word)  {            list(word,  1L)              })   counts  <-­‐  reduceByKey(wordCount,  "+",  2L)   output  <-­‐  collect(counts)    
  14. 14. Initial Prototype Standalone R package Install from github
  15. 15. Open Source Development 1. Architecture 2. Usability
  16. 16. Architecture Local Worker Worker
  17. 17. Architecture Local Worker Worker R
  18. 18. Architecture Local Worker Worker R Spark Context Java Spark Context R-JVM bridge
  19. 19. Architecture Local Worker Worker R Spark Context Java Spark Context R-JVM bridge Spark Executor exec R Spark Executor exec R
  20. 20. Architecture Local Worker Worker R Spark Context Java Spark Context R-JVM bridge Spark Executor exec R Spark Executor exec R
  21. 21. R-JVM Bridge Layer to call JVM methods directly from R Automatic argument serialization result  <-­‐        callJStatic(          “sparkr.RRDD”,            “someMethod”,            arg1,            arg2)  
  22. 22. R-JVM Bridge Use sockets for communication Supported across platforms, languages R JVM Netty Server
  23. 23. Usability Need for Data Inputs Read in CSV, JSON, JDBC etc. High-level API for data manipulation
  24. 24. SparkR DataFrames people  <-­‐  read.df(          “people.json”,          “json”)     avgAge  <-­‐  select(          df,            avg(df$age))     head(avgAge)   DataSources API Support for schema dplyr-like syntax
  25. 25. SparkR DataFrames Scala Optimizations Released in Spark 1.4 ! 0 1 2 3 SparkR DataFrame Scala DataFrame Python DataFrame Time (s) Demo: github.com/cafreeman/SparkR_DataFrame_Demo
  26. 26. SparkR Future
  27. 27. Big Data & R Big Data Small Learning Partition Aggregate Large Scale Machine Learning
  28. 28. Big Data, Small Learning SparkR DataFrames: Read input, aggregation Collect results, apply machine learning Upcoming features: Support for R transformations More column functions (e.g. math, strings)
  29. 29. Partition Aggregate Upcoming feature: Simple, parallel API for SparkR Ex: Parameter tuning, Model Averaging Integrated with DataFrames Use existing R packages
  30. 30. Large Scale Machine Learning Integration with MLLib Support for GLM, KMeans etc.     model  <-­‐  glm(    a  ~  b  +  c,        data  =  df)    
  31. 31. Large Scale Machine Learning Key Features DataFrame inputs R-like formulas Model statistics     model  <-­‐  glm(    a  ~  b  +  c,        data  =  df)     summary(model)  
  32. 32. Extensibility Existing data sources R package support on spark-packages.org Example packages       ./bin/sparkR        -­‐-­‐packages  spark-­‐csv  
  33. 33. Developer Community >20 contributors including AMPLab, Databricks, Alteryx, Intel R and Scala contributions welcome !
  34. 34. SparkR Big data processing from R DataFrames in Spark 1.4 Future: Large Scale ML & more

×