useR tutorial part 1: Introduction to SparkR

Presentation given at useR 2016 at http://user2016.org/tutorials/11.html

  1. Introduction to SparkR. Shivaram Venkataraman, Hossein Falaki
  2. Big Data & R. DataFrames, Visualization, Libraries, Data+
  3. Big Data & R: Challenges. Data access: HDFS, Hive. Capacity: single machine memory. Parallelism: single thread.
  4. Apache Spark. Engine for large-scale data processing. Fast, easy to use. Runs everywhere: EC2, clusters, laptop, etc.
  5. Speed, Scalable, Flexible. Statistics, Visualization, DataFrames. SparkR.
  6. Big Data & R: Patterns. Big Data, Small Learning; Partition Aggregate; Large Scale Machine Learning.
  7. Pattern 1: Big Data, Small Learning. Data; Cleaning, Filtering, Aggregation; Collect; Subset; DataFrames, Visualization, Libraries.
  8. Pattern 1: Big Data, Small Learning. Code: songs <- read.df("songs.json", "json"); newSongs <- filter(songs, songs$year > 2000); ggplot(collect(newSongs)). Data; Cleaning, Filtering, Aggregation; Collect; Subset.
  9. Pattern 2: Partition Aggregate. Data, Params, Best Model. Parameter tuning.
  10. Pattern 2: Partition Aggregate. Code: params <- c(1e-3, 1e-1, 1e2); data <- read.csv("t.csv"); train <- function(prm) { lm.ridge(y ~ x + z, data, lambda = prm) }; lapply(params, train). Data, Params, Best Model.
  11. Pattern 3: Large Scale Machine Learning. Data, Featurize, Learning, Model.
  12. Pattern 3: Large Scale Machine Learning. Data, Featurize, Learning, Model. Code: training <- read.csv("t.csv"); model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training); summary(model)
  13. Big Data & R. Big Data, Small Learning; Partition Aggregate; Large Scale Machine Learning. SparkR: unified approach.
  14. SparkR DataFrames. Code: people <- read.df("people.json", "json"); avgAge <- select(people, avg(people$age)); head(avgAge). Number of data sources. Column functions, SQL. Support for R UDFs.
  15. Large Scale Machine Learning. Integration with MLlib. Key features: R-like formulas, model statistics. Code: model <- glm(a ~ b + c, data = df); summary(model)
  16. Partition Aggregate. spark.lapply: simple, parallel API. Examples: parameter tuning, model averaging. Include existing R packages.
  17. SparkR Status. Open source, part of Apache Spark. More than 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc. Contributions welcome!
  18. Tutorial Outline. Part 1: Data Exploration. ETL: data loading, schema. Exploration: filter, clean, aggregate, etc. Visualization: integration with ggplot. Part 2: Advanced Analytics (after the break).
  19. Tutorial Setup. Each user gets a dedicated micro cluster. Cluster is terminated after 1 hour of inactivity. Multiple users can collaborate on a notebook. Notebooks can be exported/imported. Examples and tutorials in R/Python/Scala. Free online service for learning Apache Spark.
  20. Tutorial Setup. Databricks Notebooks: interactive workspace; Markdown + R, Python, Scala, SQL. Sign up at http://databricks.com/ce
  21. Tutorial Setup. Fill out our survey at tiny.cc/sparkr-user-survey
  22. SparkR. Big data processing from R. DataFrames for ETL, data exploration. Support for advanced analytics.
  23. Tutorial Next Steps. Sign up at http://databricks.com/ce; Part 1: tiny.cc/sparkr-tutorial-part1; fill out our survey at tiny.cc/sparkr-user-survey
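
The "Big Data, Small Learning" pattern from the slides (filter on the cluster, then collect a small subset into a local data.frame) might look roughly like the sketch below. It assumes a Spark 2.x installation with the SparkR package available and a local master. The in-memory `localSongs` table is a hypothetical stand-in for the `songs.json` file named on the slide, which is not available here.

```r
library(SparkR)

# Assumption: a local Spark install; "local[2]" just means two local threads.
sparkR.session(master = "local[2]", appName = "sparkr-small-learning-sketch")

# Hypothetical stand-in for songs.json: a tiny local table.
localSongs <- data.frame(
  title = c("a", "b", "c", "d"),
  year  = c(1995L, 2003L, 2010L, 1988L),
  stringsAsFactors = FALSE
)

# Distribute the data as a SparkDataFrame.
songs <- createDataFrame(localSongs)

# The filter runs in Spark; nothing is pulled into R yet.
newSongs <- filter(songs, songs$year > 2000)

# collect() brings the (now small) result back as an ordinary data.frame.
localNew <- collect(newSongs)
print(localNew)

sparkR.session.stop()
```

Once collected, `localNew` is a plain R data.frame, so it can be handed to ggplot2 or any other single-machine library, as the slide suggests.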

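
The "Partition Aggregate" pattern (fit one model per parameter value, then compare) can be sketched on a single machine with plain `lapply`. This is an illustration under assumptions, not the presenters' code: the built-in `mtcars` data and the formula `mpg ~ wt + hp` stand in for the slide's `t.csv` and `y ~ x + z`, and `lambda` is passed by name because the third positional argument of `lm.ridge` is `subset`.

```r
library(MASS)  # provides lm.ridge

# Candidate ridge penalties, as on the slide.
lambdas <- c(1e-3, 1e-1, 1e2)

# Fit one ridge regression per penalty and return its coefficients.
# mtcars and the formula are stand-ins for the slide's data and model.
train <- function(lambda) {
  fit <- lm.ridge(mpg ~ wt + hp, data = mtcars, lambda = lambda)
  coef(fit)
}

# Single-machine version: one independent fit per parameter value.
fits <- lapply(lambdas, train)
names(fits) <- as.character(lambdas)
print(fits)

# With an active SparkR session, spark.lapply(lambdas, train) would
# distribute the same independent calls across the cluster instead.
```

Because each call to `train` is independent, the only change needed to parallelize it is swapping `lapply` for `spark.lapply`. Packages used inside the function (here MASS) would need to be installed on the workers.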
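
For the "Large Scale Machine Learning" pattern, the slides show `glm` with an R-like formula applied to a SparkR DataFrame. A minimal sketch follows, again assuming Spark 2.x with SparkR installed. The `iris` data set is used only because it ships with R. The column names and formula are illustrative, not the `delay ~ Distance + Dest` model from the slide.

```r
library(SparkR)

# Assumption: a local Spark install.
sparkR.session(master = "local[2]", appName = "sparkr-glm-sketch")

# createDataFrame replaces "." in column names with "_",
# so Sepal.Length becomes Sepal_Length.
df <- createDataFrame(iris)

# The formula interface matches base R; the fit runs in Spark.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")

# Model statistics come back to R via summary().
print(summary(model))

sparkR.session.stop()
```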