
SparkR - Play Spark Using R (20160909 HadoopCon)


1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR


  1. 1. SparkR - Play Spark Using R Gil Chen @HadoopCon 2016 Demo: http://goo.gl/VF77ad
  2. 2. about me • R, Python & Matlab User • Taiwan R User Group • Taiwan Spark User Group • Co-founder • Data Scientist @
  3. 3. HadoopCon 2015
  4. 4. Outline • Introduction to SparkR • Demo • Starting to use SparkR • DataFrames: dplyr style, SQL style • RDD vs. DataFrames • MLlib: GLM, K-means • Use Case • Median: approxQuantile() • ID Match: dplyr style, SQL style, SparkR function • SparkR + Shiny • The Future of SparkR
  5. 5. Introduction to SparkR
  6. 6. Spark Origin • Apache Spark is an open source cluster computing framework • Originally developed at the University of California, Berkeley's AMPLab • The first 2 contributors of SparkR:
 Shivaram Venkataraman & Zongheng Yang https://amplab.cs.berkeley.edu/
  7. 7. Spark History https://en.wikipedia.org/wiki/Apache_Spark SparkR DataFrames PySpark
  8. 8. Key Advantages of Spark & R + Fast! Flexible Scalable Statistical! Interactive Packages https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
  9. 9. ggplot2 (slide shows Google search results for "ggplot2"): ggplot2 is a plotting system for R, based on the grammar of graphics.
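 A minimal ggplot2 sketch (using R's built-in mtcars dataset, which is not part of the slides) to show the grammar-of-graphics style of mapping data to aesthetics and then adding layers:
     library(ggplot2)
     # map columns to aesthetics, then stack layers on top
     ggplot(mtcars, aes(x = wt, y = mpg)) +
       geom_point() +                      # scatter layer
       geom_smooth(method = "lm")          # linear-fit layer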
  10. 10. Shiny: a web application framework for R. Turn your analyses into interactive web applications; no HTML, CSS, or JavaScript knowledge required. Gallery and more impressive dashboards: http://shiny.rstudio.com/gallery/
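 A minimal, hypothetical Shiny app (not from the slides) illustrating the ui/server split that replaces hand-written HTML, CSS, and JavaScript:
     library(shiny)
     ui <- fluidPage(
       sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
       plotOutput("hist")
     )
     server <- function(input, output) {
       # re-renders automatically whenever the slider changes
       output$hist <- renderPlot(hist(rnorm(input$n)))
     }
     shinyApp(ui, server)   # launch the interactive web app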
  11. 11. Performance https://amplab.cs.berkeley.edu/announcing-sparkr-r-on-spark/ The runtime performance of running group-by aggregation on 10 million integer pairs on a single machine in R, Python and Scala. (using the same dataset as https://goo.gl/iMLXnh) https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
  12. 12. RDD (Resilient Distributed Dataset) https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD Internally, each RDD is characterized by five main properties: 1. A list of partitions 2. A function for computing each split 3. A list of dependencies on other RDDs 4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) https://docs.cloud.databricks.com/docs/latest/courses
  13. 13. RDD dependencies • Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, so the task can run locally without a shuffle (e.g. map, flatMap, filter, sample). • Wide dependency: multiple child partitions may depend on one partition of the parent RDD, so data must be shuffled unless the parents are hash-partitioned (e.g. sortByKey, reduceByKey, groupByKey, join). http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
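 A rough illustration in the 1.6-style private SparkR RDD API that later slides rely on (hence the ::: calls; parallelize, lapply, filterRDD, and reduceByKey are internal functions): the first transformations stay within a partition, while the key-based one forces a shuffle.
     nums    <- SparkR:::parallelize(sc, 1:100)                      # distribute a local vector
     doubled <- SparkR:::lapply(nums, function(x) x * 2)             # narrow: map-like, no shuffle
     evens   <- SparkR:::filterRDD(doubled, function(x) x %% 4 == 0) # narrow: filter, no shuffle
     pairs   <- SparkR:::lapply(evens, function(x) list(x %% 10, x)) # build (key, value) pairs
     sums    <- SparkR:::reduceByKey(pairs, "+", 2L)                 # wide: shuffle by key
     SparkR:::collect(sums)                                          # action triggers the job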
  14. 14. Job Scheduling http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf (in the paper's DAG figure, black rectangles mark partitions that are already in memory)
  15. 15. RDD Operations. Transformations (lazily evaluated; each has a narrow or wide dependency): map(), flatMap(), filter(), mapPartitions(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), … Actions (trigger execution): reduce(), collect(), count(), first(), take(num), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByValue(), countByKey(), foreach(), … http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  16. 16. RDD Example: a chain of transformations ending in an action that returns a value. Schematic (the complete word-count code is on slide 28): rdd <- SparkR:::textFile(sc, "txt") words <- SparkR:::flatMap(rdd, function(line) ...) wordCount <- SparkR:::lapply(words, function(word) ...) counts <- SparkR:::reduceByKey(wordCount, "+", 1) op <- SparkR:::collect(counts)
  17. 17. SparkR RDD & DataFrames (architecture diagram, viewed from the R shell): an RDD behaves like a distributed array, while a DataFrame is a data.frame + schema (renamed SparkDataFrame + schema since v2.0; the DataFrame API arrived before v1.6); both are driven through general operations, transformations, and actions.
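 A small sketch of moving between a local R data.frame and a distributed SparkDataFrame (SparkR 2.0 function names, assuming a running session):
     local_df <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))
     sdf      <- createDataFrame(local_df)   # local data.frame -> SparkDataFrame (schema inferred)
     printSchema(sdf)                        # inspect the attached schema
     back     <- collect(sdf)                # SparkDataFrame -> local data.frame again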
  18. 18. DataFrames are Faster! http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/ Beyond SQL: Speeding up Spark with DataFrames http://www.slideshare.net/databricks/spark-sqlsse2015public
  19. 19. Spark Stack https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html Storage Cluster Manager Processing Engine Access & Interfaces
  20. 20. How does SparkR work? https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
  21. 21. Upgrading From SparkR 1.6 to 2.0 (before 1.6.2 → since 2.0.0): data type naming: DataFrame → SparkDataFrame; read csv: package from Databricks → built-in; built-in functions such as approxQuantile(): not available → available; ML functions: glm → more models (or use sparklyr); SQLContext / HiveContext: sparkRSQL.init(sc) → merged into sparkR.session(); execution messages: very detailed → simple; launch-on-EC2 API: available → removed. https://spark.apache.org/docs/latest/sparkr.html
  22. 22. Demo http://goo.gl/VF77ad
  23. 23. Easy Setup 1. Download Spark 2. Decompress it and note the path 3. Set the path and launch SparkR in R (code on the next slides)
  24. 24. Documents • If you have to use RDD, refer to AMP-Lab github:
 http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/
 and use “:::”
 e.g. SparkR:::textFile, SparkR:::lapply • Otherwise, refer to SparkR official documents:
 https://spark.apache.org/docs/2.0.0/api/R/index.html
  25. 25. Starting to Use SparkR (v1.6.2) # Set Spark path Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/") # Load SparkR library into your R session library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) # Initialize SparkContext, sc:sparkContext sc <- sparkR.init(appName = "Demo_SparkR") # Initialize SQLContext sqlContext <- sparkRSQL.init(sc) # your sparkR script # ... # ... sparkR.stop()
  26. 26. Starting to Use SparkR (v2.0.0) # Set Spark path Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/") # Load SparkR library into your R session library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) # Initialize SparkSession (replaces SparkContext & SQLContext) sc <- sparkR.session(appName = "Demo_SparkR") # Initialize SQLContext (no longer needed since 2.0.0) # sqlContext <- sparkRSQL.init(sc) # your sparkR script # ... # ... sparkR.stop()
  27. 27. DataFrames # Load the flights CSV file using read.df sdf <- read.df(sqlContext,"data_flights.csv", "com.databricks.spark.csv", header = "true") # Filter flights from JFK jfk_flights <- filter(sdf, sdf$origin == "JFK") # Group and aggregate flights to each destination dest_flights <- summarize( groupBy(jfk_flights, jfk_flights$dest), count = n(jfk_flights$dest)) # Running SQL Queries registerTempTable(sdf, "tempTable") training <- sql(sqlContext, "SELECT dest, count(dest) as cnt FROM tempTable WHERE dest = 'JFK' GROUP BY dest")
  28. 28. Word Count # read data into RDD rdd <- SparkR:::textFile(sc, "data_word_count.txt") # split word words <- SparkR:::flatMap(rdd, function(line) { strsplit(line, " ")[[1]] }) # map: give 1 for each word wordCount <- SparkR:::lapply(words, function(word) { list(word, 1) }) # reduce: count the value by key(word) counts <- SparkR:::reduceByKey(wordCount, "+", 2) # convert RDD to list op <- SparkR:::collect(counts)
  29. 29. RDD vs. DataFrames flights_SDF <- read.df(sqlContext, "data_flights.csv", source = "com.databricks.spark.csv", header = "true") SDF_op <- flights_SDF %>% group_by(flights_SDF$hour) %>% summarize(sum(flights_SDF$dep_delay)) %>% collect() flights_RDD <- SparkR:::textFile(sc, "data_flights.csv") RDD_op <- flights_RDD %>% SparkR:::filterRDD(function (x) { x >= 1 }) %>% SparkR:::lapply(function(x) { y1 <- as.numeric(unlist(strsplit(x, ","))[2]) y2 <- as.numeric(unlist(strsplit(x, ","))[6]) return(list(y1,y2))}) %>% SparkR:::reduceByKey(function(x,y) x + y, 1) %>% SparkR:::collect() DataFrames RDD
  30. 30. SparkR on MLlib: SparkR supports a subset of the available R formula operators for model fitting, including ~  .  :  +  and  -, e.g. y ~ x1 + x2 (a few combined examples follow).
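 A hedged sketch of how those operators combine, reusing the flights columns and the train SparkDataFrame built on the next slide (string columns would need to be encodable as factors for the ~ . forms to fit cleanly):
     m_plus        <- spark.glm(train, arr_delay ~ dep_delay + dist)   # two predictors
     m_all         <- spark.glm(train, arr_delay ~ .)                  # every other column
     m_interaction <- spark.glm(train, arr_delay ~ dep_delay : dist)   # interaction term only
     m_drop_one    <- spark.glm(train, arr_delay ~ . - cancelled)      # all columns except one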
  31. 31. Generalized Linear Model, GLM # read data and cache flights_SDF <- read.df("data_flights.csv", source = "csv",
 header = "true", schema) %>% cache() # drop NA flights_SDF_2 <- dropna(flights_SDF, how = "any") # split train/test dataset train <- sample(flights_SDF_2, withReplacement = FALSE, fraction = 0.5, seed = 42) test <- except(flights_SDF_2, train) # building model gaussianGLM <- spark.glm(train, arr_delay ~ dep_delay + dist, 
 family = "gaussian") summary(gaussianGLM) # prediction preds <- predict(gaussianGLM, test)
  32. 32. K-means # read data and cache flights_SDF <- read.df("data_flights.csv", source = "csv",
 header = "true", schema) %>% cache() # drop NA flights_SDF_2 <- dropna(flights_SDF, how = "any") # clustering kmeansModel <- spark.kmeans(flights_SDF_2, ~ arr_delay + 
 dep_delay + dist + flight + dest + cancelled + 
 time + dist, k = 15) summary(kmeansModel) cluster_op <- fitted(kmeansModel) # clustering result kmeansPredictions <- predict(kmeansModel, flights_SDF_2)
  33. 33. Use Case
  34. 34. Median (approxQuantile) gdf <- seq(1,10,1) %>% data.frame() colnames(gdf) <- "seq" sdf <- createDataFrame(gdf) median_val <- approxQuantile(sdf, "seq", 0.5, 0) %>% print() Calculating the median with a plain SQL query is far more complicated: http://www.1keydata.com/tw/sql/sql-median.html
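 The same call also returns several quantiles at once, for example the quartiles of the sdf built above (the final argument of 0 requests exact values at extra computation cost):
     quartiles <- approxQuantile(sdf, "seq", c(0.25, 0.5, 0.75), 0)
     # returns the 25th, 50th and 75th percentiles as a list of numerics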
  35. 35. ID Match ##### method 1 : like dplyr + pipeline join_id_m1 <- join(sdf_1, sdf_2, sdf_1$id1 == sdf_2$id2, "inner") %>% select("id2") %>% collect() ##### method 2 : sql query createOrReplaceTempView(sdf_1, "table1") createOrReplaceTempView(sdf_2, "table2") qry_str <- "SELECT table2.id2 FROM table1 JOIN table2 ON table1.id1 = table2.id2" join_id_m2 <- sql(qry_str) ##### method 3 : SparkR function join_id_m3 <- intersect(sdf_1, sdf_2) %>% collect()
  36. 36. Play Pokemon Go Data with SparkR !!
  37. 37. Application on SparkR: Compute Engine + Web Framework + Interactive Maps. Where is the Dragonite nest?
  38. 38. Port 8080, cluster monitor: capacity of each worker
  39. 39. Port 4040, jobs monitor: cache(SparkDataFrame) makes the first action run longer while the data is materialized; advanced performance view; status of each worker
  40. 40. Some Tricks • Customize the Spark config at launch • cache() • Some code can't run in RStudio; try the terminal instead • Third-party packages, e.g. the Databricks spark-csv package for reading CSV files (a sketch of the first two tricks follows)
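 A hedged sketch of the first two tricks (the config values below are placeholders, not recommendations): pass a sparkConfig list when launching the session, and cache a SparkDataFrame that later queries will reuse.
     sparkR.session(appName = "Demo_SparkR",
                    master  = "local[4]",
                    sparkConfig = list(spark.driver.memory   = "4g",
                                       spark.executor.memory = "2g"))
     # cache a SparkDataFrame reused by several queries; the first action after
     # cache() still pays the full computation cost, later ones read from memory
     flights_SDF <- cache(read.df("data_flights.csv", source = "csv", header = "true"))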
  41. 41. The Future of SparkR • More MLlib APIs • Advanced user-defined functions (a dapply() sketch follows) • The sparklyr package from RStudio
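 On the user-defined-function direction, SparkR 2.0 already ships dapply()/gapply(); a minimal sketch with made-up column names (dapply runs an R function over each partition of a SparkDataFrame and needs a schema describing the returned columns):
     sdf    <- createDataFrame(data.frame(x = 1:10))
     schema <- structType(structField("x", "integer"),
                          structField("x_doubled", "integer"))
     doubled <- dapply(sdf, function(part) cbind(part, part$x * 2L), schema)
     head(collect(doubled))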
  42. 42. Reference • SparkR: Scaling R Programs with Spark, Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016. June 2016.
 https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf • SparkR: Interactive R programs at Scale, Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco.
 https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf • Apache Spark Official Research
 http://spark.apache.org/research.html
 - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
 - http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf • Apache Spark Official Document
 http://spark.apache.org/docs/latest/api/scala/ • AMPlab UC Berkeley - SparkR Project
 https://github.com/amplab-extras/SparkR-pkg • Databricks Official Blog
 https://databricks.com/blog/category/engineering/spark • R-blogger: Launch Apache Spark on AWS EC2 and Initialize SparkR Using Rstudio
 https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
  43. 43. RStudio on Amazon EC2
  44. 44. Join Us • Fansboard • Web Designer (php & JavaScript) • Editor w/ facebook & instagram • Vpon - Data Scientist • Taiwan Spark User Group • Taiwan R User Group
  45. 45. Thanks for your attention & Taiwan Spark User Group & Vpon Data Team
