SparkR: Enabling Interactive Data Science at Scale on Hadoop

  1. SparkR: Enabling Interactive Data Science at Scale on Hadoop. Shivaram Venkataraman, Zongheng Yang
  2. Fast! Scalable. Flexible.
  3. Statistics! Packages. Plots.
  4. Fast! Scalable. Flexible. Statistics! Plots. Packages.
  5. Stack diagram. Storage: HDFS / HBase / Cassandra. Cluster Manager: Mesos / YARN. Data Processing: Spark, SparkR.
  6. Outline: SparkR API, Live Demo, Design Details
  7. RDD: Parallel Collection. Transformations: map, filter, groupBy, … Actions: count, collect, saveAsTextFile, …
  8. Q: How can I use a loop to [...insert task here...]? A: Don't. Use one of the apply functions. From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
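      For readers new to the idiom, here is a minimal base-R sketch of replacing an explicit loop with an apply-style function (toy data, purely illustrative):

          # Square every element of a vector: loop vs. apply style
          xs <- c(1, 2, 3, 4)

          # Loop version (discouraged in idiomatic R)
          squares <- numeric(length(xs))
          for (i in seq_along(xs)) {
            squares[i] <- xs[i]^2
          }

          # Apply version: lapply returns a list with one result per element
          squares <- unlist(lapply(xs, function(x) x^2))

      SparkR carries this idiom over: lapply on an RDD is the distributed analogue of lapply on a list.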
  9. R + RDD = R2D2
  10. R + RDD = RRDD: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, …; plus broadcast, includePackage, textFile, parallelize
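      A rough sketch of how a few of these primitives compose, assuming the alpha release's broadcast/value API behaves like its PySpark counterpart (illustrative, not from the slides):

          # Illustrative composition of the primitives listed above
          rdd <- parallelize(sc, 1:100, 2L)    # distribute a local vector over 2 partitions

          offset <- broadcast(sc, 10)          # read-only value shipped to every worker

          shifted <- lapply(rdd, function(x) {
            x + value(offset)                  # value() dereferences the broadcast variable
          })

          sums <- lapplyPartition(shifted, function(part) {
            sum(unlist(part))                  # one partial sum per partition
          })

          collect(sums)                        # gather the partial sums on the driver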
  11. Example: Word Count
      lines <- textFile(sc, "hdfs://my_text_file")
  12. Example: Word Count
      lines <- textFile(sc, "hdfs://my_text_file")
      words <- flatMap(lines,
                       function(line) {
                         strsplit(line, " ")[[1]]
                       })
      wordCount <- lapply(words,
                          function(word) {
                            list(word, 1L)
                          })
  13. Example: Word Count
      lines <- textFile(sc, "hdfs://my_text_file")
      words <- flatMap(lines,
                       function(line) {
                         strsplit(line, " ")[[1]]
                       })
      wordCount <- lapply(words,
                          function(word) {
                            list(word, 1L)
                          })
      counts <- reduceByKey(wordCount, "+", 2L)
      output <- collect(counts)
  14. Demo
  15. MNIST
  16. Minimize ||Ax − b||²; closed-form solution: x = (AᵀA)⁻¹ Aᵀ b
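      A local R sketch of the closed-form solution above, on toy data (solve() is used instead of an explicit matrix inverse for numerical stability):

          # Closed-form least squares on toy data
          set.seed(42)
          A <- matrix(rnorm(100 * 5), nrow = 100)   # 100 x 5 design matrix
          b <- rnorm(100)                           # response vector

          # Solve (A^T A) x = A^T b rather than forming the inverse explicitly
          x <- solve(t(A) %*% A, t(A) %*% b)

      On an RDD, one could compute the small matrices t(A) %*% A and t(A) %*% b per partition with lapplyPartition and sum them with reduce, which is presumably how the MNIST demo scales this up.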
  17. How does this work?
  18. Dataflow diagram: a local driver and two workers.
  19. Dataflow diagram: R runs on the local driver.
  20. Dataflow diagram: the local R process holds an R Spark Context, which drives a Java Spark Context via JNI.
  21. Dataflow diagram: on each worker, the Spark Executor launches (exec) an R process to run user code.
  22. From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
  23. Dataflow diagram (recap): R Spark Context → JNI → Java Spark Context; each Spark Executor exec's an R process.
  24. Pipelined RDD
      words <- flatMap(lines, …)
      wordCount <- lapply(words, …)
      Diagram: naively, each executor would exec a separate R process per R transformation.
  25. Pipelined RDD
      Diagram: consecutive R transformations are pipelined into a single R process per executor, avoiding a separate exec per step.
  26. Alpha developer release. One-line install:
      install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
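      After installing, a session would start roughly like this (sparkR.init is the package's entry point; the local master URL and the toy RDD are illustrative assumptions):

          library(SparkR)

          # "local" runs Spark in-process, which is enough to try the API
          sc <- sparkR.init(master = "local")

          rdd <- parallelize(sc, 1:10)
          count(rdd)   # should return 10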
  27. SparkR implementation: very similar to PySpark; Spark is easy to extend. 292 lines of Scala code, 1694 lines of R code, 549 lines of test code in R.
  28. EC2 setup scripts. All Spark examples. MNIST demo. Hadoop 2, Maven build. All on GitHub.
  29. On the roadmap: a high-level DataFrame API, integrating Spark's MLlib from R, reading from SequenceFiles and HBase.
  30. SparkR: combine scalability & utility. RDDs → distributed lists. Serialize closures. Re-use R packages.
  31. SparkR: https://github.com/amplab-extras/SparkR-pkg
      Shivaram Venkataraman: shivaram@cs.berkeley.edu
      Spark user mailing list: user@spark.apache.org
  32. Example: Logistic Regression
      pointsRDD <- textFile(sc, "hdfs://myfile")
      weights <- runif(n = D, min = -1, max = 1)   # D: number of features

      # Logistic gradient over one partition (labels in column 1, features in the rest)
      gradient <- function(partition) {
        Y <- partition[, 1]; X <- partition[, -1]
        t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
      }
  33. Example: Logistic Regression
      pointsRDD <- textFile(sc, "hdfs://myfile")
      weights <- runif(n = D, min = -1, max = 1)   # D: number of features

      # Logistic gradient over one partition (labels in column 1, features in the rest)
      gradient <- function(partition) {
        Y <- partition[, 1]; X <- partition[, -1]
        t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
      }

      # Iterate
      weights <- weights - reduce(
        lapplyPartition(pointsRDD, gradient), "+")
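      The slide shows a single gradient step; a full run would wrap it in a driver-side loop (the values of D and the iteration count here are illustrative assumptions):

          D <- 784            # assumed feature count (e.g., MNIST pixels)
          iterations <- 10    # assumed iteration count

          weights <- runif(n = D, min = -1, max = 1)
          for (i in 1:iterations) {
            # gradient references weights, so the closure shipped to the
            # workers picks up the updated weights on every iteration
            weights <- weights - reduce(
              lapplyPartition(pointsRDD, gradient), "+")
          }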
  34. How does it work? Diagram: R Shell → rJava → Spark Context → Spark Executors → RScript processes. Data is stored as RDD[Array[Byte]]; functions are shipped as Array[Byte].
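      The Array[Byte] representation falls out of R's built-in serializer; a minimal local illustration, independent of Spark:

          # R closures and data become raw byte vectors via serialize();
          # on the JVM side these arrive as Array[Byte].
          f <- function(x) x + 1

          bytes <- serialize(f, connection = NULL)   # a raw vector of bytes
          g <- unserialize(bytes)                    # reconstruct the function
          g(41)                                      # returns 42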
