SparkR: Enabling Interactive Data Science at Scale on Hadoop
Transcript

  • 1. SparkR: Enabling Interactive Data Science at Scale on Hadoop. Shivaram Venkataraman, Zongheng Yang
  • 2. Fast, Scalable, Flexible
  • 3. Statistics, Packages, Plots
  • 4. Fast, Scalable, Flexible + Statistics, Plots, Packages
  • 5. Stack: SparkR on Spark (Data Processing); Mesos / YARN (Cluster Manager); HDFS / HBase / Cassandra (Storage)
  • 6. Outline: SparkR API, Live Demo, Design Details
  • 7. RDD: a parallel collection. Transformations: map, filter, groupBy, … Actions: count, collect, saveAsTextFile, …
  • 8. Q: How can I use a loop to [...insert task here...]? A: Don't. Use one of the apply functions. From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
  • 9. R + RDD = R2D2
  • 10. R + RDD = RRDD: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, …, plus broadcast, includePackage, textFile, parallelize (sketch below)
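    A minimal sketch of how these primitives compose, assuming the alpha package's sparkR.init() entry point and a local master (per the amplab-extras/SparkR-pkg README):

      library(SparkR)
      sc <- sparkR.init(master = "local")        # connect to a local Spark instance
      nums <- parallelize(sc, 1:1000, 2L)        # distribute an R vector over 2 slices
      squares <- lapply(nums, function(x) x * x) # lazy transformation on the RDD
      count(squares)                             # action: returns 1000
      head(collect(squares))                     # action: fetch results into local R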
  • 11. Example: Word Count
    lines <- textFile(sc, "hdfs://my_text_file")
  • 12. Example: Word Count (continued)
    words <- flatMap(lines, function(line) {
      strsplit(line, " ")[[1]]
    })
    wordCount <- lapply(words, function(word) {
      list(word, 1L)
    })
  • 13. Example: Word Count (continued)
    counts <- reduceByKey(wordCount, "+", 2L)
    output <- collect(counts)
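    A post-processing sketch (not from the slides): output is an ordinary local R list of list(word, count) pairs, so base R can rank it directly:

      freq <- sapply(output, function(pair) pair[[2]])        # extract the counts
      top5 <- output[order(freq, decreasing = TRUE)][1:5]     # five most frequent words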
  • 14. Demo
  • 15. MNIST
  • 16. Minimize ||Ax − b||^2 over x; closed-form solution x = (A^T A)^{-1} A^T b
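    A sketch of this solve in the RDD API from the earlier slides (not the demo's actual code; the RDD name and partition layout are assumptions): each partition contributes partial sums for A^T A and A^T b, reduce() adds them, and the small D x D system is solved in local R.

      # Assumption: each RDD element is a numeric row with the label b in
      # column 1 and the features A in the remaining columns.
      partials <- lapplyPartition(pointsRDD, function(part) {
        mat <- do.call(rbind, part)          # partition rows -> local matrix
        A <- mat[, -1, drop = FALSE]
        b <- mat[, 1]
        list(list(t(A) %*% A, t(A) %*% b))   # this partition's partial sums
      })
      sums <- reduce(partials, function(u, v) list(u[[1]] + v[[1]], u[[2]] + v[[2]]))
      x <- solve(sums[[1]], sums[[2]])       # solve (A^T A) x = A^T b locally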
  • 17. How does this work?
  • 18. Dataflow: a local driver machine plus worker machines.
  • 19. Dataflow: an R process runs on the local driver.
  • 20. Dataflow: the driver's R Spark Context drives a Java Spark Context through JNI.
  • 21. Dataflow: on each worker, the Spark Executor launches (exec) an R process to run user functions.
  • 22. From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
  • 23. Dataflow (recap): R Spark Context → Java Spark Context over JNI on the driver; Spark Executors exec R processes on the workers.
  • 24. Pipelined RDD: words <- flatMap(lines, …); wordCount <- lapply(words, …). In the diagram, each Spark Executor execs a single R process for both steps.
  • 25. Pipelined RDD: consecutive R transformations are fused, so each executor launches one R process per task instead of one per transformation.
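    What pipelining means for the word-count code above (a behavioral sketch; the PipelinedRDD design mirrors PySpark's):

      words <- flatMap(lines, function(line) strsplit(line, " ")[[1]])
      wordCount <- lapply(words, function(word) list(word, 1L))
      # The two R functions are fused: when an action such as collect() runs,
      # each executor execs one R process per task and applies the composed
      # function, rather than paying one R launch per transformation.
      output <- collect(wordCount)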
  • 26. Alpha developer release. One-line install:
    install_github("amplab-extras/SparkR-pkg", subdir="pkg")
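    install_github() is provided by the devtools package; a sketch of a first session after installing (sparkR.init() per the package README):

      install.packages("devtools")
      library(devtools)
      install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
      library(SparkR)
      sc <- sparkR.init(master = "local")  # or a spark:// / mesos:// cluster URL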
  • 27. SparkR implementation: very similar to PySpark; Spark is easy to extend. 292 lines of Scala code, 1694 lines of R code, 549 lines of R test code.
  • 28. Also on GitHub: EC2 setup scripts, all Spark examples, the MNIST demo, a Hadoop2 / Maven build.
  • 29. On the roadmap: a high-level DataFrame API; integrating Spark's MLlib from R; reading from SequenceFiles and HBase.
  • 30. SparkR: combine scalability & utility. RDD → distributed lists. Serialize closures. Re-use R packages.
  • 31. SparkR: https://github.com/amplab-extras/SparkR-pkg. Shivaram Venkataraman: shivaram@cs.berkeley.edu. Spark user mailing list: user@spark.apache.org
  • 32. Example: Logistic Regression
    pointsRDD <- textFile(sc, "hdfs://myfile")
    weights <- runif(n = D, min = -1, max = 1)   # D: number of features (assumed defined)

    # Logistic gradient over one partition (assumes the partition is a
    # numeric matrix: labels Y in column 1, features X in the rest)
    gradient <- function(partition) {
      X <- partition[, -1]; Y <- partition[, 1]
      t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
    }
  • 33. Example: Logistic Regression (continued)
    # Iterate: one gradient step over the full dataset
    weights <- weights - reduce(
      lapplyPartition(pointsRDD, gradient), "+")
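    The slide shows a single update; a sketch of the full training loop (the iteration count is an assumption). Because gradient captures weights from the driver's environment, the closure is re-serialized with the updated weights on every pass:

      for (i in 1:10) {                  # hypothetical iteration count
        weights <- weights - reduce(
          lapplyPartition(pointsRDD, gradient), "+")
      }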
  • 34. How does it work? The R shell drives a Spark Context through rJava; on each worker, the Spark Executor launches an RScript process. Data moves as RDD[Array[Byte]] and the user's function is shipped as a serialized Array[Byte].
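    A sketch of the mechanism behind "Func: Array[Byte]", using only base R (the package's internal helpers are not shown): a closure becomes raw bytes via serialize(), crosses into the JVM, and the worker-side RScript rebuilds it with unserialize().

      f <- function(x) x * 2L                  # an arbitrary user closure
      bytes <- serialize(f, connection = NULL) # raw vector: the Array[Byte] payload
      g <- unserialize(bytes)                  # what the worker-side R process does
      g(21L)                                   # 42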