
SparkR + Zeppelin


An introduction to SparkR and using R/SparkR with Apache Zeppelin

Published in: Data & Analytics
  • Use the viewer to check out the notebook without running Zeppelin: https://www.zeppelinhub.com/viewer/
  • The links below are broken. Please see these instead:
    https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
    https://github.com/felixcheung/incubator-zeppelin/tree/r
    https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json


  1. SparkR + Zeppelin
     Seattle Spark Meetup, Sept 9, 2015
     Felix Cheung
  2. Agenda
     • R & SparkR
     • SparkR DataFrame
     • SparkR in Zeppelin
     • What’s next
  3. R
     • A programming language for statistical computing and graphics
     • S (1975)
     • S4: advanced object-oriented features
     • R (1993): S plus lexical scoping
     • Interpreted
     • Matrix arithmetic
     • Comprehensive R Archive Network (CRAN): 7,000+ packages
  4. Fast! • Scalable • Flexible • Statistical! • Interactive • Packages
  5. SparkR
     • R language APIs for Spark and Spark SQL
     • Exposes Spark functionality in an R-friendly DataFrame API
     • Runs as its own REPL, sparkR, or as a standard R package imported in tools like RStudio:
       library(SparkR)
       sc <- sparkR.init()
       sqlContext <- sparkRSQL.init(sc)
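Expanded slightly, a minimal SparkR session of this era looks like the sketch below. It assumes a local Spark 1.5.0 installation with the SparkR package on the library path, and it is not runnable without one (the `master` and `appName` values are illustrative):

```r
library(SparkR)

# Start the SparkR backend and a SQLContext (Spark 1.4/1.5-era API;
# later releases replaced these entry points with sparkR.session()).
sc <- sparkR.init(master = "local[*]", appName = "sparkr-demo")
sqlContext <- sparkRSQL.init(sc)

# Convert a local data.frame (the built-in faithful dataset)
# into a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
head(df)

sparkR.stop()
```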
  6. History
     • Shivaram Venkataraman & Zongheng Yang, AMPLab, UC Berkeley
     • RDD APIs in a standalone package (Jan 2014)
     • Spark SQL and SchemaRDD -> DataFrame
     • Spark 1.4: first Spark release with SparkR APIs
     • Spark 1.5 (today!)
  7. Architecture (diagram: native S4 classes & methods -> RBackend socket -> JVM)
     • A set of native S4 classes and methods that live inside a standard R package
     • A backend, connected over a socket, that passes data structures and method calls to Spark Scala/JVM
     • A collection of “helper” methods written in Scala
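The S4 mechanism behind the first bullet can be sketched in plain R. The class and method names below are hypothetical illustrations of the style, not SparkR's actual internals:

```r
library(methods)

# A minimal S4 class with a generic and a method, mirroring the
# pattern SparkR uses for its DataFrame wrapper (names hypothetical).
setClass("RemoteFrame", representation(ref = "character", ncol = "numeric"))

setGeneric("columnCount", function(x) standardGeneric("columnCount"))
setMethod("columnCount", "RemoteFrame", function(x) x@ncol)

df <- new("RemoteFrame", ref = "frame-1", ncol = 3)
columnCount(df)  # 3
```

In real SparkR, the method bodies do not hold data locally; they forward the call through the backend socket to the JVM.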
  8. Advantages
     • R-like syntax extending the DataFrame API
     • JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution-plan optimization, constant folding, predicate pushdown, and code generation
  9. SparkR DataFrame
     • Spark packages
     • Data Source API
     • Optimizations
     https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
  10. SparkR in Zeppelin
  11. Architecture (diagram: R connected through an R adaptor)
  12. Demo
  13. DIY
     • https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
     • Vagrant + VirtualBox
     • Installs prerequisites: JDK, R, R packages
     • Automatically downloads the Spark 1.5.0 release
     • Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
     • Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
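A sketch of the setup flow, assuming Vagrant and VirtualBox are already installed. The commands follow the standard Vagrant workflow; the Maven invocation is an assumption, so check the branch README for the exact build flags:

```shell
# Clone the Vagrant project and bring up the VM; provisioning installs
# the JDK, R, and R packages, and downloads the Spark 1.5.0 release.
git clone https://github.com/felixcheung/vagrant-projects.git
cd vagrant-projects/SparkR-Zeppelin
vagrant up

# Build Zeppelin from the R-enabled branch
# (build flags here are an assumption; see the branch README).
git clone -b r https://github.com/felixcheung/incubator-zeppelin.git
cd incubator-zeppelin
mvn clean package -DskipTests
```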
  14. Native R (extracted from the demo)
  15. Native R and dplyr... similarly, SparkR DataFrame… (extracted from the demo)
  16. SparkR DataFrame… (extracted from the demo)
  17. What’s new
     • Zeppelin: run with a provided Spark (SPARK_HOME)
     • Spark 1.5.0 release
     • New SparkR APIs
  18. SparkR in Spark 1.5.0
     Get this today:
     • R formula
     • Machine learning like GLM:
       model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
     • More R-like:
       df[df$age %in% c(19, 30), 1:2]
       transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
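SparkR's glm mirrors base R's, so the same formula runs locally on the built-in iris dataset. The only difference is the column names: SparkR replaces the dots (Sepal.Length) with underscores (Sepal_Length):

```r
# Base-R analog of the SparkR GLM call on the slide, using the
# built-in iris data.frame instead of a distributed DataFrame.
model <- glm(Sepal.Length ~ Sepal.Width + Species, data = iris, family = "gaussian")

# Four coefficients: intercept, Sepal.Width, and two Species dummy levels
coef(model)
length(coef(model))  # 4
```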
  19. Zeppelin
     • Stay tuned! More to come with R/SparkR
     • Lots of updates in the upcoming 0.5.x/0.6.0 release
  20. Questions?
     https://github.com/felixcheung
     LinkedIn: http://linkd.in/1OeZDb7
     Blog: http://bit.ly/1E2z6OI
  21. subset
     # Columns can be selected using `[[` and `[`
     df[[2]] == df[["age"]]
     df[,2] == df[,"age"]
     df[,c("name", "age")]
     # Or to filter rows
     df[df$age > 20,]
     # A DataFrame can be subset on both rows and columns
     df[df$name == "Smith", c(1,2)]
     df[df$age %in% c(19, 30), 1:2]
     subset(df, df$age %in% c(19, 30), 1:2)
     subset(df, df$age %in% c(19), select = c(1,2))
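The point of the slide is that SparkR mirrors base R's subsetting, so the identical expressions work on a local data.frame. The df below is a small hypothetical example, not data from the talk:

```r
# A local data.frame standing in for a SparkR DataFrame (hypothetical data)
df <- data.frame(name = c("Smith", "Jones", "Lee"),
                 age  = c(19, 30, 45),
                 stringsAsFactors = FALSE)

df[, c("name", "age")]                  # column selection
df[df$age > 20, ]                       # row filter: Jones and Lee
df[df$name == "Smith", c(1, 2)]         # rows and columns together
subset(df, df$age %in% c(19, 30), 1:2)  # the same, via subset()
```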
  22. Transform/mutate
     newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
     newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
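transform() comes straight from base R, so the second call on the slide works unchanged on a local data.frame; SparkR adds mutate() as an alias for the same operation. The df below is a hypothetical example:

```r
# Local data.frame standing in for a SparkR DataFrame (hypothetical data)
df <- data.frame(col1 = c(10, 20, 30))

# Same call as on the slide, on a plain data.frame. Bare column names
# also work inside transform(): transform(df, newCol = col1 / 5)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
newDF2$newCol   # 2 4 6
newDF2$newCol2  # 20 40 60
```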
