Enabling exploratory data science with Spark and R

R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In that mode, most of the friction is at the interface between R and the other systems: for example, when data is sampled by a big data platform, the results need to be transferred to R and imported as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. We will present an overview of the SparkR architecture, including how data and control are transferred between R and the JVM; this knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real, large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R, and interactive notebook environments such as Jupyter or Databricks facilitate exploratory analysis of large data.

Published in: Software

  1. Enabling Exploratory Data Science with Spark and R - Shivaram Venkataraman, Hossein Falaki (@mhfalaki)
  2. About Apache Spark, AMPLab and Databricks. Apache Spark is a general distributed computing engine that unifies: • Real-time streaming (Spark Streaming) • Machine learning (Spark ML/MLlib) • SQL (Spark SQL) • Graph processing (GraphX). AMPLab (Algorithms, Machines, and People Lab) at UC Berkeley is where Spark and SparkR were originally developed. Databricks Inc. is the company founded by the creators of Spark, focused on making big data simple by offering an end-to-end data processing platform in the cloud.
  3. What is R? A language and runtime. The cornerstone of R is the data frame concept.
  4. Many data scientists love R • Open source • Highly dynamic • Interactive environment • Rich ecosystem of packages • Powerful visualization infrastructure • Data frames make data manipulation convenient • Taught by many schools to stats and computing students
  5. Performance limitations of R • R language: R’s dynamic design imposes restrictions on optimization • R runtime: single threaded; everything has to fit in memory
  6. What would be ideal? Seamless manipulation and analysis of very large data in R • R’s flexible syntax • R’s rich package ecosystem • R’s interactive environment • Scalability (scale up and out) • Integration with distributed data sources/storage
  7. Augmenting R with other frameworks. In practice data scientists use R in conjunction with other frameworks (Hadoop MR, Hive, Pig, relational databases, etc.). Typical workflow: Framework X (Language Y) over distributed storage: 1. Load, clean, transform, aggregate, sample 2. Save to local storage 3. Read and analyze in R; then iterate.
  8. What is SparkR? An R package distributed with Apache Spark: • Provides an R frontend to Spark • Exposes Spark DataFrames (inspired by R and Pandas) • Convenient interoperability between R and Spark DataFrames. Spark contributes distributed/robust processing, data sources, and off-memory data structures; R contributes a dynamic environment, interactivity, packages, and visualization.
  9. How does SparkR solve our problems? • No local storage involved • Write everything in R • Use Spark’s distributed cache for interactive/iterative analysis at the speed of thought. (Diagram from slide 7 repeated: Framework X (Language Y), distributed storage, 1. load/clean/transform/aggregate/sample, 2. save to local storage, 3. read and analyze in R, iterate.)
  10. Example SparkR program
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")
# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, df$code), num = count(df$code))
# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = "bar", stat = "identity") + coord_flip()
  11. SparkR architecture (diagram: the Spark driver consists of an R process and a JVM connected through the R backend; the driver JVM coordinates JVM workers, which read from data sources)
  12. Overview of SparkR API • IO: read.df / write.df, createDataFrame / collect • Caching: cache / persist / unpersist, cacheTable / uncacheTable • Utility functions: dim / head / take, names / rand / sample / ... • MLlib: glm / predict • DataFrame API: select / subset / groupBy, head / showDF / unionAll, agg / avg / column / ... • SQL: sql / table / saveAsTable, registerTempTable / tables
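As a rough illustration of the API surface listed above, here is a minimal SparkR session written against the Spark 1.5-era API; the local master setting and the built-in faithful dataset are assumptions for the example, not part of the slides:

library(SparkR)
sc <- sparkR.init(master = "local[*]")        # assumed local Spark context for experimenting
sqlContext <- sparkRSQL.init(sc)

# IO: turn a local R data.frame into a distributed Spark DataFrame
df <- createDataFrame(sqlContext, faithful)

# Caching and utility functions
cache(df)
dim(df)
head(df, 3)

# DataFrame API: filter, group, aggregate, sort
long  <- subset(df, df$eruptions > 3)
stats <- agg(groupBy(long, long$waiting), num = count(long$waiting))
head(arrange(stats, desc(stats$num)))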
  13. Moving data between R and JVM (diagram: SparkR::createDataFrame() sends data from R to the JVM, and SparkR::collect() brings it back from the JVM to R, through the R backend)
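A small sketch of the two directions in this diagram, assuming a running sqlContext as in the earlier example: createDataFrame() ships local R data to the JVM, and collect() brings a Spark DataFrame back as a plain R data.frame.

# R -> JVM: a local data.frame becomes a distributed Spark DataFrame
localDF <- data.frame(code = c("200", "404", "500"), num = c(120, 14, 3))
sparkDF <- createDataFrame(sqlContext, localDF)

# JVM -> R: results come back as a native R data.frame
backInR <- collect(sparkDF)
class(backInR)   # "data.frame"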
  14. Moving data between R and JVM (diagram: read.df() and write.df() move data between the JVM workers and distributed storage such as HDFS/S3/... or FUSE mounts, without routing it through the R process)
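In contrast to collect()/createDataFrame(), read.df() and write.df() keep large data on the cluster side: the workers read from and write to distributed storage directly, and only what you explicitly collect() crosses into R. A sketch reusing the path from slide 10 (the Parquet output path is an assumption):

# Workers load JSON from HDFS in parallel; nothing is pulled into the R process
logs <- read.df(sqlContext, "hdfs://bigdata/logs", source = "json")

# Workers write the (possibly transformed) data back out, here as Parquet
write.df(logs, path = "hdfs://bigdata/logs_parquet", source = "parquet", mode = "overwrite")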
  15. Moving between languages (R and Scala over the same Spark tables)
R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")
Scala:
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) => text.split(' ') }
val model = KMeans.train(parsed)
  16. Mixing R and SQL. Pass a query to the SQLContext and get the result back as a DataFrame.
# Register DataFrame as a table
registerTempTable(df, "dataTable")
# Complex SQL query; the result is returned as another DataFrame
aggCount <- sql(sqlContext, "SELECT count(*) AS num, type, date FROM dataTable GROUP BY type, date ORDER BY date DESC")
qplot(date, num, data = collect(aggCount), geom = "line")
  17. SparkR roadmap and upcoming features • Exposing MLlib functionality in SparkR: GLM is already exposed with R formula support (see the sketch below) • UDF support in R: distribute a function and data; the ideal way to distribute existing R functionality and packages • Complete DataFrame API so it behaves/feels just like data.frame
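A minimal sketch of the GLM support that is already exposed (SparkR 1.5 formula interface); the iris example and column names are assumptions for illustration, and note that SparkR replaces dots in column names with underscores:

irisDF <- createDataFrame(sqlContext, iris)    # "Sepal.Length" becomes "Sepal_Length", etc.
model  <- glm(Sepal_Length ~ Sepal_Width + Species, data = irisDF, family = "gaussian")
summary(model)                                 # coefficients of the fitted model
preds  <- predict(model, newData = irisDF)
head(select(preds, "Sepal_Length", "prediction"))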
  18. Example use case: exploratory analysis • Data pipeline implemented in Scala/Python • New files are appended to existing data partitioned by time • Table schema is saved in the Hive metastore • Data scientists use SparkR to analyze and visualize the data: 1. refreshTable(sqlContext, "logsTable") 2. logs <- table(sqlContext, "logsTable") 3. Iteratively analyze/aggregate/visualize using Spark & R DataFrames 4. Publish/share results (a sketch of one iteration follows below)
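A hedged sketch of one pass through this loop, reusing the sqlContext and the error-count pattern from slide 10; the table and column names follow the slides, and the ggplot2 plotting call is an assumption:

library(ggplot2)

# Pick up files appended since the last run, then re-bind the table
refreshTable(sqlContext, "logsTable")
logs <- table(sqlContext, "logsTable")

# Aggregate on the cluster, collect only the small result, and plot it locally
errors <- subset(logs, logs$type == "error")
daily  <- collect(agg(groupBy(errors, errors$date), num = count(errors$date)))
qplot(date, num, data = daily, geom = "line")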
  19. Demo
  20. How to get started with SparkR? • On your computer: 1. Download the latest version of Spark (1.5.2) 2. Build it (Maven or sbt) 3. Run ./install-dev.sh inside the R directory 4. Start the R shell by running ./bin/sparkR • Or deploy Spark (1.4+) on your cluster • Or sign up for a 14-day free trial at Databricks (a minimal session sketch follows below)
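When started via ./bin/sparkR, the shell pre-creates the Spark and SQL contexts for you; from a plain R session you can set them up by hand. A minimal sketch against a 1.5.x source build, where the SPARK_HOME environment variable and the local master setting are assumptions:

library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sc <- sparkR.init(master = "local[*]", appName = "SparkR-getting-started")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)    # sanity check with a built-in R dataset
head(df)

sparkR.stop()                                  # shut the backend down when done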
  21. Summary 1. SparkR is an R frontend to Apache Spark 2. Distributed data resides in the JVM 3. Workers are not running R processes (yet) 4. There is a distinction between Spark DataFrames and R data frames
  22. Further pointers: http://spark.apache.org • http://www.r-project.org • http://www.ggplot2.org • https://cran.r-project.org/web/packages/magrittr • www.databricks.com • Office hour: 13:00-14:00 at the Databricks booth
  23. Thank you