Big Data Analytics with SparkR

SparkR, an R package that provides an interface to Apache Spark, leverages Spark's distributed computation engine to enable exploration, data analysis, and data science on big data sets. In this session, we will introduce SparkR. We will also demonstrate how to get started using SparkR for interactively querying data, building predictive user-defined functions, and running large-scale machine learning with MLlib.

Published in: Data & Analytics

  1. PASS Business Analytics Marathon, December 14
     Big Data Analytics with SparkR
     Jen Underwood, Founder, Impact Analytix, LLC
  2. PASS Virtual Chapters for Business Analytics
     • Free online learning
     • www.sqlpass.org/vc
  3. Jen Underwood, Founder of Impact Analytix
     jenunderwood.com • twitter.com/idigdata • linkedin.com/in/idigdata
     • Keeps a constant pulse on industry trends
     • IBM Analytics Insider, SAS contributor, past Tableau Zen Master, active in community
     • Former product team roles at Microsoft
     • Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee
     • Post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego
  5. Agenda
     • Overview of Apache Spark
     • Technical Architecture
     • Introduction to SparkR
     • Interactively Querying Data
     • Building Predictive Models
     • Q&A
  6. Apache Spark for Big Data Analytics
     • Fast, scalable, efficient analysis of big data
     • Spark apps can run up to 100 times faster than Hadoop MapReduce in memory and 10 times faster on disk
     • The open source framework is growing faster than Hadoop
     • Most active open source project in big data
     Source: Indeed
  7. Overview of Apache Spark
     See: http://www.jenunderwood.com/2016/10/16/spark-big-data-analytics-part-1/
  8. Overview of Apache Spark
     • Spark SQL is a module for working with structured data
     • Spark Streaming is used for processing real-time streaming data
     • Spark MLlib is a machine learning library
       • Classification, regression, clustering, collaborative filtering and dimensionality reduction
       • Prescriptive optimization primitives
     • Spark GraphX is an API to simplify graph analytics tasks
     • Expanding ecosystem of libraries including but not limited to Zeppelin, SnappyData, CaffeOnSpark, Spark Cassandra, BlinkDB and SparkR
  9. Spark SQL Architecture
  10. Spark Resilient Distributed Datasets (RDDs)
      • RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel
      • Created by "transformations" with map, filter, and groupBy
      • Persist in memory for rapid reuse and can overflow to disk
  11. Spark DataFrames
      • Abstraction to manipulate Spark RDDs
      • Conceptually like a database table
      • Similar to R and Python data frames
      • sqlContext -> DataFrames
      • Not Hadoop HDFS dependent; integrates and coexists with a wide range of data storage
  12. Getting Started
      • Create a Spark cluster
      • Often pre-created on various platforms
      • Look for Analytics Notebook options
      • RStudio is also bundled into some offerings
  14. Demo
  15. SparkR: a Spark API
      • Lightweight R frontend that allows data scientists to analyze large datasets in Spark
      • Through SQLContext, SparkR can create DataFrames from a wide array of sources, such as structured data files, tables in Hive, external databases, JSON files, Parquet files, distributed file systems (HDFS), cloud storage (S3), existing RDDs, and third-party data formats
      • Through HiveContext, SparkR can access tables from the Hive metastore
  17. sparklyr: R interface for Apache Spark
      • Open-source R package
      • Integrated with RStudio IDE
      • Extensible foundation for Spark + R
      • dplyr back-end for Spark
      • Orchestrate machine learning from R using either Spark MLlib or H2O Sparkling Water
  19. sparklyr: R interface for Apache Spark
      • Easy installation via devtools
      • Loads data into Spark DataFrames from local R data frames, Hive tables, CSV, JSON, and Parquet files
      • Connects to both local instances of Spark and remote Spark clusters
      Source: http://spark.rstudio.com/index.html
  20. Data Processing and Modeling
      • Supports selections: select(), filter()
      • Aggregation and sorting: summarize(), arrange()
      • Applying UDFs on partitions starting with SparkR 2.0
      • Uses MLlib to fit linear models over Spark DataFrames
      • Regression Model
      • Naive Bayes Model
      • K-means Model
  21. Data Analysis and Visualization
      • Use dplyr for data manipulation

          library(dplyr)

          # use standard verbs to filter rows and select columns
          select(
            filter(my_tbl, Petal_Width < 0.3),
            Petal_Length, Petal_Width
          )

          # use magrittr pipes for a cleaner syntax
          my_tbl %>%
            filter(Petal_Width < 0.3) %>%
            select(Petal_Length, Petal_Width)
  22. Interactively Querying Data
      • Spark SQL, the successor to the Shark project, supports queries similar to those of columnar database engines
      • sqlContext for analyzing data
      • Note: Spark SQL does not support DELETE
      • Often used with data discovery tools like Tableau, Qlik, TIBCO Spotfire, etc. via JDBC/ODBC connectivity
  23. Interactively Querying Data
  24. See: http://spark.rstudio.com/examples.html
  25. See: https://github.com/trestletech/user2016-sparklyr
  26. See: https://beta.rstudioconnect.com/content/1813/babynames-dplyr.nb.html
  27. Demo
  28. Building Predictive Models
      • Functions for machine learning
      • ml_*: Machine learning algorithms for analyzing data, provided by the spark.ml package
        • K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron
      • ft_*: Feature transformers for manipulating individual features
      • sdf_*: Functions for manipulating Spark DataFrames
  29. Spark MLlib
  30. sparklyr + H2O.ai
      • rsparkling: R interface to the Sparkling Water Spark package
      • A sparklyr extension (docs at http://spark.rstudio.com/h2o.html)
      • Spark DataFrames -> H2OFrames
      • Apply H2O machine learning algorithms on the data
      • sdf_mutate is a sparklyr command that accesses the Spark ML API
  31. Machine Learning with rsparkling
      • Perform SQL queries through the sparklyr dplyr interface
      • Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
      • Convert your training, validation and/or test data frames into H2O Frames using the as_h2o_frame function
      • Choose an appropriate H2O machine learning algorithm to model your data
      • Inspect the quality of your model fit, and use it to make predictions with new data
  33. See: http://sparkdemo.rstudio.com/dashboards/nycflights13-dash-spark/#summary
  34. Demo
  35. Classifiers using sparklyr
      See: https://beta.rstudioconnect.com/content/1518/notebook-classification.html
  36. Classifiers using sparklyr
      See: https://beta.rstudioconnect.com/content/1518/notebook-classification.html
  37. Score and Compare Models
      See: https://beta.rstudioconnect.com/content/1518/notebook-classification.html
  38. Demo
  40. Free Tutorials from KDD 2016
      • Presentations
      • Hands-on tutorials
      • Sample data sets
      • Download at http://tinyurl.com/KDD2016R
  41. More Awesome Resources
      • Spark Jump Start
      • A Gentle Introduction to Apache Spark
      • Apache Spark for Data Scientists
  42. Summary
      • Overview of Apache Spark
      • Technical Architecture
      • Introduction to SparkR
      • Interactively Querying Data
      • Building Predictive Models
      • Q&A
  43. Questions?
  44. Join PASS
      • PASS is a not-for-profit organization which offers year-round learning opportunities to data professionals
      • Access to online training and content
      • Enjoy discounted event rates
      • Join Local Chapters and Virtual Chapters
      • Get advance notice of member exclusives
      • Membership is free, join today at www.sqlpass.org
  45. Coming up next: Disrupt the static nature of BI with Predictive Anomaly Detection, Uri Maoz
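
The dplyr and Spark SQL workflow described on slides 19 to 22 can be sketched end to end as follows. This is a minimal illustration, not the presenter's demo: it assumes the sparklyr, dplyr, and DBI packages plus a local Spark installation (for example one set up with `sparklyr::spark_install()`), and the table name `iris` and the variables `sc`, `small_petals`, and `by_species` are chosen here for illustration only.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Connect to a local Spark instance (assumes Spark is installed locally).
sc <- spark_connect(master = "local")

# Copy a local R data frame into Spark. sparklyr replaces the dots in the
# iris column names with underscores, so Petal.Width becomes Petal_Width.
my_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed inside Spark;
# collect() brings the (small) result back as a local data frame.
small_petals <- my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width) %>%
  collect()

# The same connection also accepts SQL directly through the DBI interface.
by_species <- dbGetQuery(
  sc,
  "SELECT Species, AVG(Petal_Length) AS mean_petal_length
   FROM iris GROUP BY Species"
)

print(head(small_petals))
print(by_species)

spark_disconnect(sc)
```

Calling `collect()` only at the end keeps the filtering and selection inside Spark, so only the reduced result is moved into the R session.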

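The modeling steps outlined on slides 28 and 31 (partition with an sdf_* function, fit with an ml_* function, then predict on new data) might look like the sketch below. It is a hedged example only: it assumes a recent sparklyr release that provides `sdf_random_split()` and `ml_predict()` and a local Spark installation, and the mtcars data set, the formula, and the seed are arbitrary choices for illustration that do not come from the deck.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy a small data set into it.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# sdf_*: split the Spark DataFrame into training and test partitions.
partitions <- sdf_random_split(
  mtcars_tbl,
  training = 0.75, test = 0.25,
  seed = 1099
)

# ml_*: fit a linear regression with Spark MLlib.
fit <- ml_linear_regression(partitions$training, mpg ~ wt + cyl)

# Score the held-out rows; Spark appends a "prediction" column.
predictions <- ml_predict(fit, partitions$test) %>%
  select(mpg, prediction) %>%
  collect()

print(predictions)

spark_disconnect(sc)
```

Comparing the `mpg` and `prediction` columns gives a quick check of fit quality before trying other ml_* algorithms on the same partitions.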