Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics


Published on

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at use case that provides real-world experience. Finally it will provide suggestions of how enterprises can take advantage of both of these industry-leading technologies.

Published in: Technology
1 Comment
No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

  1. 1. Revolution AnalyticsLeveraging R in Hadoop EnvironmentsNovember 9, 2011 1
  2. 2. In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
  3. 3. Most advanced statisticalanalysis software available The professor who invented analytic software forHalf the cost of the experts now wants to take it to the massescommercial alternatives2M+ Users Power 3,000+ Applications Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
  4. 4. What’s the Difference Between R andRevolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 4,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: info@revolutionanalytics.com 4
  5. 5. Let’s Talk about R and Hadoop 5
  6. 6. Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
  7. 7. Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
  8. 8. R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
  9. 9. rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
  10. 10. rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file Directory - hdfs.dircreate, hdfs.mkdir Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
  11. 11. rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
  12. 12. rhbase Functions Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
  13. 13. Writing MapReduce programs in R 13
  14. 14. rmr - For R Programmers• A way to access big data sets• A simple way to write parallel programs – everyone will have to• Very R-like, building on the functional characteristics of R• Just a library 14
  15. 15. rmr – For MapReduce Developers• Much simpler than writing Java• Not as simple as Hive, Pig at what they do, but more general• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
  16. 16. rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
  17. 17. Some Simple ThingsExample showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
  18. 18. More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v[userid], v[gender]), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
  19. 19. Complex Thingsk-means Clustering 19
  20. 20. k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
  21. 21. k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
  22. 22. k-means - OptimizationsSlow Fast Notesfor(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary[ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per recorddistance(center, point) norm(center - P) compute all distances with fast matrix operationscombiner = FALSE combiner = TRUE reduce often and early, use combinerkeyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means 24
  23. 23. Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
  24. 24. Resources RHadoop Open source project: https://github.com/RevolutionAnalytics/RHa doop/wiki Revolution R Enterprise: bit.ly/Enterprise-R Cloudera CDH: http://www.cloudera.com/hadoop/ Email: rhadoop@revolutionanalytics.com 26
  25. 25. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 27