Your SlideShare is downloading. ×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics


Published on

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by …

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at use case that provides real-world experience. Finally it will provide suggestions of how enterprises can take advantage of both of these industry-leading technologies.

Published in: Technology
1 Comment
No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Revolution AnalyticsLeveraging R in Hadoop EnvironmentsNovember 9, 2011 1
  • 2. In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
  • 3. Most advanced statisticalanalysis software available The professor who invented analytic software forHalf the cost of the experts now wants to take it to the massescommercial alternatives2M+ Users Power 3,000+ Applications Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
  • 4. What’s the Difference Between R andRevolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 4,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: 4
  • 5. Let’s Talk about R and Hadoop 5
  • 6. Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
  • 7. Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
  • 8. R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
  • 9. rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
  • 10. rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush,,, hdfs.tell, hdfs.line.reader, Directory - hdfs.dircreate, hdfs.mkdir Utility -, hdfs.list.files,, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
  • 11. rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
  • 12. rhbase Functions Table Manipulation –, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete,,, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
  • 13. Writing MapReduce programs in R 13
  • 14. rmr - For R Programmers• A way to access big data sets• A simple way to write parallel programs – everyone will have to• Very R-like, building on the functional characteristics of R• Just a library 14
  • 15. rmr – For MapReduce Developers• Much simpler than writing Java• Not as simple as Hive, Pig at what they do, but more general• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
  • 16. rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
  • 17. Some Simple ThingsExample showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
  • 18. More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v[userid], v[gender]), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
  • 19. Complex Thingsk-means Clustering 19
  • 20. k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
  • 21. k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
  • 22. k-means - OptimizationsSlow Fast Notesfor(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary[ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per recorddistance(center, point) norm(center - P) compute all distances with fast matrix operationscombiner = FALSE combiner = TRUE reduce often and early, use combinerkeyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction 24
  • 23. Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
  • 24. Resources RHadoop Open source project: doop/wiki Revolution R Enterprise: Cloudera CDH: Email: 26
  • 25. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. 650.330.0553 Twitter: @RevolutionR 27