• Like

The Powerful Marriage of Hadoop and R (David Champagne)

  • 3,220 views
Uploaded on

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. …

When two of the most powerful innovations in modern analytics come together, the result is revolutionary.

This presentation covers:
- An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.

Presented at Hadoop World 2011 by:

David Champagne
CTO, Revolution Analytics

David Champagne is a top software architect, programmer and product manager with over 20 years experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams and created and led the text mining team.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
3,220
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
139
Comments
0
Likes
4

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Revolution AnalyticsLeveraging R in Hadoop EnvironmentsNovember 9, 2011 1
  • 2. In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
  • 3. Most advanced statisticalanalysis software available The professor who invented analytic software forHalf the cost of the experts now wants to take it to the massescommercial alternatives2M+ Users Power 3,000+ Packages Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
  • 4. What’s the Difference Between R andRevolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 3,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: info@revolutionanalytics.com 4
  • 5. Let’s Talk about R and Hadoop 5
  • 6. Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
  • 7. Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
  • 8. R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
  • 9. rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
  • 10. rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file Directory - hdfs.dircreate, hdfs.mkdir Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
  • 11. rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
  • 12. rhbase Functions Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
  • 13. Writing MapReduce programs in R 13
  • 14. rmr - For R Programmers• A way to access big data sets• A simple way to write parallel programs – everyone will have to• Very R-like, building on the functional characteristics of R• Just a library 14
  • 15. rmr – For MapReduce Developers• Much simpler than writing Java• Not as simple as Hive, Pig at what they do, but more general• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
  • 16. rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
  • 17. Some Simple ThingsExample showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
  • 18. More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v[userid], v[gender]), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
  • 19. Complex Thingsk-means Clustering 19
  • 20. k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
  • 21. k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
  • 22. k-means - OptimizationsSlow Fast Notesfor(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary[ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per recorddistance(center, point) norm(center - P) compute all distances with fast matrix operationscombiner = FALSE combiner = TRUE reduce often and early, use combinerkeyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means 24
  • 23. Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
  • 24. Resources RHadoop Open source project: https://github.com/RevolutionAnalytics/RHa doop/wiki Revolution R Enterprise: bit.ly/Enterprise-R Cloudera CDH: http://www.cloudera.com/hadoop/ Email: rhadoop@revolutionanalytics.com 26
  • 25. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 27