• Like
  • Save
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics
 

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics

on

  • 11,799 views

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by ...

When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at use case that provides real-world experience. Finally it will provide suggestions of how enterprises can take advantage of both of these industry-leading technologies.

Statistics

Views

Total Views
11,799
Views on SlideShare
3,682
Embed Views
8,117

Actions

Likes
16
Downloads
0
Comments
1

26 Embeds 8,117

http://blog.revolutionanalytics.com 2386
http://d.hatena.ne.jp 2277
http://acro-engineer.hatenablog.com 1814
http://www.cloudera.com 1304
http://www.r-bloggers.com 242
http://webcache.googleusercontent.com 24
http://paper.li 13
http://www.newsblur.com 9
http://localhost 6
http://translate.googleusercontent.com 6
http://127.0.0.1 5
http://test.cloudera.com 4
http://revolution-computing.typepad.com 3
http://blog.cloudera.com 3
https://www.facebook.com 3
http://staging.cloudera.com 3
https://www.cloudera.com 3
http://cloudera.louddog.net 2
http://cloudera.brian.dev 2
http://garagekidztweetz.hatenablog.com 2
http://207.46.192.232 1
http://cloudera.matt.dev 1
http://www.facebook.com 1
http://a0.twimg.com 1
http://feeds.feedburner.com 1
https://www.google.co.jp 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics Presentation Transcript

    • Revolution AnalyticsLeveraging R in Hadoop EnvironmentsNovember 9, 2011 1
    • In Today’s Presentation: About Revolution Analytics Why R and Hadoop? The Packages (rhdfs, rhbase, rmr) Examples
    • Most advanced statisticalanalysis software available The professor who invented analytic software forHalf the cost of the experts now wants to take it to the massescommercial alternatives2M+ Users Power 3,000+ Applications Finance Statistics Life Sciences Predictive Manufacturing Analytics Productivity Retail Data Mining Telecom Enterprise Visualization Social Media Readiness Government
    • What’s the Difference Between R andRevolution R Enterprise? Revolution R is 100% R and More® Multi-Threaded Web-Based Web Services Big Data Parallel Math Libraries GUI API Analysis Tools Technical IDE / Developer Support GUI 4,000+ Community Build Packages R Engine Assurance Language Libraries For more information contact: info@revolutionanalytics.com 4
    • Let’s Talk about R and Hadoop 5
    • Why R and Hadoop? Hadoop - a scalable infrastructure for processing massive amounts of data Storage – HDFS, HBASE Distributed Computing - MapReduce R - a statistical programming language Need for more than counts and averages Analyze all of the data 6
    • Motivation for this project Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs Run R on a massively distributed system without having to understand the underlying infrastructure Statisticians stay focused on the analysis Open source 7
    • R and Hadoop – The R Packages Capabilities delivered as individual HBASE R packages HDFS rhdfs - R and HDFS R Thrift rhbase - R and HBASE Map or Reduce rmr - R and MapReduce Task rhbase rhdfs Node Downloads available from R Client Github Job Tracker rmr 8
    • rhdfs Manipulate HDFS directly from R Mimic as much of the HDFS Java API as possible Examples: Read a HDFS text file into a data frame. Serialize/Deserialize a model to HDFS Write an HDFS file to local storage rhdfs/pkg/inst/unitTests rhdfs/pkg/inst/examples 9
    • rhdfs Functions File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file Directory - hdfs.dircreate, hdfs.mkdir Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists Initialization – hdfs.init, hdfs.defaults 10
    • rhbase Manipulate HBASE tables and their content Uses Thrift C++ API as the mechanism to communicate to HBASE Examples Create a data frame from a collection of rows and columns in an HBASE table Update an HBASE table with values from a data frame rhbase/pkg/inst/unitTests 11
    • rhbase Functions Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan Utility - hb.list.tables Initialization - hb.defaults, hb.init 12
    • Writing MapReduce programs in R 13
    • rmr - For R Programmers• A way to access big data sets• A simple way to write parallel programs – everyone will have to• Very R-like, building on the functional characteristics of R• Just a library 14
    • rmr – For MapReduce Developers• Much simpler than writing Java• Not as simple as Hive, Pig at what they do, but more general• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable. 15
    • rmr mapreduce Function mapreduce (input, output, map, reduce, …) input – input folder output – output folder map – R function used as map reduce – R function used as reduce … - other advanced parameters 16
    • Some Simple ThingsExample showing sampling and counting map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v) reduce = function(k, vv) keyval(k, length(vv)) mapreduce(input, output, map, reduce)
    • More Simple Things HIVE INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; rmr mapreduce(input = mapreduce(input = "pv_users", map = function(k, v) keyval(v[userid], v[gender]), reduce = function(k, vv) keyval(k, vv[[1]]), output = "pv_gender_sum", map = function(k,v) keyval(v, 1) reduce = function(k, vv) keyval(k, sum(unlist(vv))) Takeaways A language like HIVE makes a class of problems easy to solve, but it is not a general tool The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities 18
    • Complex Thingsk-means Clustering 19
    • k-means - Implementation Well known design (MacQueen, 1967) Comparison of the k-means in MapReduce Pig From Hortonworks Requires coding in 3 languages (Python-Pig-Java) 100 lines of code rmr 20 lines of only R code 22
    • k-means - Highlights map = function(k,v) keyval(which.min(distances(centers,v)),v) reduce = function(k,vv) keyval(NULL, col.average(vv)) centers = from.dfs( mapreduce("data-points", map, reduce)) 23
    • k-means - OptimizationsSlow Fast Notesfor(i in 1:100) a=b+c light use of R interpreter, use a[i] = b[i] + c[i] fast vector primitives, C if necessary[ 1, 2, 3, 4, 5] [[ 1, 2, 3, 4, 5],[6, 7, 8, 9, use beefier records, say 1k 10],[11, 12, 13, 14, 15]... points per recorddistance(center, point) norm(center - P) compute all distances with fast matrix operationscombiner = FALSE combiner = TRUE reduce often and early, use combinerkeyval(k, mean(…)) keyval(k, replace means with (sum, c(total, count)) count) pairs to enable early reduction https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means 24
    • Final thoughts R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data We need contributors to this project! Developers Documentation Use cases General Feedback 25
    • Resources RHadoop Open source project: https://github.com/RevolutionAnalytics/RHa doop/wiki Revolution R Enterprise: bit.ly/Enterprise-R Cloudera CDH: http://www.cloudera.com/hadoop/ Email: rhadoop@revolutionanalytics.com 26
    • Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR 27