RHadoop, R meets Hadoop

(Presented by Antonio Piccolboni at the Strata 2012 Conference, Feb 29 2012.)

RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. RHadoop comprises three packages:

- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr lets you write mapreduce programs in R (a short sketch follows this list)
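As a flavor of the first two packages, here is a minimal, hedged sketch of browsing the cluster from R. The function names (hdfs.init, hdfs.ls, hb.init, hb.list.tables) are recalled from the rhdfs/rhbase documentation's hdfs.*/hb.* conventions, not taken from this presentation, so treat them as indicative:

    # Hedged sketch: inspecting HDFS and HBase from the R prompt.
    library(rhdfs)
    hdfs.init()          # connect to the cluster
    hdfs.ls("/user")     # list a directory, like `hadoop fs -ls`

    library(rhbase)
    hb.init()            # connect to HBase via its Thrift gateway
    hb.list.tables()     # enumerate the available tables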

rmr allows R developers to program in the mapreduce framework, and offers all developers an alternative way to implement mapreduce programs that strikes a delicate compromise between power and usability. It lets you write general mapreduce programs, offering the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
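To back the half-a-page claim, here is a hedged sketch of logistic regression by gradient descent on rmr, modeled on the project's tutorials; the exact keyval and from.dfs return conventions varied across early rmr versions, so the details are indicative rather than authoritative:

    # Hedged sketch: logistic regression by gradient descent on rmr.
    # Assumes each input record is a keyval pair whose value v carries
    # v$y (a label in {-1, 1}) and v$x (a numeric feature vector).
    logistic.regression =
      function(input, iterations = 10, dims = 2, alpha = 0.05) {
        plane = rep(0, dims)                  # current separating hyperplane
        g = function(z) 1 / (1 + exp(-z))     # logistic function
        for (i in 1:iterations) {
          gradient = from.dfs(
            mapreduce(
              input = input,
              map = function(k, v)            # per-point gradient contribution
                keyval(1, v$y * v$x * g(-v$y * (plane %*% v$x))),
              reduce = function(k, vv)        # sum contributions across points
                keyval(k, apply(do.call(rbind, vv), 2, sum))))
          plane = plane + alpha * gradient[[1]]$val}
        plane}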

Statistics

Total views: 66,458 (7,158 on SlideShare, 59,300 embedded)
Likes: 32, Downloads: 662, Comments: 2


Upload Details

Uploaded as Apple Keynote

Usage Rights

© All Rights Reserved

Comments

  • Hmmm. I'm not sure of the difference between RHadoop and RHive. RHive has mapreduce functionality even though its name is Hive. It also has an HDFS adapter.
  • RHadoop step by step
Speaker Notes

  • What is R; what is RHadoop: an open source project, started by Revolution, that aims to make R and Hadoop work together; what is Revolution.
  • Faster, assured builds; Large Data extensions; web deployments; tech support; consulting services; training.
  • Hadoop brings horizontal scalability, R sophisticated analytics; the combination could be powerful.
  • Hadoop is one project but also a family of projects. We started the integration path with three projects targeting three members of the Hadoop family. rhdfs provides access to the HDFS file system; it can be divided into two sub-APIs, file level and byte level.
  • A way to access big data sets.
  • A simple way to write parallel programs: everyone will have to.
  • Very R-like, building on the functional characteristics of R.
  • Just a library.
  • Much simpler than writing Java; not as simple as Hive or Pig at what they do, but more general. Great for prototyping, and you can transition to production: optimize instead of rewriting! Lower risk, always executable.
  • mapreduce is the first and most important element of the API. The input can be as simple as a path; the output the same, or skip it for managed space with stubs. map and reduce are simple R functions, as opposed to Rhipe.
  • Simple map example: filtering. Reduce example: counting.
  • Easy to parametrize jobs.
  • The second pillar of the API: the memory-HDFS bridge.
  • A language like Hive makes a class of problems easy to solve, but it is not a general tool. The cost of doing the same operation in rmr is modest, and it provides a broader set of capabilities.
  • The k-means implementation fits in two simple functions; note how easy it is to get data in and out of the cluster.
  • Skip quickly to the other slides; notice the three different languages.
  • More things you can do combining the elements of the API.

RHadoop, R meets Hadoop: Presentation Transcript

  • RHadoop, Hadoop for R
  • [Charts: "Scholarly Activity, 05-09 change" — percent change (-37.5% to 50%) for R, SAS, SPSS, S-Plus, Stata; "Packages" — number of R packages on a log scale (1 to 10,000), 2002-2010. Source: http://r4stats.com/popularity]
  • David Champagne, CTO
  • rhdfs
  • rhdfs, rhbase
  • rhdfs, rhbase, rmr
  • rmr
  • sapply(data, function)
    mapreduce(data, map = function)
  • library(rmr)
    mapreduce(…)
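The analogy can be made concrete with a toy job. A hedged sketch using only the rmr calls shown in this deck (to.dfs, mapreduce, from.dfs, keyval); the dataset is made up:

    # Hedged sketch: squaring ten integers, once with sapply and once
    # with mapreduce over a small dataset pushed to HDFS.
    sapply(1:10, function(v) v^2)
    small.ints = to.dfs(1:10)
    out = mapreduce(
      input = small.ints,
      map = function(k, v) keyval(v, v^2))
    from.dfs(out)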
  • Expose MR: rmr, Rhipe, Dumbo, Pydoop, Hadoopy / Java, C++
    Hide MR: Hive, Pig / Cascalog, Scalding, Scrunch / Cascading, Crunch
  • mapreduce(input, output, map, reduce)
  • map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
    reduce = function(k, vv) keyval(k, length(vv))
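These two functions plug straight into mapreduce. A hedged wiring sketch ("big.data" is a placeholder HDFS path, and hash is whatever hash function the slide presumes):

    # Hedged sketch: running the 1-in-10 sampling map and the counting
    # reduce above as a single job.
    out = mapreduce(
      input = "big.data",
      map = map,
      reduce = reduce)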
  • condition = function(x) x > 10
    out = mapreduce(
      input = input,
      map = function(k, v)
        if (condition(v)) keyval(k, v))
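Because condition is an ordinary R value captured by the map closure, the job is trivially parametrizable, which is what the speaker notes mean by "easy to parametrize jobs". A hedged sketch (filter.job is a hypothetical wrapper, not part of rmr):

    # Hedged sketch: the same job body re-run with different R-level
    # parameters, no job rewriting involved.
    filter.job = function(input, condition)
      mapreduce(
        input = input,
        map = function(k, v) if (condition(v)) keyval(k, v))
    out.small = filter.job(input, function(x) x > 10)
    out.large = filter.job(input, function(x) x > 100)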
  • x = from.dfs(hdfs.object)
    hdfs.object = to.dfs(x)
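The speaker notes call this pair the second pillar of the API, the memory-HDFS bridge. A hedged round-trip sketch (the data frame is made up; the to.data.frame argument appears later in the k-means example):

    # Hedged sketch: round-tripping a small R object through the
    # distributed file system.
    squares = data.frame(x = 1:100, xx = (1:100)^2)
    hdfs.object = to.dfs(squares)
    back = from.dfs(hdfs.object, to.data.frame = T)
    head(back)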
  • INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

    mapreduce(
      input = mapreduce(
        input = "pv_users",
        map = function(k, v) keyval(v['userid'], v['gender']),
        reduce = function(uid, genders)
          lapply(unique(genders), function(g) keyval(NULL, g))),
      output = "pv_gender_sum",
      map = function(x, gender) keyval(gender, 1),
      reduce = function(gender, counts)
        keyval(gender, sum(unlist(counts))))
  • kmeans =
      function(points, ncenters, iterations = 10,
               distfun =
                 function(a, b) norm(as.matrix(a - b), type = 'F')) {
        newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
        for(i in 1:iterations) {
          newCenters = kmeans.iter(points, distfun, centers = newCenters)}
        newCenters}

    kmeans.iter =
      function(points, distfun,
               ncenters = dim(centers)[1], centers = NULL) {
        from.dfs(
          mapreduce(
            input = points,
            map =
              if (is.null(centers)) {
                function(k, v) keyval(sample(1:ncenters, 1), v)}
              else {
                function(k, v) {
                  distances = apply(centers, 1, function(c) distfun(c, v))
                  keyval(centers[which.min(distances), ], v)}},
            reduce =
              function(k, vv)
                keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
          to.data.frame = T)}
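A hedged usage sketch for the two functions above, on made-up data (100 random 2-D points, 5 clusters); per the speaker notes, the point is how easily data moves in and out of the cluster:

    # Hedged sketch: exercising the k-means above on synthetic points
    # pushed to the cluster as a list of keyval pairs.
    points = to.dfs(lapply(1:100, function(i) keyval(i, rnorm(2))))
    kmeans(points, ncenters = 5)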
  • #!/usr/bin/python
    import sys
    from math import fabs
    from org.apache.pig.scripting import Pig

    filename = "student.txt"
    k = 4
    tolerance = 0.01
    MAX_SCORE = 4
    MIN_SCORE = 0
    MAX_ITERATION = 100

    # initial centroids, equally dividing the space
    initial_centroids = ""
    last_centroids = [None] * k
    for i in range(k):
        last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
        initial_centroids = initial_centroids + str(last_centroids[i])
        if i!=k-1:
            initial_centroids = initial_centroids + ":"

    P = Pig.compile("""register udf.jar
                       DEFINE find_centroid FindCentroid('$centroids');
                       raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
                       centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
                       grouped = group centroided by centroid;
                       result = foreach grouped generate group, AVG(centroided.gpa);
                       store result into 'output';
                    """)

    converged = False
    iter_num = 0
    while iter_num < MAX_ITERATION:
        Q = P.bind({'centroids': initial_centroids})
        results = Q.runSingle()
        if results.isSuccessful() == "FAILED":
            raise "Pig job failed"
        iter = results.result("result").iterator()
        centroids = [None] * k
        distance_move = 0
        # get the new centroids of this iteration, calculate the distance
        # moved since the last iteration
        for i in range(k):
            tuple = iter.next()
            centroids[i] = float(str(tuple.get(1)))
            distance_move = distance_move + fabs(last_centroids[i]-centroids[i])
        distance_move = distance_move / k
        Pig.fs("rmr output")
        print("iteration " + str(iter_num))
        print("average distance moved: " + str(distance_move))
        if distance_move < tolerance:
            sys.stdout.write("k-means converged at centroids: [")
            sys.stdout.write(",".join(str(v) for v in centroids))
            sys.stdout.write("]\n")
            converged = True
            break
        last_centroids = centroids[:]
        initial_centroids = ""
        for i in range(k):
            initial_centroids = initial_centroids + str(last_centroids[i])
            if i!=k-1:
                initial_centroids = initial_centroids + ":"
        iter_num += 1

    if not converged:
        print("not converged after " + str(iter_num) + " iterations")
        sys.stdout.write("last centroids: [")
        sys.stdout.write(",".join(str(v) for v in last_centroids))
        sys.stdout.write("]\n")
  • import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class FindCentroid extends EvalFunc<Double> {
        double[] centroids;

        public FindCentroid(String initialCentroid) {
            String[] centroidStrings = initialCentroid.split(":");
            centroids = new double[centroidStrings.length];
            for (int i = 0; i < centroidStrings.length; i++)
                centroids[i] = Double.parseDouble(centroidStrings[i]);
        }

        @Override
        public Double exec(Tuple input) throws IOException {
            double min_distance = Double.MAX_VALUE;
            double closest_centroid = 0;
            for (double centroid : centroids) {
                double distance = Math.abs(centroid - (Double)input.get(0));
                if (distance < min_distance) {
                    min_distance = distance;
                    closest_centroid = centroid;
                }
            }
            return closest_centroid;
        }
    }
  • mapreduce(mapreduce(…
    mapreduce(input = c(input1, input2), …)
    equijoin = function(
      left.input, right.input, input,
      output, outer, map.left, map.right,
      reduce, reduce.all)
  • out1 = mapreduce(…)
    mapreduce(input = out1, <xyz>)
    mapreduce(input = out1, <abc>)

    abstract.job = function(input, output, …) {
      …
      result = mapreduce(input = input, output = output)
      …
      result}
  • input.format, output.format, format
    combine
    reduce.on.data.frame
    local, hadoop backends
    backend.parameters
    profiling
    verbose
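As one example of these knobs, a hedged word-count sketch using the combine option from the list above; the map-returns-a-list-of-keyvals pattern follows the rmr tutorials, and the tokenization is deliberately naive:

    # Hedged sketch: map-side pre-aggregation with combine = T; the map
    # emits one (word, 1) pair per token, the reduce/combiner sums them.
    wordcount = function(input, output)
      mapreduce(
        input = input,
        output = output,
        map = function(k, line)
          lapply(strsplit(line, " ")[[1]], function(w) keyval(w, 1)),
        reduce = function(word, counts) keyval(word, sum(unlist(counts))),
        combine = T)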
  • RHADOOP USER
    ONE FAT CLUSTER AVE.
    HYDROPOWER CITY, OR 0x0000
    RHADOOP@REVOLUTIONANALYTICS.COM