RHadoop, R meets Hadoop


Published on Feb 29, 2012. Presented by Antonio Piccolboni at the Strata 2012 Conference.

RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. It comprises three packages:

- rhdfs provides file-level manipulation of HDFS, the Hadoop distributed file system
- rhbase provides access to HBase, the Hadoop database
- rmr lets you write mapreduce programs in R (a first-session sketch follows)
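
A hypothetical first session with the three packages (entry points as in the 2012 releases; treat the exact function names as assumptions rather than a definitive API reference):

    library(rhdfs)
    hdfs.init()           # connect to HDFS
    hdfs.ls("/")          # list the filesystem root

    library(rhbase)
    hb.init()             # connect to HBase through its thrift gateway
    hb.list.tables()      # enumerate tables

    library(rmr)
    small = to.dfs(1:10)  # push an ordinary R object onto the dfs
    from.dfs(small)       # and read it back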

rmr allows R developers to program in the mapreduce framework, and offers all developers an alternative way to implement mapreduce programs that strikes a delicate compromise between power and usability. It lets you write general mapreduce programs with the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it (a sketch follows). It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
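
The half-page logistic regression might look like the following. This is a sketch, not the talk's exact code: it assumes training points stored on the dfs as keyval(NULL, list(x = <feature vector>, y = <label in {-1, 1}>)) and borrows from.dfs(..., to.data.frame = T) from the k-means slides below.

    # batch gradient descent over a dfs-resident dataset
    logistic.regression = function(input, iterations, dims, alpha) {
      plane = rep(0, dims)                # current weight vector
      g = function(z) 1 / (1 + exp(-z))   # logistic function
      for (i in 1:iterations) {
        gradient = from.dfs(
          mapreduce(
            input = input,
            # each point contributes y * x * g(-y * <w, x>) to the gradient
            map = function(k, v)
              keyval(1, v$y * v$x * g(-v$y * sum(plane * v$x))),
            # a single reduce group sums the contributions
            reduce = function(k, vv)
              keyval(k, apply(do.call(rbind, vv), 2, sum))),
          to.data.frame = T)
        # assuming the key lands in column 1 of the resulting data frame
        plane = plane + alpha * as.numeric(gradient[1, -1])
      }
      plane
    }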


RHadoop, R meets Hadoop

  1. RHadoop, Hadoop for R
  2. [chart] "Scholarly Activity, 05-09 change" for R, SAS, SPSS, S-Plus, and Stata (y-axis from -37.5% to 50%)
  3. [chart] The same, alongside a count of R packages from 2002 to 2010, on a log scale from 1 to 10,000
  4. [chart] As above; source: http://r4stats.com/popularity
  5. David Champagne, CTO
  6. [logo] rhdfs
  7. [logos] rhdfs, rhbase
  8. [logos] rhdfs, rhbase, rmr
  9. rmr
  10. sapply(data, function)
      mapreduce(data, map = function)
  11. library(rmr)
      mapreduce(…)
  12.-19. [a diagram, built up over eight slides] Hadoop interfaces arranged on an "Expose MR" to "Hide MR" axis: Rmr, Rhipe, Dumbo, Pydoop, Hadoopy and Java, C++ expose mapreduce; Cascading and Crunch sit in between; Hive, Pig, Cascalog, Scalding, and Scrunch hide it.
  20.-25. [one line, dwelt on over six slides; a runnable sketch follows the transcript]
      mapreduce(input, output, map, reduce)
  26.-33. [one example, dwelt on over eight slides; made runnable after the transcript]
      map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
      reduce = function(k, vv) keyval(k, length(vv))
  34.-39. [built up line by line over six slides]
      condition = function(x) x > 10
      out = mapreduce(
        input = input,
        map = function(k, v)
          if (condition(v)) keyval(k, v))
  40. x = from.dfs(hdfs.object)
      hdfs.object = to.dfs(x)
  41. [Hive]
      INSERT OVERWRITE TABLE pv_gender_sum
      SELECT pv_users.gender, count(DISTINCT pv_users.userid)
      FROM pv_users
      GROUP BY pv_users.gender;
  42. [the same query, next to its rmr equivalent]
      mapreduce(
        input = mapreduce(
          input = "pv_users",
          map = function(k, v) keyval(v["userid"], v["gender"]),
          reduce = function(uid, genders)
            lapply(unique(genders), function(g) keyval(NULL, g))),
        output = "pv_gender_sum",
        map = function(x, gender) keyval(gender, 1),
        reduce = function(gender, counts)
          keyval(gender, sum(unlist(counts))))
  43. kmeans = function(points, ncenters, iterations = 10,
                distfun = function(a, b) norm(as.matrix(a - b), type = "F")) {
        newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
        for (i in 1:iterations) {
          newCenters = kmeans.iter(points, distfun, centers = newCenters)}
        newCenters}
  44.-47. [the above plus its helper, dwelt on over four slides; a usage sketch follows the transcript]
      kmeans.iter = function(points, distfun,
                    ncenters = dim(centers)[1], centers = NULL) {
        from.dfs(
          mapreduce(
            input = points,
            map = if (is.null(centers)) {
                function(k, v) keyval(sample(1:ncenters, 1), v)}
              else {
                function(k, v) {
                  distances = apply(centers, 1, function(c) distfun(c, v))
                  keyval(centers[which.min(distances), ], v)}},
            reduce = function(k, vv)
              keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
          to.data.frame = T)}
  48. [the same k-means, Pig-style: a Python driver embedding Pig]
      #!/usr/bin/python
      import sys
      from math import fabs
      from org.apache.pig.scripting import Pig

      filename = "student.txt"
      k = 4
      tolerance = 0.01
      MAX_SCORE = 4
      MIN_SCORE = 0
      MAX_ITERATION = 100

      # initial centroids, equally dividing the space
      initial_centroids = ""
      last_centroids = [None] * k
      for i in range(k):
          last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
          initial_centroids = initial_centroids + str(last_centroids[i])
          if i != k-1:
              initial_centroids = initial_centroids + ":"

      P = Pig.compile("""register udf.jar
          DEFINE find_centroid FindCentroid('$centroids');
          raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
          centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
          grouped = group centroided by centroid;
          result = foreach grouped generate group, AVG(centroided.gpa);
          store result into 'output';
          """)

      converged = False
      iter_num = 0
      while iter_num < MAX_ITERATION:
          Q = P.bind({'centroids': initial_centroids})
          results = Q.runSingle()
  49. [continued]
          if not results.isSuccessful():
              raise Exception("Pig job failed")
          iter = results.result("result").iterator()
          centroids = [None] * k
          distance_move = 0
          # get the new centroids of this iteration and the distance
          # moved since the last iteration
          for i in range(k):
              tuple = iter.next()
              centroids[i] = float(str(tuple.get(1)))
              distance_move = distance_move + fabs(last_centroids[i]-centroids[i])
          distance_move = distance_move / k
          Pig.fs("rmr output")
          print("iteration " + str(iter_num))
          print("average distance moved: " + str(distance_move))
          if distance_move < tolerance:
              sys.stdout.write("k-means converged at centroids: [")
              sys.stdout.write(",".join(str(v) for v in centroids))
              sys.stdout.write("]\n")
              converged = True
              break
          last_centroids = centroids[:]
          initial_centroids = ""
          for i in range(k):
              initial_centroids = initial_centroids + str(last_centroids[i])
              if i != k-1:
                  initial_centroids = initial_centroids + ":"
          iter_num += 1

      if not converged:
          print("did not converge after " + str(iter_num) + " iterations")
          sys.stdout.write("last centroids: [")
          sys.stdout.write(",".join(str(v) for v in last_centroids))
          sys.stdout.write("]\n")
  50. [the Pig UDF, in Java]
      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class FindCentroid extends EvalFunc<Double> {
          double[] centroids;
          public FindCentroid(String initialCentroid) {
              String[] centroidStrings = initialCentroid.split(":");
              centroids = new double[centroidStrings.length];
              for (int i = 0; i < centroidStrings.length; i++)
                  centroids[i] = Double.parseDouble(centroidStrings[i]);
          }
          @Override
          public Double exec(Tuple input) throws IOException {
              double min_distance = Double.MAX_VALUE;
              double closest_centroid = 0;
              for (double centroid : centroids) {
                  double distance = Math.abs(centroid - (Double)input.get(0));
                  if (distance < min_distance) {
                      min_distance = distance;
                      closest_centroid = centroid;
                  }
              }
              return closest_centroid;
          }
      }
  51. mapreduce(mapreduce(…
  52. [the above, plus]
      mapreduce(input = c(input1, input2), …)
  53. [the above, plus]
      equijoin = function(
        left.input, right.input, input,
        output, outer, map.left, map.right,
        reduce, reduce.all)
  54. out1 = mapreduce(…)
      mapreduce(input = out1, <xyz>)
      mapreduce(input = out1, <abc>)
  55. [the above, plus; a concrete instance follows the transcript]
      abstract.job = function(input, output, …) {
        …
        result = mapreduce(input = input, output = output)
        …
        result}
  56.-62. [a feature list, one item added per slide; see the sketch after the transcript]
      input.format, output.format, format
      combine
      reduce.on.data.frame
      local, hadoop backends
      backend.parameters
      profiling
      verbose
  63. RHADOOP USER
      ONE FAT CLUSTER AVE.
      HYDROPOWER CITY, OR 0x0000
      RHADOOP@REVOLUTIONANALYTICS.COM
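
A few of the fragments above, fleshed out into runnable form. All of these are sketches against the rmr 1.x API of the time, not code from the deck.

Slides 20-25 and 40 combined into a complete job; with the local backend this runs without a cluster:

    library(rmr)
    small.ints = to.dfs(1:100)              # an R vector becomes a dfs dataset
    out = mapreduce(
      input = small.ints,
      map = function(k, v) keyval(v, v^2))  # key = n, value = n squared
    from.dfs(out)                           # back to an in-memory object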
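
Slides 26-33 made runnable: keep roughly one key in ten, then count the values per surviving key. R has no built-in hash(), so this substitutes a hypothetical character-sum stand-in:

    hash = function(k) sum(utf8ToInt(as.character(k)))
    out = mapreduce(
      input = input,      # any dfs dataset of key-value pairs
      map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v),
      reduce = function(k, vv) keyval(k, length(vv)))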
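
The k-means of slides 43-47 in use, assuming to.dfs accepts a list of key-value pairs as in rmr 1.x (note that this kmeans shadows stats::kmeans):

    # 100 random 2-d points drawn around 5 loose centers
    points = to.dfs(lapply(1:100, function(i) keyval(i, rnorm(2, mean = i %% 5))))
    centers = kmeans(points, ncenters = 5)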
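
Slide 55's abstraction pattern instantiated with a hypothetical word count, reusing the multiple-emission trick from slide 42 (a function returning a list of key-value pairs):

    wordcount = function(input, output = NULL)
      mapreduce(
        input = input,
        output = output,
        map = function(k, line)                 # one keyval per word
          lapply(strsplit(line, " +")[[1]], function(w) keyval(w, 1)),
        reduce = function(w, counts) keyval(w, sum(unlist(counts))))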
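
And the knobs from slides 56-62 in a single call; the argument names are the slides' own, but the values shown are assumptions:

    out = mapreduce(
      input = "/data/events.csv",
      input.format = "csv",     # parse the input as comma-separated text
      map = function(k, v) keyval(v[[1]], 1),
      reduce = function(k, vv) keyval(k, sum(unlist(vv))),
      combine = TRUE,           # also run the reduce map-side
      verbose = TRUE)           # stream job progress to the console
    # switching between the local and hadoop backends is a one-liner;
    # the exact call (e.g. rmr.options(backend = "local")) is assumed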
