Jan 2012 HUG: RHadoop



RHadoop is an open source project aiming to combine two rising stars of the analytics firmament: R and Hadoop. With more than 2M users, R is arguably the dominant language for expressing complex statistical computations. Hadoop needs no introduction at HUG. With RHadoop we are trying to combine the expressiveness of R with the scalability of Hadoop, paving the way for the statistical community to tackle big data with tools they are already familiar with. At this time RHadoop is a collection of three packages that interface with HDFS, HBase and MapReduce, respectively. For MapReduce, the package is called rmr, and we have tried to give it a simple, high-level interface that is true to the MapReduce model and integrated with the rest of the language. We will cover the API and provide some examples.
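For readers new to the model the abstract refers to, here is a toy, single-process simulation of the map/shuffle/reduce cycle in Python. The `keyval` and `mapreduce` names mirror rmr's R API, but this standalone implementation is purely illustrative and has nothing to do with Hadoop itself:

```python
from itertools import groupby

def keyval(k, v):
    # mirrors rmr's keyval(): pair a key with a value
    return (k, v)

def mapreduce(data, map_fn, reduce_fn):
    # toy, single-process stand-in for a MapReduce job:
    # apply map, shuffle (sort and group by key), then reduce each group
    mapped = [kv for k, v in data for kv in [map_fn(k, v)] if kv is not None]
    mapped.sort(key=lambda kv: kv[0])
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(mapped, key=lambda kv: kv[0])]

# word count: the input key is irrelevant, the value is a word
words = [(None, w) for w in "r hadoop r big data r".split()]
counts = mapreduce(words,
                   map_fn=lambda k, v: keyval(v, 1),
                   reduce_fn=lambda k, vv: keyval(k, sum(vv)))
# counts == [('big', 1), ('data', 1), ('hadoop', 1), ('r', 3)]
```

The point of the analogy, developed in the slides below, is that a map function emits key-value pairs and a reduce function sees all values grouped under one key; rmr keeps that contract while Hadoop supplies the distribution.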

Published in: Technology, Education
  • What is RHadoop
    Open source project
    started by Revolution Analytics (Revo)
    aims to make R and Hadoop work together
  • Intro to R
    language for statistics
    replacement for S
    most popular??
  • Hottest people in Hollywood
    Must get together
    Is it a good idea?
    Available analysts
    Trend towards high level language
    Polyglot infrastructure
  • Hadoop is one project but also a family of projects. We started the integration path with three projects targeting three members of the Hadoop family
    rhdfs provides access to the HDFS file system; it can be divided into two sub-APIs: file level and byte level
  • A way to access big data sets
    A simple way to write parallel programs – everyone will have to
    Very R-like, building on the functional characteristics of R
    Just a library 
  • Much simpler than writing Java
    Not as simple as Hive or Pig at what they do, but more general
    Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable.
  • skip quickly to other slides
    notice three different languages
  • Takeaways
    A language like Hive makes a class of problems easy to solve, but it is not a general tool
    The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
  • Jan 2012 HUG: RHadoop

    1. RHadoop, Hadoop for R
    2. r4stats.com
    3. rhdfs rhbase rmr
    4. sapply(data, function)
       mapreduce(data, function)

       #!/usr/bin/Rscript
       library(rmr)
       mapreduce(…)
    5. Expose MR vs. Hide MR: Hive, Pig | rmr, Rhipe, Dumbo, Pydoop, Hadoopy | Java, C++ | Cascalog, Scalding, Scrunch | Cascading, Crunch
    6. #!/usr/bin/python
       import sys
       from math import fabs
       from org.apache.pig.scripting import Pig

       filename = "student.txt"
       k = 4
       tolerance = 0.01
       MAX_SCORE = 4
       MIN_SCORE = 0
       MAX_ITERATION = 100

       # initial centroids: equally divide the space
       initial_centroids = ""
       last_centroids = [None] * k
       for i in range(k):
           last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
           initial_centroids = initial_centroids + str(last_centroids[i])
           if i != k-1:
               initial_centroids = initial_centroids + ":"

       P = Pig.compile("""register udf.jar
       DEFINE find_centroid FindCentroid('$centroids');
       raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
       centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
       grouped = group centroided by centroid;
       result = foreach grouped generate group, AVG(centroided.gpa);
       store result into 'output';
       """)

       converged = False
       iter_num = 0
       while iter_num < MAX_ITERATION:
           Q = P.bind({'centroids': initial_centroids})
    7. (continues the while loop from slide 6)
           results = Q.runSingle()
           if not results.isSuccessful():
               raise RuntimeError("Pig job failed")
           iter = results.result("result").iterator()
           centroids = [None] * k
           distance_move = 0
           # get the new centroids for this iteration and compute
           # how far they moved since the last iteration
           for i in range(k):
               tuple = iter.next()
               centroids[i] = float(str(tuple.get(1)))
               distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
           distance_move = distance_move / k
           Pig.fs("rmr output")
           print("iteration " + str(iter_num))
           print("average distance moved: " + str(distance_move))
           if distance_move < tolerance:
               sys.stdout.write("k-means converged at centroids: [")
               sys.stdout.write(",".join(str(v) for v in centroids))
               sys.stdout.write("]\n")
               converged = True
               break
           last_centroids = centroids[:]
           initial_centroids = ""
           for i in range(k):
               initial_centroids = initial_centroids + str(last_centroids[i])
               if i != k-1:
                   initial_centroids = initial_centroids + ":"
           iter_num += 1

       if not converged:
           print("did not converge after " + str(iter_num) + " iterations")
           sys.stdout.write("last centroids: [")
           sys.stdout.write(",".join(str(v) for v in last_centroids))
           sys.stdout.write("]\n")
    8. import java.io.IOException;
       import org.apache.pig.EvalFunc;
       import org.apache.pig.data.Tuple;

       public class FindCentroid extends EvalFunc<Double> {
           double[] centroids;

           public FindCentroid(String initialCentroid) {
               String[] centroidStrings = initialCentroid.split(":");
               centroids = new double[centroidStrings.length];
               for (int i = 0; i < centroidStrings.length; i++)
                   centroids[i] = Double.parseDouble(centroidStrings[i]);
           }

           @Override
           public Double exec(Tuple input) throws IOException {
               double min_distance = Double.MAX_VALUE;
               double closest_centroid = 0;
               for (double centroid : centroids) {
                   double distance = Math.abs(centroid - (Double)input.get(0));
                   if (distance < min_distance) {
                       min_distance = distance;
                       closest_centroid = centroid;
                   }
               }
               return closest_centroid;
           }
       }
    9. mapreduce(input, output, map, reduce)
       input: one or more HDFS paths, or the output of other mapreduce jobs
       output: an HDFS path; defaults to a temp location
       map: a function of two args returning a keyval(); defaults to the identity
       reduce: a function of two args returning a keyval(); defaults to none
    10. map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
        reduce = function(k, vv) keyval(k, length(vv))
    11. condition = function(x) x > 10
        out = mapreduce(
          input = input,
          map = function(k, v) if (condition(v)) keyval(k, v))
    12. x = from.dfs(hdfs.object)
        hdfs.object = to.dfs(x)
    13. INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        FROM pv_users
        GROUP BY pv_users.gender;

        mapreduce(
          input = mapreduce(
            input = "pv_users",
            map = function(k, v) keyval(v['userid'], v['gender']),
            reduce = function(k, vv) keyval(k, vv[[1]])),
          output = "pv_gender_sum",
          map = function(k, v) keyval(v, 1),
          reduce = function(k, vv) keyval(k, sum(unlist(vv))))
    14. kmeans = function(points, ncenters, iterations = 10,
                          distfun = function(a, b) norm(as.matrix(a - b), type = 'F')) {
          newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
          for (i in 1:iterations) {
            newCenters = lapply(values(newCenters), unlist)
            newCenters = kmeans.iter(points, distfun, centers = newCenters)}
          newCenters}

        kmeans.iter = function(points, distfun, ncenters = length(centers), centers = NULL) {
          from.dfs(
            mapreduce(
              input = points,
              map = if (is.null(centers)) {
                  function(k, v) keyval(sample(1:ncenters, 1), v)}
                else {
                  function(k, v) {
                    distances = lapply(centers, function(c) distfun(c, v))
                    keyval(centers[[which.min(distances)]], v)}},
              reduce = function(k, vv)
                keyval(NULL, apply(do.call(rbind, vv), 2, mean))))}
    15. input.specs, output.specs; combine; reduce.on.data.frame; tuning.params; verbose; local and hadoop backends; profiling; managed IO; optimize
    16. mapreduce(mapreduce(…
        mapreduce(input = c(input1, input2), …)
        equijoin(left.input = input1, right.input = input2, …)

        out1 = mapreduce(…)
        mapreduce(input = out1, <xyz>)
        mapreduce(input = out1, <abc>)

        abstract.job = function(input, output, …) {
          …
          result = mapreduce(input = input, output = output)
          …
          result}
    17. repo: github.com/RevolutionAnalytics/RHadoop/
        license: Apache 2.0
        documentation: R help, github wiki
        Q/A: github issue tracking
        email: rhadoop@revolutionanalytics.com
        project lead: David Champagne
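The k-means example shown above in Pig (slides 6-8) and rmr (slide 14) reduces to a map step that assigns each point to its nearest centroid and a reduce step that averages each group. For readers who want to play with the algorithm outside Hadoop, here is a toy, single-process Python sketch of one iteration; it uses 1-D points and plain lists, is illustrative only, and is not part of RHadoop:

```python
from itertools import groupby

def kmeans_iter(points, centroids):
    # one MapReduce-style k-means iteration over 1-D points:
    # "map": tag each point with its nearest centroid
    # "shuffle": sort so points with the same centroid are adjacent
    assigned = sorted((min(centroids, key=lambda c: abs(c - p)), p)
                      for p in points)
    # "reduce": the new centroid of each group is the mean of its points
    means = []
    for c, group in groupby(assigned, key=lambda kv: kv[0]):
        ps = [p for _, p in group]
        means.append(sum(ps) / len(ps))
    return sorted(means)

points = [0.1, 0.2, 0.9, 1.1, 3.8, 4.0]
centroids = [0.0, 1.0, 4.0]
new = kmeans_iter(points, centroids)
# groups by nearest centroid: [0.1, 0.2], [0.9, 1.1], [3.8, 4.0]
```

Repeating `kmeans_iter` until the centroids stop moving is exactly the driver loop that the Pig version writes by hand and the rmr version expresses as a for loop around a mapreduce() call.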