Open source project

started by Revo

aims to make R and Hadoop work together

language for statistics

replacement for S

most popular??

Must get together

Is it a good idea?

Available analysts

Trend towards high level language

Polyglot infrastructure

Hadoop hdfs provides access to the HDFS file system. It can be divided into two sub-APIs: file level and byte level.

A simple way to write parallel programs – everyone will have to

Very R-like, building on the functional characteristics of R
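
The "sapply, but on a cluster" idea in these notes can be tried out locally. What follows is a minimal in-memory sketch in Python, not rmr itself: `local_mapreduce` and `keyval` are hypothetical stand-ins that only imitate the shape of the calls shown on the slides, and the data is made up.

```python
from collections import defaultdict

def keyval(k, v):
    """Hypothetical stand-in for a key/value constructor: just a pair."""
    return (k, v)

def local_mapreduce(records, map_fn, reduce_fn=None):
    """Imitate one map/shuffle/reduce pass over (key, value) records in memory."""
    mapped = []
    groups = defaultdict(list)
    for k, v in records:
        out = map_fn(k, v)
        if out is None:          # a map function may emit nothing for a record
            continue
        mapped.append(out)
        groups[out[0]].append(out[1])
    if reduce_fn is None:        # map-only job
        return mapped
    return [reduce_fn(k, vv) for k, vv in groups.items()]

data = [1, 2, 3, 4]

# sapply style: apply a function to every element
squares = [x * x for x in data]

# mapreduce style: the same function, taking and returning key/value pairs
records = [(None, x) for x in data]
result = local_mapreduce(records, lambda k, v: keyval(k, v * v))

# a map-only filter: emit a record only when a condition holds
big = local_mapreduce(records, lambda k, v: keyval(k, v) if v > 2 else None)
```

The only difference between the two styles is the key/value wrapping, which is what lets the same function run unchanged when the records are spread over many machines.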

Just a library

Not as simple as Hive or Pig at what they do, but more general

Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable.

notice three different languages

A language like Hive makes a class of problems easy to solve, but it is not a general tool

The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
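
To make the comparison concrete, here is a local Python sketch of a distinct count per group written as two chained map/reduce stages, the structure a `count(DISTINCT ...) ... GROUP BY` query hides. It is an illustration under assumptions: `run_stage` is a hypothetical in-memory helper, not an rmr or Hadoop call, and the `pv_users` rows are invented.

```python
from collections import defaultdict

def run_stage(records, map_fn, reduce_fn):
    """Hypothetical helper: one in-memory map/shuffle/reduce pass."""
    groups = defaultdict(list)
    for k, v in records:
        mk, mv = map_fn(k, v)
        groups[mk].append(mv)
    return [reduce_fn(k, vv) for k, vv in groups.items()]

# Invented page-view rows; user 1 appears twice.
pv_users = [
    (None, {"userid": 1, "gender": "f"}),
    (None, {"userid": 1, "gender": "f"}),
    (None, {"userid": 2, "gender": "m"}),
    (None, {"userid": 3, "gender": "f"}),
]

# Stage 1: key by userid and keep one gender per user (the DISTINCT part).
distinct_users = run_stage(
    pv_users,
    lambda k, v: (v["userid"], v["gender"]),
    lambda k, vv: (k, vv[0]),
)

# Stage 2: key by gender and count users (the GROUP BY part).
pv_gender_sum = run_stage(
    distinct_users,
    lambda k, v: (v, 1),
    lambda k, vv: (k, sum(vv)),
)
```

The second stage takes the output of the first as its input, which is the same nesting as a mapreduce call whose input is another mapreduce call.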

- 1. RHadoop, Hadoop for R
- 2. r4stats.com
- 3. rhdfs rhbase rmr
- 4.

      sapply(data, function)

      mapreduce(data, function)

      #!/usr/bin/Rscript
      library(rmr)
      mapreduce(…)
- 5. Hive, Pig Rmr, Rhipe, Dumbo, Pydoop Hadoopy Java, C++ Cascalog, Scalding, Scrunch Cascading, Crunch Rmr Expose MR Hide MR
- 6.

      #!/usr/bin/python
      import sys
      from math import fabs
      from org.apache.pig.scripting import Pig

      filename = "student.txt"
      k = 4
      tolerance = 0.01

      MAX_SCORE = 4
      MIN_SCORE = 0
      MAX_ITERATION = 100

      # initial centroid, equally divide the space
      initial_centroids = ""
      last_centroids = [None] * k
      for i in range(k):
          last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
          initial_centroids = initial_centroids + str(last_centroids[i])
          if i!=k-1:
              initial_centroids = initial_centroids + ":"

      P = Pig.compile("""register udf.jar
                         DEFINE find_centroid FindCentroid('$centroids');
                         raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
                         centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
                         grouped = group centroided by centroid;
                         result = foreach grouped generate group, AVG(centroided.gpa);
                         store result into 'output';
                      """)

      converged = False
      iter_num = 0
      while iter_num<MAX_ITERATION:
          Q = P.bind({'centroids':initial_centroids})
- 7.

          if results.isSuccessful() == "FAILED":
              raise "Pig job failed"
          iter = results.result("result").iterator()
          centroids = [None] * k
          distance_move = 0
          # get new centroid of this iteration, calculate the moving distance with last iteration
          for i in range(k):
              tuple = iter.next()
              centroids[i] = float(str(tuple.get(1)))
              distance_move = distance_move + fabs(last_centroids[i]-centroids[i])
          distance_move = distance_move / k;
          Pig.fs("rmr output")
          print("iteration " + str(iter_num))
          print("average distance moved: " + str(distance_move))
          if distance_move<tolerance:
              sys.stdout.write("k-means converged at centroids: [")
              sys.stdout.write(",".join(str(v) for v in centroids))
              sys.stdout.write("]\n")
              converged = True
              break
          last_centroids = centroids[:]
          initial_centroids = ""
          for i in range(k):
              initial_centroids = initial_centroids + str(last_centroids[i])
              if i!=k-1:
                  initial_centroids = initial_centroids + ":"
          iter_num += 1

      if not converged:
          print("not converge after " + str(iter_num) + " iterations")
          sys.stdout.write("last centroids: [")
          sys.stdout.write(",".join(str(v) for v in last_centroids))
          sys.stdout.write("]\n")
- 8.

      import java.io.IOException;

      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class FindCentroid extends EvalFunc<Double> {
          double[] centroids;

          public FindCentroid(String initialCentroid) {
              String[] centroidStrings = initialCentroid.split(":");
              centroids = new double[centroidStrings.length];
              for (int i=0;i<centroidStrings.length;i++)
                  centroids[i] = Double.parseDouble(centroidStrings[i]);
          }

          @Override
          public Double exec(Tuple input) throws IOException {
              double min_distance = Double.MAX_VALUE;
              double closest_centroid = 0;
              for (double centroid : centroids) {
                  double distance = Math.abs(centroid - (Double)input.get(0));
                  if (distance < min_distance) {
                      min_distance = distance;
                      closest_centroid = centroid;
                  }
              }
              return closest_centroid;
          }
      }
- 9. mapreduce(input, output, map, reduce)
  - input: one or more hdfs paths or output of other mapreduce jobs
  - output: hdfs path, default to temp location
  - map: a function of two args returning a keyval(), default identity
  - reduce: a function of two args returning a keyval(), default none
- 10.

      map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
      reduce = function(k, vv) keyval(k, length(vv))
- 11.

      condition = function(x) x > 10
      out = mapreduce(
        input = input,
        map = function(k, v) if (condition(v)) keyval(k, v))
- 12.

      x = from.dfs(hdfs.object)
      hdfs.object = to.dfs(x)
- 13.

      INSERT OVERWRITE TABLE pv_gender_sum
      SELECT pv_users.gender, count (DISTINCT pv_users.userid)
      FROM pv_users
      GROUP BY pv_users.gender;

      mapreduce(input =
        mapreduce(input = "pv_users",
          map = function(k, v) keyval(v['userid'], v['gender']),
          reduce = function(k, vv) keyval(k, vv[[1]])),
        output = "pv_gender_sum",
        map = function(k, v) keyval(v, 1),
        reduce = function(k, vv) keyval(k, sum(unlist(vv))))
- 14.

      kmeans =
        function(points, ncenters, iterations = 10,
                 distfun = function(a,b) norm(as.matrix(a-b), type = 'F')) {
          newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
          for(i in 1:iterations) {
            newCenters = lapply(values(newCenters), unlist)
            newCenters = kmeans.iter(points, distfun, centers = newCenters)}
          newCenters}

      kmeans.iter =
        function(points, distfun, ncenters = length(centers), centers = NULL) {
          from.dfs(
            mapreduce(input = points,
              map = if (is.null(centers)) {
                      function(k,v) keyval(sample(1:ncenters,1),v)}
                    else {
                      function(k,v) {
                        distances = lapply(centers, function(c) distfun(c,v))
                        keyval(centers[[which.min(distances)]],v)}},
              reduce = function(k,vv) keyval(NULL, apply(do.call(rbind,vv),2,mean))))}
- 15. input.specs, output.specs; combine; reduce.on.data.frame; tuning.params; verbose; local, hadoop backends; profiling; managed IO; optimize
- 16.

      mapreduce(mapreduce(…

      mapreduce(input = c(input1, input2), …)

      equijoin(left.input = input1, right.input = input2, …)

      out1 = mapreduce(…)
      mapreduce(input = out1, <xyz>)
      mapreduce(input = out1, <abc>)

      abstract.job = function(input, output, …) {
        …
        result = mapreduce(input = input, output = output)
        …
        result}
- 17.
  - repo: github.com/RevolutionAnalytics/RHadoop/
  - license: Apache 2.0
  - documentation: R help, github wiki
  - Q/A: github issue tracking
  - email: rhadoop@revolutionanalytics.com
  - project lead: David Champagne
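
The k-means job on slide 14 follows a simple pattern: the map step assigns each point to its nearest center and the reduce step averages each group. The local Python sketch below shows that pattern under simplifying assumptions: points are one-dimensional, the starting centers are given explicitly instead of being drawn at random, a center that attracts no points is dropped, and the function names and data are made up for illustration. It is not the rmr implementation.

```python
from collections import defaultdict

def kmeans_iter(points, centers):
    """One pass: group points by nearest center (map), then average each group (reduce)."""
    groups = defaultdict(list)
    for p in points:
        nearest = min(centers, key=lambda c: abs(c - p))   # map: key = nearest center
        groups[nearest].append(p)
    # reduce: the mean of each group becomes a new center
    return sorted(sum(g) / len(g) for g in groups.values())

def kmeans(points, centers, iterations=10):
    """Repeat the map/reduce pass a fixed number of times."""
    for _ in range(iterations):
        centers = kmeans_iter(points, centers)
    return centers

points = [1.0, 1.2, 0.8, 3.9, 4.1, 4.0]   # invented data with two obvious clusters
centers = kmeans(points, centers=[0.0, 5.0])
```

Each iteration is one complete job, so the driver loop stays in ordinary code while only the inner pass would need to run on the cluster.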
