RHadoop, R meets Hadoop
Presented by Antonio Piccolboni at the Strata 2012 Conference, Feb 29, 2012.

RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. RHadoop comprises three packages:

- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr allows you to write mapreduce programs in R
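
For orientation, here is a minimal sketch of how the three packages divide the work. It assumes the 1.x-era APIs (hdfs.init/hdfs.ls for rhdfs, hb.init/hb.list.tables for rhbase) and a placeholder HDFS path; treat the exact names as indicative rather than authoritative.

    library(rhdfs)
    hdfs.init()                       # connect to HDFS
    hdfs.ls("/user/me")               # file-level operations (assumed API)

    library(rhbase)
    hb.init()                         # connect to HBase via its thrift server (assumed API)
    hb.list.tables()                  # enumerate tables

    library(rmr)
    mapreduce(input = "/user/me/data",            # a mapreduce job written in R
              map = function(k, v) keyval(k, v))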

rmr allows R developers to program in the mapreduce framework, and offers all developers an alternative way to implement mapreduce programs that strikes a delicate compromise between power and usability. It lets you write general mapreduce programs, offering the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It is made up of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
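
To give a concrete sense of the "half a page" claim, here is a sketch of gradient-descent logistic regression in the rmr 1.x style. The data layout (one keyval pair per point, with fields x and y), the learning rate alpha, and the shape of from.dfs's return value are assumptions for illustration, not the talk's exact code.

    logistic.regression = function(input, iterations = 10, dims = 2, alpha = 0.05) {
      plane = rep(0, dims)                        # current separating hyperplane
      g = function(z) 1 / (1 + exp(-z))           # logistic function
      for (i in 1:iterations) {
        gradient = from.dfs(
          mapreduce(
            input = input,
            # each point emits its gradient contribution under a single key
            map = function(k, v)
              keyval(1, v$y * v$x * g(-v$y * sum(plane * v$x))),
            # sum all the contributions
            reduce = function(k, vv) keyval(k, colSums(do.call(rbind, vv)))))
        plane = plane + alpha * gradient[[1]]$val # assumed: a list of keyval pairs
      }
      plane
    }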

Speaker notes:
  • What is R; what is RHadoop. An open source project started by Revolution that aims to make R and Hadoop work together; what is Revolution.
  • Faster, assured builds; large data extensions; web deployments; tech support; consulting services; training.
  • Hadoop brings horizontal scalability; R brings sophisticated analytics; the combination could be powerful.
  • Hadoop is one project but also a family of projects. We started the integration path with three projects targeting three members of the Hadoop family. rhdfs provides access to the HDFS file system and can be divided into two sub-APIs: file level and byte level.
  • A way to access big data sets.
  • A simple way to write parallel programs – everyone will have to.
  • Very R-like, building on the functional characteristics of R.
  • Just a library.
  • Much simpler than writing Java; not as simple as Hive or Pig at what they do, but more general. Great for prototyping; can transition to production – optimize instead of rewriting! Lower risk, always executable.
  • mapreduce is the first and most important element of the API. The input can be as simple as a path; the output likewise, or skip it for managed space with stubs. map and reduce are simple R functions, as opposed to Rhipe.
  • Simple map example: filtering. Reduce example: counting.
  • Easy to parametrize jobs.
  • The second pillar of the API: the memory–HDFS bridge.
  • A language like Hive makes a class of problems easy to solve, but it is not a general tool. The cost of doing the same operation in rmr is modest, and it provides a broader set of capabilities.
  • kmeans implementation in two simple functions; note how easy it is to get data in and out of the cluster.
  • Skip quickly to the other slides; notice three different languages.
  • More things you can do by combining the elements of the API.

    1. RHadoop, Hadoop for R
    2. [Chart: scholarly activity, change 2005–09 (from +50% to −37.5%), for R, SAS, SPSS, S-Plus, Stata]
    3. [Same chart, plus a panel of package counts (log scale, 1 to 10,000) over 2002–2010]
    4. [Same chart, with source: http://r4stats.com/popularity]
    5. David Champagne, CTO
    6. rhdfs
    7. rhdfs, rhbase
    8. rhdfs, rhbase, rmr
    9. rmr
    10. sapply(data, function)
        mapreduce(data, map = function)
    11. library(rmr)
        mapreduce(…)
    12–19. [Diagram, built up one element per slide: MapReduce programming environments arranged along an axis from "Expose MR" to "Hide MR". Exposing MapReduce: rmr, Rhipe, Dumbo, Pydoop, Hadoopy and, at a lower level, Java, C++. Hiding it: Hive, Pig, with Cascading, Crunch, Cascalog, Scalding, Scrunch in between.]
    20–25. mapreduce(input, output, map, reduce)  (the same signature shown six times, stepping through the arguments)
    26–33. map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
           reduce = function(k, vv) keyval(k, length(vv))
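        Wired into a complete job, the map/reduce pair above keeps roughly one key in ten and counts the values per surviving key. A minimal sketch; the input path is a placeholder and hash stands for any integer hash of the key, as on the slide:

            sampled.counts = mapreduce(
                input = "/data/events",      # placeholder path
                map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v),
                reduce = function(k, vv) keyval(k, length(vv)))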
    34–39. condition = function(x) x > 10
           out = mapreduce(
               input = input,
               map = function(k, v)
                   if (condition(v)) keyval(k, v))
    40. x = from.dfs(hdfs.object)
        hdfs.object = to.dfs(x)
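        A round trip through the memory–HDFS bridge in the style of the rmr tutorial: a small in-memory object is written to the DFS, transformed by a job, and read back.

            small.ints = to.dfs(1:1000)              # push an R object to the DFS
            out = mapreduce(input = small.ints,
                            map = function(k, v) keyval(v, v^2))
            from.dfs(out)                            # pull the results back into memory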
    41. INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        FROM pv_users
        GROUP BY pv_users.gender;
    42. (the same query in rmr)
        mapreduce(
            input = mapreduce(
                input = "pv_users",
                map = function(k, v) keyval(v['userid'], v['gender']),
                reduce = function(uid, genders)
                    lapply(unique(genders), function(g) keyval(NULL, g))),
            output = "pv_gender_sum",
            map = function(x, gender) keyval(gender, 1),
            reduce = function(gender, counts)
                keyval(gender, sum(unlist(counts))))
    43–47. kmeans = function(points, ncenters, iterations = 10,
                   distfun = function(a, b) norm(as.matrix(a - b), type = 'F')) {
               newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
               for (i in 1:iterations) {
                   newCenters = kmeans.iter(points, distfun, centers = newCenters)}
               newCenters}

           kmeans.iter = function(points, distfun,
                   ncenters = dim(centers)[1], centers = NULL) {
               from.dfs(
                   mapreduce(
                       input = points,
                       map = if (is.null(centers)) {
                               function(k, v) keyval(sample(1:ncenters, 1), v)}
                           else {
                               function(k, v) {
                                   distances = apply(centers, 1, function(c) distfun(c, v))
                                   keyval(centers[which.min(distances), ], v)}},
                       reduce = function(k, vv)
                           keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
                   to.data.frame = T)}
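        A possible driver for the two functions above; the random data and cluster count are made up for illustration.

            # hypothetical input: 100 random 2-d points keyed by index
            points = to.dfs(lapply(1:100, function(i) keyval(i, rnorm(2))))
            clusters = kmeans(points, ncenters = 5)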
    48. (k-means in Pig, for comparison: the driver, in Python)
        #!/usr/bin/python
        import sys
        from math import fabs
        from org.apache.pig.scripting import Pig

        filename = "student.txt"
        k = 4
        tolerance = 0.01
        MAX_SCORE = 4
        MIN_SCORE = 0
        MAX_ITERATION = 100

        # initial centroids: equally divide the space
        initial_centroids = ""
        last_centroids = [None] * k
        for i in range(k):
            last_centroids[i] = MIN_SCORE + float(i) / k * (MAX_SCORE - MIN_SCORE)
            initial_centroids = initial_centroids + str(last_centroids[i])
            if i != k - 1:
                initial_centroids = initial_centroids + ":"

        P = Pig.compile("""register udf.jar
            DEFINE find_centroid FindCentroid('$centroids');
            raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
            centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
            grouped = group centroided by centroid;
            result = foreach grouped generate group, AVG(centroided.gpa);
            store result into 'output';
            """)

        converged = False
        iter_num = 0
        while iter_num < MAX_ITERATION:
            Q = P.bind({'centroids': initial_centroids})
            results = Q.runSingle()
    49. (continued)
            if not results.isSuccessful():
                raise Exception("Pig job failed")
            iter = results.result("result").iterator()
            centroids = [None] * k
            distance_move = 0
            # get the new centroids of this iteration and the distance moved
            # since the last iteration
            for i in range(k):
                tuple = iter.next()
                centroids[i] = float(str(tuple.get(1)))
                distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
            distance_move = distance_move / k
            Pig.fs("rmr output")
            print("iteration " + str(iter_num))
            print("average distance moved: " + str(distance_move))
            if distance_move < tolerance:
                sys.stdout.write("k-means converged at centroids: [")
                sys.stdout.write(",".join(str(v) for v in centroids))
                sys.stdout.write("]\n")
                converged = True
                break
            last_centroids = centroids[:]
            initial_centroids = ""
            for i in range(k):
                initial_centroids = initial_centroids + str(last_centroids[i])
                if i != k - 1:
                    initial_centroids = initial_centroids + ":"
            iter_num += 1

        if not converged:
            print("did not converge after " + str(iter_num) + " iterations")
            sys.stdout.write("last centroids: [")
            sys.stdout.write(",".join(str(v) for v in last_centroids))
            sys.stdout.write("]\n")
    50. (the Pig UDF, in Java)
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        public class FindCentroid extends EvalFunc<Double> {
            double[] centroids;

            public FindCentroid(String initialCentroid) {
                String[] centroidStrings = initialCentroid.split(":");
                centroids = new double[centroidStrings.length];
                for (int i = 0; i < centroidStrings.length; i++)
                    centroids[i] = Double.parseDouble(centroidStrings[i]);
            }

            @Override
            public Double exec(Tuple input) throws IOException {
                double min_distance = Double.MAX_VALUE;
                double closest_centroid = 0;
                for (double centroid : centroids) {
                    double distance = Math.abs(centroid - (Double) input.get(0));
                    if (distance < min_distance) {
                        min_distance = distance;
                        closest_centroid = centroid;
                    }
                }
                return closest_centroid;
            }
        }
    51. mapreduce(mapreduce(…
    52. mapreduce(input = c(input1, input2), …)
    53. equijoin = function(
            left.input, right.input, input,
            output, outer, map.left, map.right,
            reduce, reduce.all)
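        Composing by nesting, for instance, lets the inner job's output feed the outer one without naming intermediate files; paths and fields here are placeholders.

            totals = mapreduce(
                input = mapreduce(input = "/data/raw",        # placeholder path
                                  map = function(k, v) keyval(v$type, 1)),
                reduce = function(type, counts) keyval(type, sum(unlist(counts))))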
    54. out1 = mapreduce(…)
        mapreduce(input = out1, <xyz>)
        mapreduce(input = out1, <abc>)
    55. abstract.job = function(input, output, …) {
            …
            result = mapreduce(input = input, output = output)
            …
            result}
    56–62. (one feature added per slide)
        input.format, output.format, format
        combine
        reduce.on.data.frame
        local, hadoop backends
        backend.parameters
        profiling
        verbose
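        A sketch of how some of these knobs might appear on a call; the argument names come from the slides, but the values (including the "csv" format) are illustrative assumptions.

            mapreduce(
                input = "/data/logs",                 # placeholder path
                input.format = "csv",                 # assumed format name
                map = function(k, v) keyval(v[[1]], 1),
                reduce = function(k, vv) keyval(k, sum(unlist(vv))),
                combine = TRUE,                       # run the reduce as a combiner too
                verbose = FALSE)                      # quieter job logging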
    63. RHADOOP USER, ONE FAT CLUSTER AVE., HYDROPOWER CITY, OR 0x0000
        RHADOOP@REVOLUTIONANALYTICS.COM