Big Data Analysis With RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
About Me
 Co-Founder of NumerInfo
 Ex-Trend Micro Engineer
 ywchiu-tw.appspot.com
R + Hadoop
http://cdn.overdope.com/wp-content/uploads/2014/03/dragon-ball-z-x-hmn-alns-fusion-1.jpg
Why Use RHadoop
 Scaling R
Hadoop enables R to do parallel computing
 No need to learn a new language
Learning to use Java takes time
RHadoop Architecture
[Architecture diagram: rhdfs talks to HDFS, rmr2 talks to MapReduce
through the Streaming API, and rhbase talks to HBase through the
HBase Thrift Gateway; all three packages are driven from R.]
Streaming vs. Native Java
 Enables developers to write the mapper/reducer in
any scripting language (R, Python, Perl)
 Mapper, reducer, and optional combiner processes are
written to read from standard input and write to
standard output
 A streaming job has the additional overhead of
starting a scripting VM
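To make the streaming contract concrete, here is a minimal sketch (not
from the slides) of a standalone streaming word-count mapper written in
R: it reads lines from standard input and writes tab-separated key/value
pairs to standard output, which is the interface Hadoop Streaming
expects. The file name mapper.R is hypothetical; rmr2 generates this
plumbing for you.
#!/usr/bin/env Rscript
# mapper.R - illustrative hand-rolled streaming word-count mapper
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(line, "\\s+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)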
rmr2
 Write MapReduce jobs in R
 The mapreduce function (a minimal call is sketched below)
mapreduce(input, output, map, reduce, …)
 Changelog
rmr 3.0.0 (2014/02/10): 10X faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): supports plyrmr
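As a minimal illustration of the signature (adapted from the benchmark
slides later in this deck), the job below squares a vector of integers;
it assumes rmr2 is installed and HADOOP_CMD / HADOOP_STREAMING are set
as shown in the installation slides.
library(rmr2)
# Write the input to HDFS, run a map-only job, read the result back.
small.ints <- to.dfs(1:10)
out <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))
from.dfs(out)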
rhdfs
 Access HDFS from R
 Exchange data between R data frames and HDFS, as
sketched below
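A minimal sketch of the round trip, assuming rhdfs is installed,
HADOOP_CMD is set, and hdfs.init() succeeds; the paths and file names
are made up.
library(rhdfs)
hdfs.init()
# Copy a local file into HDFS and list the target directory.
hdfs.mkdir("/user/cloudera/demo")
hdfs.put("local_data.csv", "/user/cloudera/demo")
hdfs.ls("/user/cloudera/demo")
# Copy the file back out of HDFS.
hdfs.get("/user/cloudera/demo/local_data.csv", "local_copy.csv")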
rhbase
 Exchange data between R and HBase, as sketched below
 Uses the Thrift API
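A minimal sketch of the rhbase workflow, assuming rhbase is installed
and an HBase Thrift server is running on the default port; the table
name, column family, and values are made up.
library(rhbase)
hb.init()                   # connect to the HBase Thrift gateway
hb.new.table("demo", "cf")  # create a table with one column family
# Insert one row: list(list(rowkey, column vector, value list)).
hb.insert("demo", list(list("row1", c("cf:greeting"), list("hello"))))
hb.get("demo", "row1")      # read the row back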
plyrmr (NEW!)
 Performs common data manipulation operations,
as found in plyr and reshape2
 Provides a familiar plyr-like interface while hiding
many of the mapreduce details
 plyr: tools for splitting, applying, and combining data
RHadoop
Installation
Prerequisites
 R and related packages should be installed on
each task node of the cluster
 A Hadoop cluster: CDH3 and higher, or Apache 1.0.2 and
higher (limited to MR1, not MR2; MR2 is supported from
Apache 2.2.0 or HDP2)
Getting Ready (Cloudera VM)
 Download
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
 This VM runs CentOS 6.2, CDH 4.4, R 3.0.1, and Java 1.6.0_32
CDH 4.4
Get RHadoop
 https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Installing rmr2 dependencies
 Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "Rcpp", "RJSONIO",
"bitops", "digest", "functional", "stringr", "plyr",
"reshape2", "rJava", "caTools"))
Install rmr2
$ wget --no-check-certificate
https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Installing…
Downgrade Rcpp
 If the rmr2 install fails, downgrade Rcpp to an archived
release from CRAN:
http://cran.r-project.org/src/contrib/Archive/Rcpp/
Install Rcpp_0.11.0
$ wget --no-check-certificate
http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install rmr2 again
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install RHDFS
$ wget --no-check-certificate
https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
Enable HDFS
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
Javareconf error
$ sudo R CMD javareconf
javareconf with correct JAVA_HOME
 If the command above fails, rerun it with JAVA_HOME set
explicitly:
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
MapReduce
With RHadoop
MapReduce
 mapreduce(input, output, map, reduce)
 Like sapply, lapply, tapply within R
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Move File Into HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
# Shell equivalent:
$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
Wordcount Mapper
map <- function(k, lines) {
  # split each input line on whitespace
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
# Native Java equivalent (Mapper):
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Wordcount Reducer
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
# Native Java equivalent (Reducer):
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text", map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
# Shell equivalent:
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2 <- 1:100000
> result.normal <- sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
> b.time <- proc.time()
> small.ints <- to.dfs(1:100000)
> result <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> proc.time() - b.time
sapply
Elapsed: 0.982 seconds
mapreduce
Elapsed: 102.755 seconds
Hadoop Latency
 HDFS stores your files as data chunks distributed on
multiple datanodes
 M/R runs multiple programs called mappers on each of
the data chunks or blocks; the (key, value) output of
these mappers is combined into results by reducers
 It takes time for mappers and reducers to be spawned
on these distributed systems
Kmeans Clustering
R's built-in machine learning functions cannot be applied
directly inside a MapReduce program:
kcluster <- kmeans(mydata, 4, iter.max = 10)
The algorithm has to be rewritten in MapReduce style:
Kmeans in MapReduce Style
kmeans <-
  function(points, ncenters, iterations = 10, distfun = NULL) {
    if (is.null(distfun))
      distfun <- function(a, b) norm(as.matrix(a - b), type = 'F')
    newCenters <- kmeans.iter(points, distfun, ncenters = ncenters)
    # iteratively choose new centers
    for (i in 1:iterations) {
      newCenters <- kmeans.iter(points, distfun, centers = newCenters)
    }
    newCenters
  }
kmeans.iter <-
  function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
    from.dfs(
      mapreduce(input = points,
                map = if (is.null(centers)) {
                  # assign each point to a random center
                  function(k, v) keyval(sample(1:ncenters, 1), v)
                } else {
                  function(k, v) {
                    # find the center at minimum distance
                    distances <- apply(centers, 1, function(c) distfun(c, v))
                    keyval(centers[which.min(distances), ], v)
                  }
                },
                reduce = function(k, vv)
                  keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
      to.data.frame = TRUE)
  }
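A sketch of how the routine above might be invoked (not from the
slides; the sample data and cluster count are made up): write a points
matrix to HDFS with to.dfs, then call the MapReduce-style kmeans
defined above, which shadows the built-in stats::kmeans.
library(rmr2)
# Two noisy 2-D clusters, written to HDFS.
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
points.dfs <- to.dfs(pts)
# Five iterations of the MapReduce kmeans with two centers.
centers <- kmeans(points.dfs, ncenters = 2, iterations = 5)
centers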
One More Thing…
plyrmr
 Performs common data manipulation operations,
as found in plyr and reshape2
 Provides a familiar plyr-like interface while hiding
many of the mapreduce details
 plyr: tools for splitting, applying, and combining data
Installing plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
Installing plyrmr
$ wget --no-check-certificate
https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Transform in plyrmr
> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)
> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")
select and where
where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
Group by
as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
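As a sanity check (not from the slides), here is the same aggregation
in plain R on the built-in mtcars data, for comparison with the
distributed version:
# Plain-R equivalent of the plyrmr group-by above.
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
#   cyl      mpg
# 1   4 26.66364
# 2   6 19.74286
# 3   8 15.10000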
Reference
 https://github.com/RevolutionAnalytics/RHadoop/wiki
 http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
 http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
Contact
 Website
ywchiu-tw.appspot.com
 Email
david@numerinfo.com
tr.ywchiu@gmail.com
 Company
numerinfo.com
Description
When people speak of big data analysis, what usually comes to mind is HDFS and MapReduce within Hadoop. But writing a MapReduce program traditionally means learning native Java. Is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs instead? And can the integration of R and Hadoop truly unleash the power of parallel computing for big data analysis?

These slides show how to install RHadoop step by step and how to write a MapReduce program in R. They also discuss whether RHadoop really lights the way for big data analysis, or is just another method of writing MapReduce programs.

Please email me if you find any problems with the slides. EMAIL: tr.ywchiu@gmail.com

When it comes to big data, what usually comes to mind is Hadoop's MapReduce and HDFS, but writing MapReduce means learning Java or going through the Thrift interface. Can R run on Hadoop at all? And does R + Hadoop really bring R's powerful analytics to bear on massive data?
This talk walks step by step through installing the RHadoop packages on Hadoop and shows how to write MapReduce programs in R. More importantly, it asks whether RHadoop is a guiding light for big data analysis, or merely another implementation approach.
