A-Tsai (Chung-Tsai Su)
SPN
2013/10/1	
Introduction of R on Hadoop
  1. A-Tsai (Chung-Tsai Su), SPN, 2013/10/1: Introduction of R on Hadoop
  2. Agenda
•  When Should You Use R?
•  When Should You Consider Hadoop?
•  How to Use R on Hadoop?
   –  RHadoop
   –  R + Hadoop Streaming
   –  RHIPE
•  Demo
•  Conclusions
  3. When Should You Use R? (Page 16)
  4. http://3.bp.blogspot.com/-SbrlR5E0tks/UGCxeL_f5YI/AAAAAAAAL3M/lroU3yF-3_0/s1600/BigDataLandscape.png
  5. https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
  6. When Should You Consider Hadoop? (Page 576)
  7. (Page 576)
  8. (Page 576) RHadoop
  9. RHadoop
  10. Packages of RHadoop
http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154359c29bf970c-800wi
  11. RHadoop
  12. Installation (in textbook): using devtools
> library(devtools)
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz")
Installing rmr_1.3.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz
Installing rmr
Installing dependencies for rmr: ...
> # make sure to set HADOOP_HOME to the location of your Hadoop installation,
> # HADOOP_CONF to the location of your Hadoop config files, and make sure
> # that the Hadoop bin directory is on your path
> Sys.setenv(HADOOP_HOME="/Users/jadler/src/hadoop-0.20.2-cdh3u4")
> Sys.setenv(HADOOP_CONF=paste(Sys.getenv("HADOOP_HOME"), "/conf", sep=""))
> Sys.setenv(PATH=paste(Sys.getenv("PATH"), ":", Sys.getenv("HADOOP_HOME"), "/bin", sep=""))
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz")
Installing rhdfs_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz
Installing rhdfs
...
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz")
(Refer to page 581)
  13. Installation (http://blog.fens.me/rhadoop-rhadoop/)
•  Download the RHadoop packages from https://github.com/RevolutionAnalytics/RHadoop/wiki
•  $ R CMD javareconf
•  $ R
   –  Install rJava, reshape2, Rcpp, iterators, itertools, digest, RJSONIO, functional, and bitops.
•  > q()
•  $ R CMD INSTALL rhdfs_1.0.6.tar.gz
•  $ R CMD INSTALL rmr2_2.2.2.tar.gz
•  Check that the installation succeeded:
   –  > library(rhdfs)
   –  > hdfs.init()
   –  > hdfs.ls("/user")
  14. First Example: WordCount
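The WordCount example above is the usual first program with RHadoop's rmr2 package. A minimal sketch of how it is typically written, assuming rmr2 is installed; the rmr.options(backend = "local") call lets the job run in-process without a Hadoop cluster, and to.dfs() here stands in for real HDFS input:

```r
library(rmr2)                    # RHadoop's MapReduce package
rmr.options(backend = "local")   # run locally; use "hadoop" on a real cluster

wordcount <- function(input) {
  mapreduce(
    input = input,
    map = function(k, lines) {
      # emit (word, 1) for every whitespace-separated token
      words <- unlist(strsplit(lines, "[[:space:]]+"))
      keyval(words[words != ""], 1)
    },
    reduce = function(word, counts) keyval(word, sum(counts)))
}

# toy input: two lines written to the (local) DFS
result <- from.dfs(wordcount(to.dfs(c("hello hadoop", "hello r"))))
```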
  15. Hadoop Portal
  16. An example RHadoop application
•  Mortality Public Use File Documentation
   –  The dataset contains a record of every death in the United States, including the cause of death and demographic information about the deceased. (In 2009, the mortality data file was 1.1GB and contained 2,441,219 records.)
$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip
$ unzip mort2009us.zip
$ hadoop fs -mkdir mort09
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB
  17. /home/spndc/src/Rhadoop/mort09.R (1/3): the read.fwf schema
Field widths for read.fwf; names beginning with .X mark filler columns to skip.
mort.schema <- c(
  .X0=19, ResidentStatus=1, .X1=40, Education1989=2, Education2003=1,
  EducationFlag=1, MonthOfDeath=2, .X2=2, Sex=1, AgeDetail=4,
  AgeSubstitution=1, AgeRecode52=2, AgeRecode27=2, AgeRecode12=2,
  AgeRecodeInfant22=2, PlaceOfDeath=1, MaritalStatus=1, DayOfWeekofDeath=1,
  .X3=16, CurrentDataYear=4, InjuryAtWork=1, MannerOfDeath=1,
  MethodOfDisposition=1, Autopsy=1, .X4=34, ActivityCode=1, PlaceOfInjury=1,
  ICDCode=4, CauseRecode358=3, .X5=1, CauseRecode113=3, CauseRecode130=3,
  CauseRecode39=2, .X6=1, Conditions=281, .X8=1, Race=2, BridgeRaceFlag=1,
  RaceImputationFlag=1, RaceRecode3=1, RaceRecode5=1, .X9=33,
  HispanicOrigin=3, .X10=1, HispanicOriginRecode=1)
> # according to the documentation, each line is 488 characters long
> sum(mort.schema)
[1] 488
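The schema idea can be exercised on a toy file: read.fwf takes a vector of field widths, and the .X* entries name filler columns to discard after reading. A minimal base-R sketch with a hypothetical 7-character record layout (not the real mortality schema):

```r
# Hypothetical 7-character records: 3 filler chars, 1-char Sex, 3-char Age
schema <- c(.X0 = 3, Sex = 1, Age = 3)
tmp <- tempfile()
writeLines(c("001M042", "002F078"), tmp)

d <- read.fwf(tmp, widths = unname(schema), col.names = names(schema),
              colClasses = "character")
d <- d[, !grepl("^\\.X", names(d)), drop = FALSE]  # drop filler columns
d$Age <- as.integer(d$Age)
# d$Sex is c("M", "F"); d$Age is c(42L, 78L)
```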
  18. /home/spndc/src/Rhadoop/mort09.R (2/3)
  19. /home/spndc/src/Rhadoop/mort09.R (3/3)
  20. /home/spndc/src/Rhadoop/mort09_1.R (1/4)
  21. /home/spndc/src/Rhadoop/mort09_1.R (2/4)
  22. /home/spndc/src/Rhadoop/mort09_1.R (3/4)
  23. /home/spndc/src/Rhadoop/mort09_1.R (4/4)
  24. R + Hadoop Streaming
  25. Hadoop Streaming
http://biomedicaloptics.spiedigitallibrary.org/data/Journals/BIOMEDO/23543/125003_1_2.png
  26. /home/spndc/src/Rhadoop/map.R
#! /usr/bin/env Rscript
mort.schema <- ...
unpack.line <- ...
age.decode <- ...
con <- file("stdin", open="r")
while (length(line <- readLines(con, n=1)) > 0) {
  parsed <- unpack.line(line, mort.schema)
  write(paste(parsed[["CauseRecode39"]],
              age.decode(parsed[["AgeDetail"]]), sep="\t"), stdout())
}
close(con)
  27. /home/spndc/src/Rhadoop/reduce.R
#! /usr/bin/env Rscript
cause.decode <- ...
con <- file("stdin", open="r")
current.key <- NA
cumulative.age <- 0
count <- 0
print.results <- function(k, n, d) {
  write(paste(cause.decode(k), n/d, sep="\t"), stdout())
}
while (length(line <- readLines(con, n=1)) > 0) {
  parsed <- strsplit(line, "\t")
  key <- parsed[[1]][1]
  value <- type.convert(parsed[[1]][2], as.is=TRUE)
  if (is.na(current.key)) {
    current.key <- key
  } else if (current.key != key) {
    print.results(current.key, cumulative.age, count)
    current.key <- key
    cumulative.age <- 0
    count <- 0
  }
  if (!is.na(value)) {
    cumulative.age <- cumulative.age + value
    count <- count + 1
  }
}
close(con)
print.results(current.key, cumulative.age, count)
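Because Hadoop Streaming delivers records to the reducer sorted by key, reduce.R only has to detect key changes to compute per-key averages. The same computation can be sanity-checked locally in plain base R with tapply (toy keys here, not the real cause codes):

```r
# Simulated sorted mapper output: key<TAB>value lines
lines <- c("A\t10", "A\t20", "B\t5")
parts <- strsplit(lines, "\t")
keys  <- vapply(parts, `[`, "", 1)
vals  <- as.numeric(vapply(parts, `[`, "", 2))
avg   <- tapply(vals, keys, mean)   # the averages reduce.R would print
# avg["A"] is 15, avg["B"] is 5
```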
  28. /home/spndc/src/Rhadoop/streaming.sh
#!/bin/sh
/usr/java/hadoop-1.2.0/bin/hadoop \
  jar /usr/java/hadoop-1.2.0/contrib/streaming/hadoop-streaming-1.2.0.jar \
  -input mort09 \
  -output averagebycondition \
  -mapper map.R \
  -reducer reduce.R \
  -file map.R \
  -file reduce.R
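Since a streaming job is just stdin/stdout piping, it can be smoke-tested without a cluster as map | sort | reduce. A sketch using awk stand-ins for map.R and reduce.R (the real R scripts would drop into the same pipeline positions):

```shell
# Local dry run of a streaming job: the mapper emits key<TAB>value,
# sort groups records by key, and the reducer averages values per key.
printf 'x 1\ny 2\nx 3\n' \
  | awk '{print $1 "\t" $2}' \
  | sort \
  | awk -F'\t' '{sum[$1] += $2; n[$1]++}
                END {for (k in sum) printf "%s\t%s\n", k, sum[k]/n[k]}' \
  | sort
```

This is the same contract Hadoop Streaming enforces on the cluster, which is why any language that reads stdin and writes stdout can serve as mapper or reducer.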
  29. Output
[spndc@localhost hadoop-1.2.0]$ bin/hadoop fs -text averagebycondition/part-00000
Tuberculosis	60.5
Malignant neoplasms of cervix uteri, corpus uteri and ovary	68.0631578947368
Malignant neoplasm of prostate	78.0705882352941
Malignant neoplasms of urinary tract	72.5656565656566
Non-Hodgkin's lymphoma	69.56
Leukemia	72.8674698795181
Other malignant neoplasms	66.8361581920904
Diabetes mellitus	68.2723404255319
Alzheimer's disease	85.419795221843
Hypertensive heart disease with or without renal disease	68.0833333333333
Ischemic heart diseases	72.1750619322874
Other diseases of heart	74.925
Essential	70.468085106383
Cerebrovascular diseases	76.0950639853748
Atherosclerosis	80.12
…
Malignant neoplasm of breast	67.3815789473684
  30. RHIPE
  31. RHIPE http://www.datadr.org/index.html
  32. Installation
  33. API (RHIPE v0.65.3)
  34. Example (RHIPE v0.65.3)
  35. Recommendation System
  36. Live Demo
  37. Conclusions
•  RHadoop is a good way to scale out, but it may not be the best way.
•  RHadoop is still in a fast development cycle, so be aware of backward-compatibility issues.
•  So far, SPN has no plan to adopt RHadoop for data analysis.
•  One R fan suggests that using Pig with R works better than using RHadoop directly.
  38. Reference
•  RHadoop Wiki
   –  https://github.com/RevolutionAnalytics/RHadoop/wiki
•  RHIPE
   –  http://www.datadr.org/
•  RHadoop practice article series (RHadoop實踐系列文章)
   –  http://blog.fens.me/series-rhadoop/
•  Howie's lab (阿貝好威的實驗室)
   –  http://lab.howie.tw/2013/01/Big-Data-Analytic-Weka-vs-Mahout-vs-R.html
•  A first look at integrating R and Hadoop (R and Hadoop 整合初體驗)
   –  http://michaelhsu.tw/2013/05/01/r-and-hadoop-%E5%88%9D%E9%AB%94%E9%A9%97/
  39. Thank You
  40. Backup