Introduction of R on Hadoop

Transcript

  • 1. Introduction of R on Hadoop, A-Tsai (Chung-Tsai Su), SPN, 2013/10/1
  • 2. Agenda
    –  When Should You Use R?
    –  When Should You Consider Hadoop?
    –  How to Use R on Hadoop? (RHadoop, R + Hadoop Streaming, RHIPE)
    –  Demo
    –  Conclusions
  • 3. When Should You Use R? (Page 16)
  • 4. http://3.bp.blogspot.com/-SbrlR5E0tks/UGCxeL_f5YI/AAAAAAAAL3M/lroU3yF-3_0/s1600/BigDataLandscape.png
  • 5. https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
  • 6. When should you consider Hadoop? (Page 576)
  • 7. (Page 576)
  • 8. (Page 576) RHadoop
  • 9. RHadoop
  • 10. Packages of RHadoop http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154359c29bf970c-800wi
  • 11. RHadoop
  • 12. Installation (in the textbook, via devtools; refer to page 581)
    > library(devtools)
    > install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz")
    Installing rmr_1.3.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz
    Installing rmr
    Installing dependencies for rmr: ...
    > # make sure to set HADOOP_HOME to the location of your Hadoop installation,
    > # HADOOP_CONF to the location of your Hadoop config files, and make sure
    > # that the Hadoop bin directory is on your path
    > Sys.setenv(HADOOP_HOME="/Users/jadler/src/hadoop-0.20.2-cdh3u4")
    > Sys.setenv(HADOOP_CONF=paste(Sys.getenv("HADOOP_HOME"), "/conf", sep=""))
    > Sys.setenv(PATH=paste(Sys.getenv("PATH"), ":", Sys.getenv("HADOOP_HOME"), "/bin", sep=""))
    > install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz")
    Installing rhdfs_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz
    Installing rhdfs
    ...
    > install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz")
  • 13. Installation (http://blog.fens.me/rhadoop-rhadoop/)
    –  Download the RHadoop packages from https://github.com/RevolutionAnalytics/RHadoop/wiki
    –  $ R CMD javareconf
    –  $ R, then install rJava, reshape2, Rcpp, iterators, itertools, digest, RJSONIO, functional, and bitops; > q()
    –  $ R CMD INSTALL rhdfs_1.0.6.tar.gz
    –  $ R CMD INSTALL rmr2_2.2.2.tar.gz
    –  Check that the installation succeeded:
       > library(rhdfs)
       > hdfs.init()
       > hdfs.ls("/user")
  • 14. First Example: WordCount
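    The WordCount code itself is not transcribed on this slide. A minimal sketch of WordCount with rmr2 (assuming rmr2 is installed and HADOOP_CMD points at your Hadoop installation; the input path is hypothetical) could look like:

    ```r
    # Sketch of WordCount with rmr2; the HDFS input path is a placeholder.
    library(rmr2)

    wordcount <- function(input, output = NULL) {
      mapreduce(
        input = input,
        output = output,
        input.format = "text",
        # map: split each line into words, emit (word, 1) pairs
        map = function(k, lines) {
          words <- unlist(strsplit(lines, "\\s+"))
          keyval(words, 1)
        },
        # reduce: sum the counts for each word
        reduce = function(word, counts) {
          keyval(word, sum(counts))
        })
    }

    result <- from.dfs(wordcount("/user/spndc/input.txt"))
    ```

    from.dfs() pulls the resulting key/value pairs back into the local R session, which is convenient for small outputs like word counts.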
  • 15. Hadoop Portal
  • 16. An example RHadoop application: Mortality Public Use File Documentation
    –  The dataset contains a record of every death in the United States, including the cause of death and demographic information about the deceased. (In 2009, the mortality data file was 1.1GB and contained 2,441,219 records.)
    $ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip
    $ unzip mort2009us.zip
    $ hadoop fs -mkdir mort09
    $ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
    $ hadoop fs -ls mort09
    Found 1 items
    -rw-r--r--   3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB
  • 17. /home/spndc/src/Rhadoop/mort09.R (1/3): a read.fwf-style field-width schema; fields named .X* are unused filler columns.
    mort.schema <- c(
      .X0=19, ResidentStatus=1, .X1=40, Education1989=2, Education2003=1,
      EducationFlag=1, MonthOfDeath=2, .X2=2, Sex=1, AgeDetail=4,
      AgeSubstitution=1, AgeRecode52=2, AgeRecode27=2, AgeRecode12=2,
      AgeRecodeInfant22=2, PlaceOfDeath=1, MaritalStatus=1,
      DayOfWeekofDeath=1, .X3=16, CurrentDataYear=4, InjuryAtWork=1,
      MannerOfDeath=1, MethodOfDisposition=1, Autopsy=1, .X4=34,
      ActivityCode=1, PlaceOfInjury=1, ICDCode=4, CauseRecode358=3,
      .X5=1, CauseRecode113=3, CauseRecode130=3, CauseRecode39=2,
      .X6=1, Conditions=281, .X8=1, Race=2, BridgeRaceFlag=1,
      RaceImputationFlag=1, RaceRecode3=1, RaceRecode5=1, .X9=33,
      HispanicOrigin=3, .X10=1, HispanicOriginRecode=1)
    > # according to the documentation, each line is 488 characters long
    > sum(mort.schema)
    [1] 488
  • 18. /home/spndc/src/Rhadoop/mort09.R (2/3)
  • 19. /home/spndc/src/Rhadoop/ mort09.R (3/3)
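    The bodies of parts 2/3 and 3/3 are not transcribed, but the streaming scripts later in the deck rely on an unpack.line(line, schema) helper. A plausible sketch, using only the field widths defined in mort.schema above (the helper name comes from the slides; its exact body is an assumption), is:

    ```r
    # Sketch: split one 488-character fixed-width record into named fields
    # using the widths in mort.schema; fields named .X* are layout filler
    # and are dropped.
    unpack.line <- function(line, schema) {
      ends   <- cumsum(schema)
      starts <- ends - schema + 1
      fields <- substring(line, starts, ends)
      names(fields) <- names(schema)
      fields[!grepl("^\\.X", names(fields))]
    }
    ```

    substring() is vectorized over start/end positions, so one call slices out every field of the record at once.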
  • 20. /home/spndc/src/Rhadoop/mort09_1.R (1/4)
  • 21. /home/spndc/src/Rhadoop/mort09_1.R (2/4)
  • 22. /home/spndc/src/Rhadoop/mort09_1.R (3/4)
  • 23. /home/spndc/src/Rhadoop/mort09_1.R (4/4)
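    The contents of mort09_1.R are likewise not transcribed; the streaming scripts on slides 26 and 27 compute the average age at death per cause-of-death recode, and an rmr2 version of the same job might be sketched as follows (mapreduce() and keyval() are rmr2 functions; unpack.line, age.decode, and the mort09 input path are assumed from the surrounding slides):

    ```r
    # Sketch: average AgeDetail per CauseRecode39, mirroring map.R/reduce.R.
    # Assumes mort.schema, unpack.line() and age.decode() as in mort09.R.
    library(rmr2)

    avg.age.by.cause <- mapreduce(
      input = "mort09",
      input.format = "text",
      # map: emit (cause recode, decoded age) for each record
      map = function(k, lines) {
        parsed <- lapply(lines, unpack.line, schema = mort.schema)
        keyval(sapply(parsed, `[[`, "CauseRecode39"),
               sapply(parsed, function(p) age.decode(p[["AgeDetail"]])))
      },
      # reduce: average the ages for each cause
      reduce = function(cause, ages) {
        keyval(cause, mean(ages, na.rm = TRUE))
      })
    ```

    Compared with the streaming version, rmr2 handles the grouping by key itself, so there is no need for the manual current-key bookkeeping seen in reduce.R.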
  • 24. R + Hadoop Streaming
  • 25. Hadoop Streaming http://biomedicaloptics.spiedigitallibrary.org/data/Journals/BIOMEDO/23543/125003_1_2.png
  • 26. /home/spndc/src/Rhadoop/map.R
    #! /usr/bin/env Rscript
    mort.schema <- ...
    unpack.line <- ...
    age.decode <- ...
    con <- file("stdin", open="r")
    while (length(line <- readLines(con, n=1)) > 0) {
      parsed <- unpack.line(line, mort.schema)
      write(paste(parsed[["CauseRecode39"]],
                  age.decode(parsed[["AgeDetail"]]),
                  sep="\t"),
            stdout())
    }
    close(con)
  • 27. /home/spndc/src/Rhadoop/reduce.R
    #! /usr/bin/env Rscript
    cause.decode <- ...
    con <- file("stdin", open="r")
    current.key <- NA
    cumulative.age <- 0
    count <- 0
    print.results <- function(k, n, d) {
      write(paste(cause.decode(k), n/d, sep="\t"), stdout())
    }
    while (length(line <- readLines(con, n=1)) > 0) {
      parsed <- strsplit(line, "\t")
      key <- parsed[[1]][1]
      value <- type.convert(parsed[[1]][2], as.is=TRUE)
      if (is.na(current.key)) {
        current.key <- key
      } else if (current.key != key) {
        print.results(current.key, cumulative.age, count)
        current.key <- key
        cumulative.age <- 0
        count <- 0
      }
      if (!is.na(value)) {
        cumulative.age <- cumulative.age + value
        count <- count + 1
      }
    }
    close(con)
    print.results(current.key, cumulative.age, count)
  • 28. /home/spndc/src/Rhadoop/streaming.sh
    #!/bin/sh
    /usr/java/hadoop-1.2.0/bin/hadoop jar \
      /usr/java/hadoop-1.2.0/contrib/streaming/hadoop-streaming-1.2.0.jar \
      -input mort09 -output averagebycondition \
      -mapper map.R -reducer reduce.R \
      -file map.R -file reduce.R
  • 29. Output
    [spndc@localhost hadoop-1.2.0]$ bin/hadoop fs -text averagebycondition/part-00000
    Tuberculosis                                                60.5
    Malignant neoplasms of cervix uteri, corpus uteri and ovary 68.0631578947368
    Malignant neoplasm of prostate                              78.0705882352941
    Malignant neoplasms of urinary tract                        72.5656565656566
    Non-Hodgkin's lymphoma                                      69.56
    Leukemia                                                    72.8674698795181
    Other malignant neoplasms                                   66.8361581920904
    Diabetes mellitus                                           68.2723404255319
    Alzheimer's disease                                         85.419795221843
    Hypertensive heart disease with or without renal disease    68.0833333333333
    Ischemic heart diseases                                     72.1750619322874
    Other diseases of heart                                     74.925
    Essential                                                   70.468085106383
    Cerebrovascular diseases                                    76.0950639853748
    Atherosclerosis                                             80.12
    …
    Malignant neoplasm of breast                                67.3815789473684
  • 30. RHipe
  • 31. RHIPE http://www.datadr.org/index.html
  • 32. Installation
  • 33. API RHIPE v0.65.3
  • 34. Example RHIPE v0.65.3
  • 35. Recommendation System
  • 36. Live Demo
  • 37. Conclusions
    –  RHadoop is a good way to scale out, but it might not be the best way.
    –  RHadoop is still in a fast development cycle, so be aware of backward-compatibility issues.
    –  So far, SPN has no plan to adopt RHadoop for data analysis.
    –  One R fan suggests that using Pig with R works better than using RHadoop directly.
  • 38. References
    –  RHadoop wiki: https://github.com/RevolutionAnalytics/RHadoop/wiki
    –  RHIPE: http://www.datadr.org/
    –  RHadoop practice article series (RHadoop實踐系列文章): http://blog.fens.me/series-rhadoop/
    –  Howie's lab (阿貝好威的實驗室): http://lab.howie.tw/2013/01/Big-Data-Analytic-Weka-vs-Mahout-vs-R.html
    –  A first experience integrating R and Hadoop (R and Hadoop 整合初體驗): http://michaelhsu.tw/2013/05/01/r-and-hadoop-%E5%88%9D%E9%AB%94%E9%A9%97/
  • 39. Thank You
  • 40. Backup