
Integrate Hive and R

Published in: Technology

Transcript of "Integrate Hive and R"

  1. RHive : Integrating R and Hive
     Introduction
     JunHo Cho, Data Analysis Platform Team
     Friday, November 11, 11
  2. Analysis of Data
  3. Analysis of Data
     • CF
     • Classifier
     • Decision Tree
     • MapReduce
     • Recommendation
     • Graph
     • Clustering
  4. Related Works
     • RHIPE
     • RHadoop
     • hive (Hadoop InteractiVE)
     • segue
     Must understand MapReduce
  5. RHive is inspired by ...
     • Many analysts have been using R for a long time
     • Many analysts can use SQL
     • There are already a lot of statistical functions in R
     • R needs a capability to analyze big data
     • Hive supports a SQL-like query language (HQL)
     • Hive executes HQL as MapReduce jobs
     R is the best solution for familiarity
     Hive is the best solution for capability
  6. RHive Components
     • Hadoop
        • stores and analyzes big data
     • Hive
        • uses HQL instead of MapReduce programming
     • R
        • provides a friendly environment for analysts
  7. RHive - Architecture
     • Execute R Function Objects and R Objects through Hive Query
     • Execute Hive Query through R
     R function:
         rcal <- function(arg1, arg2) {
           coeff * sum(arg1, arg2)
         }
     Hive query:
         SELECT R('rcal', col1, col2) FROM tab1
     Diagram labels: Rserve, R Function, R Object, RUDF, RUDAF
  8. RHive API
     • Extension R Functions
        • rhive.connect
        • rhive.query
        • rhive.assign
        • rhive.export
        • rhive.napply
        • rhive.sapply
        • rhive.aggregate
        • rhive.list.tables
        • rhive.load.table
        • rhive.desc.table
     • Extension Hive Functions
        • RUDF
        • RUDAF
        • GenericUDTFExpand
        • GenericUDTFUnFold
  9. RUDF - R User-defined Functions
         SELECT R('R function-name', col1, col2, ..., TYPE)
     • The UDF doesn't know the return type until it calls the R function
     • TYPE : return type
     Example : R function which sums all passed columns
         sumCols <- function(arg1, ...) {
           sum(arg1, ...)
         }
         rhive.assign('sumCols', sumCols)
         rhive.exportAll('sumCols', hadoop-clusters)
         result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
         plot(result)
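To make the per-row behaviour of R('sumCols', ...) concrete, the sketch below applies the slide's function to every row of a small local data frame using base R only. It does not call RHive or touch a cluster: the data frame `tab` and its values are invented for illustration, and `mapply` merely stands in for the record-by-record evaluation that the slide describes for the Hive UDF.

```r
# Local stand-in for a Hive table; the values are illustrative only.
tab <- data.frame(
  col1 = c(1, 2),
  col2 = c(10, 20),
  col3 = c(100, 200),
  col4 = c(1000, 2000)
)

# Same function as on the slide.
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

# Evaluate sumCols once per row, roughly as a UDF is evaluated once per record.
result <- mapply(sumCols, tab$col1, tab$col2, tab$col3, tab$col4)
print(result)
```

The trailing 0.0 in the slide's query plays the role of TYPE: it tells Hive that the result column is numeric, which has no counterpart in the local version.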
  10. RUDAF - R User-defined Aggregation Function
          SELECT RA('R function-name', col1, col2, ...)
      • R cannot manipulate large datasets
      • Supports the UDAF life cycle
         • iterate, partial-merge, merge, terminate
      • Return type is only a string delimited by ',' - "data1,data2,data3,..."
      Diagram labels: partial aggregation, partial aggregation, aggregation, values; FUN, FUN.partial, FUN.merge, FUN.terminate
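The life cycle above can be sketched locally in base R. The example below is only an illustration of the iterate / partial / merge / terminate idea for a mean: the name `colMean`, the function signatures, and the `c(sum, count)` state are assumptions made for this sketch, not RHive's documented calling convention. Only the FUN, FUN.partial, FUN.merge, FUN.terminate naming pattern and the comma-delimited string result come from the slide.

```r
# Hypothetical aggregation "colMean". The state is c(sum, count).

# iterate: fold one value into the running state of a single task.
colMean <- function(state, value) {
  c(state[1] + value, state[2] + 1)
}

# partial: hand the task's state over to the merge step.
colMean.partial <- function(state) {
  state
}

# merge: combine two partial states.
colMean.merge <- function(a, b) {
  a + b
}

# terminate: emit "mean,count" as a comma-delimited string.
colMean.terminate <- function(state) {
  paste(state[1] / state[2], state[2], sep = ",")
}

# Simulate two tasks, each iterating over its own slice of the data.
slices <- list(c(1, 2, 3), c(4, 5))
partials <- lapply(slices, function(s) {
  colMean.partial(Reduce(colMean, s, c(0, 0)))
})
final <- colMean.terminate(Reduce(colMean.merge, partials, c(0, 0)))
print(final)
```

Keeping a sum and a count rather than a running mean is what makes the merge step a plain addition, so partial results from different tasks can be combined in any order.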
  11. UDTF : unfold and expand
      • RUDAF only returns a string delimited by ','
      • Convert RUDAF's result to an R data.frame
          unfold(string_value, type1, type2, ..., delimiter)
          expand(string_value, type, delimiter)
      RA('newcenter', ...) returns "num1,num2,num3" per cluster-key
          select unfold(tb1.x, 0.0, 0.0, 0.0, ',') as (col1, col2, col3)
          from (select RA('newcenter', attr1, attr2, attr3, attr4) as x
                from table group by cluster-key) tb1
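The conversion that unfold and expand perform can be imitated in base R. This is a local sketch under stated assumptions, not the Hive UDTFs themselves: `unfold_local`, `expand_local`, and the sample strings are hypothetical, and reading expand as "one string becomes several rows of one column" is inferred from its single type argument rather than stated on the slide.

```r
# Hypothetical RA('newcenter', ...) output: one delimited string per cluster key.
ra_output <- c(k1 = "1.5,2.0,3.25", k2 = "4.0,5.5,6.0")

# unfold: one string -> one row with several numeric columns.
unfold_local <- function(values, col_names, delimiter = ",") {
  parts <- strsplit(values, delimiter, fixed = TRUE)
  m <- do.call(rbind, lapply(parts, as.numeric))
  df <- as.data.frame(m)
  names(df) <- col_names
  rownames(df) <- names(values)
  df
}

# expand (assumed semantics): one string -> several rows of a single column.
expand_local <- function(value, delimiter = ",") {
  data.frame(value = as.numeric(strsplit(value, delimiter, fixed = TRUE)[[1]]))
}

centers <- unfold_local(ra_output, c("col1", "col2", "col3"))
print(centers)
print(expand_local(ra_output[["k1"]]))
```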
  12. napply and sapply
          rhive.napply(table-name, FUN, col1, ...)
          rhive.sapply(table-name, FUN, col1, ...)
      • napply : R apply function for Numeric type
      • sapply : R apply function for String type
      Example : R function which sums all passed columns
          sumCols <- function(arg1, ...) {
            sum(arg1, ...)
          }
          result <- rhive.napply("tab", sumCols, "col1", "col2", "col3", "col4")
          rhive.load.table(result)
  13. napply
      • 'napply' is similar to the R apply function
      • Stores a big result in HDFS as a Hive table
          rhive.napply <- function(tablename, FUN, col = NULL, ...) {
            # Build the ", col1, col2, ..." part of the R() argument list
            if (is.null(col)) {
              cols <- ""
            } else {
              cols <- paste(",", col)
            }
            for (element in c(...)) {
              cols <- paste(cols, ",", element)
            }
            # Export FUN to the cluster under a unique name
            exportname <- paste(tablename, "_sapply", as.integer(Sys.time()), sep = "")
            rhive.assign(exportname, FUN)
            rhive.exportAll(exportname)
            # Apply FUN through the R() UDF and store the result as a new Hive table
            tmptable <- paste(exportname, "_table", sep = "")
            rhive.query(
              paste("CREATE TABLE ", tmptable, " AS SELECT ",
                    "R('", exportname, "'", cols, ",0.0) FROM ", tablename, sep = ""))
            tmptable
          }
  14. aggregate
          rhive.aggregate(table-name, hive-FUN, ..., groups)
      • RHive aggregation function to aggregate data stored in HDFS using a Hive function
      Example : Aggregate using SUM (Hive aggregation function)
          result <- rhive.aggregate("emp", "SUM", "sal", groups = "deptno")
          rhive.load.table(result)
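For readers who want to see what the SUM-by-group example computes, here is a small local analogue in base R. It is an illustration only: the `emp` rows are made up, and base R's `aggregate` is used in place of `rhive.aggregate`, which would run the aggregation inside Hive instead of in the R session.

```r
# Made-up employee data standing in for the Hive table "emp".
emp <- data.frame(
  deptno = c(10, 10, 20, 20, 30),
  sal    = c(1000, 1500, 800, 1200, 950)
)

# Sum of sal per deptno: the local counterpart of SUM(sal) grouped by deptno.
result <- aggregate(sal ~ deptno, data = emp, FUN = sum)
print(result)
```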
  15. Examples - predict flight delay
          library(RHive)
          rhive.connect()
      - Retrieve training set from large dataset stored in HDFS
          train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")
          train$arrdelay <- as.numeric(train$arrdelay)
          train$distance <- as.numeric(train$distance)
          train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]
      - Fit the model (native R code)
          model <- lm(arrdelay ~ distance + dayofweek, data=train)
      - Export R object data
          rhive.assign("model", model)
      - Analyze big data using model calculated by R (HiveQuery + R code)
          predict_table <- rhive.napply("airlines", function(arg1,arg2,arg3) {
            if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
            res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
            return(as.numeric(res))
          }, 'dayofweek', 'arrdelay', 'distance')
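The same train-then-score flow can be tried without a cluster. The sketch below is a local approximation under stated assumptions: the training data is synthetic (generated with a fixed seed, using the slide's column names), `score` is a hypothetical name for the per-row function, and `mapply` replaces `rhive.napply`, which in the slide's version would evaluate the function over the full airlines table inside Hive.

```r
set.seed(1)

# Synthetic training sample; column names follow the slide, values are invented.
train <- data.frame(
  dayofweek = sample(1:7, 200, replace = TRUE),
  distance  = runif(200, 100, 2500)
)
train$arrdelay <- 5 + 0.01 * train$distance + rnorm(200, sd = 10)

# Fit the model in plain R, as on the slide.
model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Per-row scoring function with the same shape as the one passed to rhive.napply.
score <- function(arg1, arg2, arg3) {
  if (is.na(arg1) || is.na(arg3)) return(0.0)
  newdata <- data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3)
  as.numeric(predict(model, newdata))
}

# Score every row locally; rhive.napply would do this on the cluster instead.
predicted <- mapply(score, train$dayofweek, train$arrdelay, train$distance)
print(head(predicted))
```

Scoring one row at a time is slow in plain R but mirrors how a row-level UDF sees the data, which is the point of the comparison.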
  16. DEMO
  17. Conclusion
      • RHive supports HQL, not the MapReduce programming style
      • RHive allows analysts to do everything in the R console
      • RHive lets R data and HDFS data interact
      • Future & Current Works
         • Integrate Hadoop HDFS
         • Support Transform/Map-Reduce Scripts
         • Distributed Rserve
         • Support more R style API
         • Support machine learning algorithms (k-means, classifier, ...)
  18. Cooperators
      • JunHo Cho
      • Seonghak Hong
      • Choonghyun Ryu
      YOU!
  19. How to join RHive project
      • Logo
      • github (https://github.com/nexr/RHive)
      • CRAN (http://cran.r-project.org/web/packages/RHive)
      • Welcome to join RHive project
  20. References
      • Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)
      • RHIPE (http://ml.stat.purdue.edu/rhipe)
      • Hive (http://hive.apache.org)
      • Parallel R by Q. Ethan McCallum and Stephen Weston
  21. jun.cho@nexr.com
  22. Appendix
  23. RHIPE
      • the R and Hadoop Integrated Processing Environment
      • Must understand the MapReduce model
          map <- function() {...}
          reduce <- function() {...}
          rmr <- rhmr(map,reduce,...)
      Diagram labels: shuffle / sort, Mapper, Reducer, ProtocolBuf, R, Fork, RHMR, R Objects (map), R Objects (reduce), R Objects (map, reduce), PersonalServer, R Conf, HDFS
  24. RHadoop
      • Manipulate Hadoop data stores and HBASE directly from R
      • Write MapReduce models in R using Hadoop Streaming
      • Must understand the MapReduce model
          map <- function() {...}
          reduce <- function() {...}
          mapreduce(input,output,map,reduce,...)
      Diagram labels: R, rhbase, rhdfs, rmr, execute hadoop streaming, Hadoop Streaming, manipulate, shuffle / sort, store R objects as file, Mapper, Reducer, HBASE, HDFS
  25. hive (Hadoop InteractiVE)
      • R extension facilitating distributed computing via the MapReduce paradigm
      • Provides an interface to Hadoop, HDFS and Hadoop Streaming
      • Must understand the MapReduce model
          map <- function() {...}
          reduce <- function() {...}
          hive_stream(map,reduce,...)
      Diagram labels: R, hive, DFS, hive_stream, execute hadoop streaming, Hadoop Streaming, manipulate, shuffle / sort, save R script on local, Mapper, Reducer, HDFS
  26. segue
      • Simple parallel processing with a fast and easy setup on Amazon Web Services (AWS).
      • Parallel lapply function for the EMR engine using Hadoop streaming.
      • Does not support the MapReduce model, only the Map model.
          data <- list(...)
          emrlapply(clusterObject,data,FUN,..)
      Diagram labels: Amazon S3, upload R objects, R, emrlapply, awsFunctions, EMR, Hadoop Streaming, Mapper, save R objects (data + FUN) on local, bootstrap (setup R), mapper.R
  27. Ricardo
      • Integrate R and Jaql (JSON Query Language)
      • Must know how to use an uncommon query language, Jaql
      • Not open-source
      Ref : Ricardo-SIGMOD10