SlideShare a Scribd company logo
1 of 27
Download to read offline
RHive : Integrating R and Hive
                             Introduction



                                                     JunHo Cho
                                           Data Analysis Platform Team




Friday, November 11, 11
Analysis of Data




Friday, November 11, 11
Analysis of Data




                                                            CF      Classifier
                                                                                 Decision Tree
                    MapReduce                      Recommendation
                                                                                Graph
                                                      Clustering




Friday, November 11, 11
Related Works


                    •     RHIPE

                    •     RHadoop
                                                                      duce
                                                                  R e
                    •     hive (Hadoop InteractiVE)
                                                               Map
                                                           and
                    •     seuge
                                                      r st
                                               un de
                                        u st
                                       M



Friday, November 11, 11
RHive is inspired by ...

                    •     Many analysts have been used R for a long time

                    •     Many analysts can use SQL language

                    •     There are already a lot of statistical functions in R

                    •     R needs a capability to analyze big data

                    •     Hive supports SQL-like query language (HQL)

                    •     Hive supports MapReduce to execute HQL




                          R is the best solution for familiarity
                          Hive is the best solution for capability


Friday, November 11, 11
RHive Components


                   •      Hadoop

                          •   store and analyze big data

                   •      Hive

                          •   use HQL instead of MapReduce programming

                   •      R

                          •   support friendly environment to analysts




Friday, November 11, 11
RHive - Architecture
                   Execute R Function Objects and R Objects
                   through Hive Query

                 Execute Hive Query through R
                                                                              rcal <- function(arg1,arg2) {
                                       SELECT R(‘rcal’,col1,col2)                 coeff * sum(arg1,arg2)
                                       from tab1                              }



                                                                                     RServe


                                           01010100101   01010100101   01010100101
                                           01010010101   01010010101   01010010101
                                           01001010101   01001010101   01001010101
                                           10101000111   10101000111   10101000111




                                                      R Function          R Object         RUDF           RUDAF




Friday, November 11, 11
RHive API

                •         Extension R Functions
                      •     rhive.connect       •   rhive.napply        •   rhive.load.table
                      •     rhive.query         •   rhive.sapply        •   rhive.desc.table
                      •     rhive.assign        •   rhive.aggregate
                      •     rhive.export        •   rhive.list.tables




                •         Extension Hive Functions
                      •     RUDF

                      •     RUDAF

                      •     GenericUDTFExpand

                      •     GenericUDTFUnFold


Friday, November 11, 11
RUDF - R User-defined Functions

                   SELECT R(‘R function-name’,col1,col2,...,TYPE)
                    •     UDF doesn’t know return type until calling R function

                          •    TYPE : return type

                Example : R function which sums all passed columns


                sumCols <- function(arg1,...) {
                   sum(arg1,...)
                }
                rhive.assign(‘sumCols’,sumCols)
                rhive.exportAll(‘sumCols’,hadoop-clusters)
                result <- rhive.query(“SELECT R(‘sumCols’, col1, col2, col3, col4, 0.0) FROM tab”)
                plot(result)



Friday, November 11, 11
RUDAF - R User-defined Aggregation Function
                            SELECT RA(‘R function-name’,col1,col2,...)
                    •        R can not manipulate large dataset

                    •        Support UDAF’s life cycle

                           •    iterate, partial-merge, merge, terminate

                    •        Return type is only string delimited by ‘,’ - “data1,data2,data3,...”

                          partial aggregation                                                                  partial aggregation
                                                                 aggregation values



                    FUN                FUN.partial                                                   FUN.merge           FUN.terminate

                 01010100101   01010100101   01010100101   01010100101   01010100101   01010100101   01010100101
                 01010010101   01010010101   01010010101   01010010101   01010010101   01010010101   01010010101
                 01001010101   01001010101   01001010101   01001010101   01001010101   01001010101   01001010101
                 10101000111   10101000111   10101000111   10101000111   10101000111   10101000111   10101000111




Friday, November 11, 11
UDTF : unfold and expand
                    •     RUDAF only returns string delimited by ‘,’

                    •     Convert RUDAF’s result to R data.frame



                   unfold(string_value,type1,type2,...,delimiter)
                   expand(string_value,type,delimiter)

                  RA(‘newcenter’,...) return “num1,num2,num3” per cluster-key

                  select unfold(tb1.x,0.0,0.0,0.0,’,’) as (col1,col2,col3) from (select RA(‘newcenter’,
                  attr1,attr2,attr3,attr4) as x from table group by cluster-key




Friday, November 11, 11
napply and sapply

                   rhive.napply(table-name,FUN,col1,...)
                   rhive.sapply(table-name,FUN,col1,...)
                  •       napply : R apply function for Numeric type

                  •       sapply : R apply function for String type



                      Example : R function which sums all passed columns

                      sumCols <- function(arg1,...) {
                         sum(arg1,...)
                      }
                      result <- rhive.napply(“tab”, sumCols, col1, col2, col3, col4)
                      rhive.load.table(result)




Friday, November 11, 11
napply

         •       ‘napply’ is similar to R apply function

         •       Store big result to HDFS as Hive table


   rhive.napply       <- function(tablename, FUN, col = NULL, ...) {
         if(is.null(col))
              cols <- ""
         else
              cols <- paste(",",col)

         for(element in c(...)) {
              cols <- paste(cols,",",element)
         }

         exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="")

   !     rhive.assign(exportname,FUN)
   !     rhive.exportAll(exportname)
         tmptable <- paste(exportname,”_table”)
   !     rhive.query(
                   paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep=""))

   !     tmptable
   }




Friday, November 11, 11
aggregate

                 rhive.aggregate(table-name,hive-FUN,...,goups)

            •      RHive aggregation function to aggregate data stored in HDFS using HIVE Function




                  Example : Aggregate using SUM (Hive aggregation function)

                  result <- rhive.aggregate(“emp”, “SUM”, sal,groups=”deptno”)
                  rhive.load.table(result)




Friday, November 11, 11
Examples - predict flight delay
    library(RHive)
    rhive.connect()

    - Retrieve training set from large dataset stored in HDFS
    train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())

    train$arrdelay <- as.numeric(train$arrdelay)

    train$distance <- as.numeric(train$distance)

    train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]                   Native R code
    model <- lm(arrdelay ~ distance + dayofweek,data=train)

    - Export R object data
    rhive.assign("model", model)

    - Analyze big data using model calculated by R
    predict_table <- rhive.napply(“airlines”,function(arg1,arg2,arg3) {

            if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)                HiveQuery + R code
            res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))

            return(as.numeric(res)) }, ‘dayofweek’, ‘arrdelay’, ‘distance’)


Friday, November 11, 11
DEMO

Friday, November 11, 11
Conclusion

                    •     RHive supports HQL, not MapReduce model style

                    •     RHive allows analytics to do everything in R console

                    •     RHive interacts R data and HDFS data



                    •     Future & Current Works
                          •   Integrate Hadoop HDFS

                          •   Support Transform/Map-Reduce Scripts

                          •   Distributed Rserve

                          •   Support more R style API

                          •   Support machine learning algorithms (k-means, classifier, ...)


Friday, November 11, 11
Cooperators


                   •      JunHo Cho

                   •      Seonghak Hong

                   •      Choonghyun Ryu




                                      YO U !
Friday, November 11, 11
How to join RHive project


               •          Logo




               •          github (https://github.com/nexr/RHive)

               •          CRAN (http://cran.r-project.org/web/packages/RHive)

               •          Welcome to join RHive project




Friday, November 11, 11
References



              •       Recardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)

              •       RHIPE (http://ml.stat.purdue.edu/rhipe)

              •       Hive (http://hive.apache.org)

              •       Parallels R by Q. Ethan McCallum and Stephen Weston




Friday, November 11, 11
jun.cho@nexr.com




Friday, November 11, 11
Appendix




Friday, November 11, 11
RHIPE

                •         the R and Hadoop Integrated Processing Environment

                •         Must understand the MapReduce model

            map <- function() {...}                                             shuffle / sort
            reduce <- function() {...}                                Mapper                    Reducer
            rmr <- rhmr(map,reduce,...)
                                                                                ProtocolBuf
                            R
                     Fork                                                R                        R
                          RHMR
                                                           R Objects (map)       R Objects (reduce)

                                R Objects (map, reduce)                        PersonalServer
                                R Conf


                                                           HDFS

Friday, November 11, 11
RHadoop
                  •       Manipulate Hadoop data stores and HBASE directly from R
                  •       Write MapReduce models in R using Hadoop Streaming
                  •       Must understand the MapReduce model


           map <- function() {...}
           reduce <- function() {...}
           mapreduce(input,output,map,reduce,...)


                              R
              rhbase           rhdfs          rmr
                                                           execute hadoop
                                                           streaming
                                                                              R                         R
                                                                                     Hadoop Streaming

                      manipulate                                                      shuffle / sort
                                       store R objecs as file                Mapper                    Reducer


                                   HBASE                                             HDFS

Friday, November 11, 11
hive(Hadoop InteractiVE)
                •         R extension facilitating distributed computing via the MapReduce
                          paradigm
                •         Provide an interface to Hadoop, HDFS and Hadoop Streaming
                •         Must understand the MapReduce model

           map <- function() {...}
           reduce <- function() {...}
           hive_stream(map,reduce,...)


                             R
                                                          execute hadoop     R                         R
               hive        DFS     hive_stream
                                                          streaming
                                                                                    Hadoop Streaming

           manipulate                                                                shuffle / sort
                                 save R script on local                    Mapper                    Reducer


                                                                 HDFS

Friday, November 11, 11
seuge

            •       Simple parallel processing with a fast and easy setup on Amazon’s WS.
            •       Parallel lapply function for the EMR engine using Hadoop streaming.
            •       Does not support MapReduce model but only Map model.



           data <- list(...)
           emrlapply(clusterObject,data,FUN,..)                               Amazon S3
                                                  upload R objects
                                  R

                          emrlapply   awsFunctions                     EMR              Hadoop Streaming
                                                                                                           R



                                                                                                     Mapper
                  save R objects (data + FUN) on local
                                                                       bootstrap (setup R)
                                                                       mapper.R




Friday, November 11, 11
Ricardo
                    •     Integrate R and Jaql (JSON Query Language)
                    •     Must know how to use uncommon query, Jaql
                    •     Not open-source




                                                                       Ref : Ricardo-SIGMOD10



Friday, November 11, 11

More Related Content

What's hot

Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & StorageIlayaraja P
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture Rupak Roy
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
Ecossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaEcossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaNelson Forte
 

What's hot (20)

Unit 2
Unit 2Unit 2
Unit 2
 
Apache pig
Apache pigApache pig
Apache pig
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Enabling R on Hadoop
Enabling R on HadoopEnabling R on Hadoop
Enabling R on Hadoop
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & Storage
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Ecossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaEcossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine Luiza
 

Similar to Integrate Hive and R

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)Revolution Analytics
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processingBryan Downing
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Implementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeImplementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeKato Takashi
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコードShinpei Ohtani
 
サンプルから見るMapReduceコード
サンプルから見るMapReduceコードサンプルから見るMapReduceコード
サンプルから見るMapReduceコードShinpei Ohtani
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionClaudio Martella
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and RRadek Maciaszek
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahoutbigdatasyd
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 

Similar to Integrate Hive and R (20)

Rhive 0.0 3
Rhive 0.0 3Rhive 0.0 3
Rhive 0.0 3
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
R and-hadoop
R and-hadoopR and-hadoop
R and-hadoop
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processing
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Implementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeImplementing R7RS on R6RS Scheme
Implementing R7RS on R6RS Scheme
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコード
 
サンプルから見るMapReduceコード
サンプルから見るMapReduceコードサンプルから見るMapReduceコード
サンプルから見るMapReduceコード
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
 
Special topics in finance lecture 2
Special topics in finance   lecture 2Special topics in finance   lecture 2
Special topics in finance lecture 2
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahout
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
R - the language
R - the languageR - the language
R - the language
 

Recently uploaded

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Integrate Hive and R

  • 1. RHive : Integrating R and Hive Introduction JunHo Cho Data Analysis Platform Team Friday, November 11, 11
  • 2. Analysis of Data Friday, November 11, 11
  • 3. Analysis of Data CF Classifier Decision Tree MapReduce Recommendation Graph Clustering Friday, November 11, 11
  • 4. Related Works • RHIPE • RHadoop duce R e • hive (Hadoop InteractiVE) Map and • seuge r st un de u st M Friday, November 11, 11
  • 5. RHive is inspired by ... • Many analysts have been used R for a long time • Many analysts can use SQL language • There are already a lot of statistical functions in R • R needs a capability to analyze big data • Hive supports SQL-like query language (HQL) • Hive supports MapReduce to execute HQL R is the best solution for familiarity Hive is the best solution for capability Friday, November 11, 11
  • 6. RHive Components • Hadoop • store and analyze big data • Hive • use HQL instead of MapReduce programming • R • support friendly environment to analysts Friday, November 11, 11
  • 7. RHive - Architecture Execute R Function Objects and R Objects through Hive Query Execute Hive Query through R rcal <- function(arg1,arg2) { SELECT R(‘rcal’,col1,col2) coeff * sum(arg1,arg2) from tab1 } RServe 01010100101 01010100101 01010100101 01010010101 01010010101 01010010101 01001010101 01001010101 01001010101 10101000111 10101000111 10101000111 R Function R Object RUDF RUDAF Friday, November 11, 11
  • 8. RHive API • Extension R Functions • rhive.connect • rhive.napply • rhive.load.table • rhive.query • rhive.sapply • rhive.desc.table • rhive.assign • rhive.aggregate • rhive.export • rhive.list.tables • Extension Hive Functions • RUDF • RUDAF • GenericUDTFExpand • GenericUDTFUnFold Friday, November 11, 11
  • 9. RUDF - R User-defined Functions SELECT R(‘R function-name’,col1,col2,...,TYPE) • UDF doesn’t know return type until calling R function • TYPE : return type Example : R function which sums all passed columns sumCols <- function(arg1,...) { sum(arg1,...) } rhive.assign(‘sumCols’,sumCols) rhive.exportAll(‘sumCols’,hadoop-clusters) result <- rhive.query(“SELECT R(‘sumCols’, col1, col2, col3, col4, 0.0) FROM tab”) plot(result) Friday, November 11, 11
  • 10. RUDAF - R User-defined Aggregation Function SELECT RA(‘R function-name’,col1,col2,...) • R can not manipulate large dataset • Support UDAF’s life cycle • iterate, partial-merge, merge, terminate • Return type is only string delimited by ‘,’ - “data1,data2,data3,...” partial aggregation partial aggregation aggregation values FUN FUN.partial FUN.merge FUN.terminate 01010100101 01010100101 01010100101 01010100101 01010100101 01010100101 01010100101 01010010101 01010010101 01010010101 01010010101 01010010101 01010010101 01010010101 01001010101 01001010101 01001010101 01001010101 01001010101 01001010101 01001010101 10101000111 10101000111 10101000111 10101000111 10101000111 10101000111 10101000111 Friday, November 11, 11
  • 11. UDTF : unfold and expand • RUDAF only returns string delimited by ‘,’ • Convert RUDAF’s result to R data.frame unfold(string_value,type1,type2,...,delimiter) expand(string_value,type,delimiter) RA(‘newcenter’,...) return “num1,num2,num3” per cluster-key select unfold(tb1.x,0.0,0.0,0.0,’,’) as (col1,col2,col3) from (select RA(‘newcenter’, attr1,attr2,attr3,attr4) as x from table group by cluster-key Friday, November 11, 11
  • 12. napply and sapply rhive.napply(table-name,FUN,col1,...) rhive.sapply(table-name,FUN,col1,...) • napply : R apply function for Numeric type • sapply : R apply function for String type Example : R function which sums all passed columns sumCols <- function(arg1,...) { sum(arg1,...) } result <- rhive.napply(“tab”, sumCols, col1, col2, col3, col4) rhive.load.table(result) Friday, November 11, 11
  • 13. napply • ‘napply’ is similar to R apply function • Store big result to HDFS as Hive table rhive.napply <- function(tablename, FUN, col = NULL, ...) { if(is.null(col)) cols <- "" else cols <- paste(",",col) for(element in c(...)) { cols <- paste(cols,",",element) } exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="") ! rhive.assign(exportname,FUN) ! rhive.exportAll(exportname) tmptable <- paste(exportname,”_table”) ! rhive.query( paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep="")) ! tmptable } Friday, November 11, 11
  • 14. aggregate rhive.aggregate(table-name,hive-FUN,...,goups) • RHive aggregation function to aggregate data stored in HDFS using HIVE Function Example : Aggregate using SUM (Hive aggregation function) result <- rhive.aggregate(“emp”, “SUM”, sal,groups=”deptno”) rhive.load.table(result) Friday, November 11, 11
  • 15. Examples - predict flight delay library(RHive) rhive.connect() - Retrieve training set from large dataset stored in HDFS train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) train$arrdelay <- as.numeric(train$arrdelay) train$distance <- as.numeric(train$distance) train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),] Native R code model <- lm(arrdelay ~ distance + dayofweek,data=train) - Export R object data rhive.assign("model", model) - Analyze big data using model calculated by R predict_table <- rhive.napply(“airlines”,function(arg1,arg2,arg3) { if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0) HiveQuery + R code res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3)) return(as.numeric(res)) }, ‘dayofweek’, ‘arrdelay’, ‘distance’) Friday, November 11, 11
  • 17. Conclusion • RHive supports HQL, not MapReduce model style • RHive allows analytics to do everything in R console • RHive interacts R data and HDFS data • Future & Current Works • Integrate Hadoop HDFS • Support Transform/Map-Reduce Scripts • Distributed Rserve • Support more R style API • Support machine learning algorithms (k-means, classifier, ...) Friday, November 11, 11
  • 18. Cooperators • JunHo Cho • Seonghak Hong • Choonghyun Ryu YO U ! Friday, November 11, 11
  • 19. How to join RHive project • Logo • github (https://github.com/nexr/RHive) • CRAN (http://cran.r-project.org/web/packages/RHive) • Welcome to join RHive project Friday, November 11, 11
  • 20. References • Recardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf) • RHIPE (http://ml.stat.purdue.edu/rhipe) • Hive (http://hive.apache.org) • Parallels R by Q. Ethan McCallum and Stephen Weston Friday, November 11, 11
  • 23. RHIPE • the R and Hadoop Integrated Processing Environment • Must understand the MapReduce model map <- function() {...} shuffle / sort reduce <- function() {...} Mapper Reducer rmr <- rhmr(map,reduce,...) ProtocolBuf R Fork R R RHMR R Objects (map) R Objects (reduce) R Objects (map, reduce) PersonalServer R Conf HDFS Friday, November 11, 11
  • 24. RHadoop • Manipulate Hadoop data stores and HBASE directly from R • Write MapReduce models in R using Hadoop Streaming • Must understand the MapReduce model map <- function() {...} reduce <- function() {...} mapreduce(input,output,map,reduce,...) R rhbase rhdfs rmr execute hadoop streaming R R Hadoop Streaming manipulate shuffle / sort store R objecs as file Mapper Reducer HBASE HDFS Friday, November 11, 11
  • 25. hive(Hadoop InteractiVE) • R extension facilitating distributed computing via the MapReduce paradigm • Provide an interface to Hadoop, HDFS and Hadoop Streaming • Must understand the MapReduce model map <- function() {...} reduce <- function() {...} hive_stream(map,reduce,...) R execute hadoop R R hive DFS hive_stream streaming Hadoop Streaming manipulate shuffle / sort save R script on local Mapper Reducer HDFS Friday, November 11, 11
  • 26. seuge • Simple parallel processing with a fast and easy setup on Amazon’s WS. • Parallel lapply function for the EMR engine using Hadoop streaming. • Does not support MapReduce model but only Map model. data <- list(...) emrlapply(clusterObject,data,FUN,..) Amazon S3 upload R objects R emrlapply awsFunctions EMR Hadoop Streaming R Mapper save R objects (data + FUN) on local bootstrap (setup R) mapper.R Friday, November 11, 11
  • 27. Ricardo • Integrate R and Jaql (JSON Query Language) • Must know how to use uncommon query, Jaql • Not open-source Ref : Ricardo-SIGMOD10 Friday, November 11, 11