RHive : Integrating R and Hive
                             Introduction



                                                     JunHo Cho
                                           Data Analysis Platform Team




Friday, November 11, 11
Analysis of Data




                    [Figure: MapReduce alongside analysis techniques: CF, Classifier, Decision Tree, Recommendation, Graph, Clustering]




Related Works


                    •     RHIPE

                    •     RHadoop

                    •     hive (Hadoop InteractiVE)

                    •     segue

                    Must understand MapReduce



RHive is inspired by ...

                    •     Many analysts have used R for a long time

                    •     Many analysts already know SQL

                    •     There are already a lot of statistical functions in R

                    •     R needs a capability to analyze big data

                    •     Hive supports SQL-like query language (HQL)

                    •     Hive supports MapReduce to execute HQL




                          R is the best solution for familiarity
                          Hive is the best solution for capability


RHive Components


                   •      Hadoop

                          •   store and analyze big data

                   •      Hive

                          •   use HQL instead of MapReduce programming

                   •      R

                          •   provides a familiar environment for analysts




RHive - Architecture
                   Execute R Function Objects and R Objects
                   through Hive Query

                   Execute Hive Query through R

                   rcal <- function(arg1,arg2) {
                       coeff * sum(arg1,arg2)
                   }

                   SELECT R('rcal',col1,col2)
                   FROM tab1

                   [Figure: architecture diagram. Labels: Rserve; data blocks; R Function, R Object, RUDF, RUDAF]




RHive API

                •         Extension R Functions
                      •     rhive.connect       •   rhive.napply        •   rhive.load.table
                      •     rhive.query         •   rhive.sapply        •   rhive.desc.table
                      •     rhive.assign        •   rhive.aggregate
                      •     rhive.export        •   rhive.list.tables




                •         Extension Hive Functions
                      •     RUDF

                      •     RUDAF

                      •     GenericUDTFExpand

                      •     GenericUDTFUnFold


RUDF - R User-defined Functions

                   SELECT R('R function-name',col1,col2,...,TYPE)
                    •     The UDF does not know the return type until it calls the R function

                          •    TYPE : return type (the 0.0 in the example below marks a numeric result)

                Example : R function which sums all passed columns


                sumCols <- function(arg1,...) {
                   sum(arg1,...)
                }
                rhive.assign('sumCols',sumCols)
                # hadoop-clusters is a placeholder for the cluster hosts
                rhive.exportAll('sumCols',hadoop-clusters)
                result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
                plot(result)
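
As a rough local illustration of what the R() call computes for each row, the sketch below applies sumCols to a small in-memory data frame with base R's mapply. The data frame `tab` and its values are made up for illustration, and no Hive or RHive call is involved.

```r
# Local stand-in for: SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab
# `tab` is hypothetical sample data, not part of the original slides.
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

tab <- data.frame(col1 = c(1, 2), col2 = c(10, 20),
                  col3 = c(100, 200), col4 = c(1000, 2000))

# The UDF is evaluated once per row; mapply mimics that row-at-a-time call.
result <- as.numeric(mapply(sumCols, tab$col1, tab$col2, tab$col3, tab$col4))
print(result)
```

Each element of `result` corresponds to one row of `tab`, which is the shape a row-wise UDF would return.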



RUDAF - R User-defined Aggregation Function
                            SELECT RA('R function-name',col1,col2,...)
                    •        R cannot manipulate large datasets

                    •        Supports the UDAF life cycle

                           •    iterate, partial-merge, merge, terminate

                    •        The return type is always a string delimited by ',' : "data1,data2,data3,..."

                    [Figure: UDAF life cycle over data blocks. Labels: FUN, FUN.partial (partial aggregation); aggregation values; FUN.merge, FUN.terminate (partial aggregation)]
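
The life cycle above can be walked through in plain R. The sketch below is only an in-memory simulation: the function signatures, the `avg` name, and the sample blocks are assumptions made for illustration (the slides name the four stages but do not show their arguments), and nothing here calls Hive.

```r
# Hypothetical signatures; state is a numeric vector c(sum, count).
avg <- function(state, value) {
  # iterate: fold one row into the running state
  c(state[1] + value, state[2] + 1)
}
avg.partial <- function(state) {
  # partial-merge: emit the partial aggregation of one data block
  state
}
avg.merge <- function(a, b) {
  # merge: combine two partial aggregations
  a + b
}
avg.terminate <- function(state) {
  # terminate: final result as a ','-delimited string (mean, count)
  paste(state[1] / state[2], state[2], sep = ",")
}

# Two data blocks are aggregated independently, then merged.
blocks <- list(c(1, 2, 3), c(4, 5))
partials <- lapply(blocks, function(block) {
  avg.partial(Reduce(avg, block, c(0, 0)))
})
merged <- Reduce(avg.merge, partials)
result <- avg.terminate(merged)
print(result)
```

Keeping the state as (sum, count) rather than a running mean is what makes the merge step a simple addition.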




UDTF : unfold and expand
                    •     RUDAF only returns a string delimited by ','

                    •     unfold and expand convert RUDAF's result so it can be loaded as an R data.frame



                   unfold(string_value,type1,type2,...,delimiter)
                   expand(string_value,type,delimiter)

                  RA('newcenter',...) returns "num1,num2,num3" per cluster-key

                  select unfold(tb1.x,0.0,0.0,0.0,',') as (col1,col2,col3) from (select RA('newcenter',
                  attr1,attr2,attr3,attr4) as x from table group by cluster-key) tb1
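
To make the conversion concrete, here is a minimal local emulation in base R. The helper names `unfold_local` and `expand_local` and the sample strings are hypothetical, and the reading of expand as "one row per field" is an assumption, since the slides give only its signature.

```r
# Local emulation of unfold(): each ','-delimited string becomes one row
# with one numeric column per field.
unfold_local <- function(values, col_names, delimiter = ",") {
  fields <- strsplit(values, delimiter, fixed = TRUE)
  rows <- lapply(fields, as.numeric)
  out <- as.data.frame(do.call(rbind, rows))
  names(out) <- col_names
  out
}

# Assumed reading of expand(): one string becomes one row per field.
expand_local <- function(value, delimiter = ",") {
  data.frame(value = as.numeric(strsplit(value, delimiter, fixed = TRUE)[[1]]))
}

# Illustrative RUDAF output: one string per cluster-key.
centers <- unfold_local(c("1.5,2.0,3.5", "4.0,5.5,6.0"),
                        c("col1", "col2", "col3"))
print(centers)
print(expand_local("1.5,2.0,3.5"))
```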




napply and sapply

                   rhive.napply(table-name,FUN,col1,...)
                   rhive.sapply(table-name,FUN,col1,...)
                  •       napply : applies an R function that returns a Numeric value

                  •       sapply : applies an R function that returns a String value



                      Example : R function which sums all passed columns

                      sumCols <- function(arg1,...) {
                         sum(arg1,...)
                      }
                      result <- rhive.napply("tab", sumCols, "col1", "col2", "col3", "col4")
                      rhive.load.table(result)




napply

         •       'napply' is similar to the R apply function

         •       Stores a big result in HDFS as a Hive table


   rhive.napply <- function(tablename, FUN, col = NULL, ...) {
         # Build the ",col1,col2,..." argument list for the R() UDF call
         if(is.null(col))
              cols <- ""
         else
              cols <- paste(",",col)

         for(element in c(...)) {
              cols <- paste(cols,",",element)
         }

         # Timestamped name under which FUN is exported
         exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="")

         rhive.assign(exportname,FUN)
         rhive.exportAll(exportname)
         # sep="" keeps the table name free of spaces
         tmptable <- paste(exportname,"_table",sep="")
         rhive.query(
                   paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep=""))

         # Return the name of the Hive table that holds the result
         tmptable
   }
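
To see what query the function above sends to Hive, the string-building step can be isolated. The helper `build_napply_query` and the export name `tab_export1` below are hypothetical and exist only for this sketch. It mirrors the paste logic but does not contact Hive or Rserve.

```r
# Hypothetical helper mirroring only the string-building part of rhive.napply.
build_napply_query <- function(tablename, exportname, cols) {
  tmptable <- paste(exportname, "_table", sep = "")
  col_list <- paste(cols, collapse = ",")
  # The trailing 0.0 is the TYPE argument: it marks a numeric return value.
  paste("CREATE TABLE ", tmptable, " AS SELECT R('", exportname, "',",
        col_list, ",0.0) FROM ", tablename, sep = "")
}

query <- build_napply_query("tab", "tab_export1", c("col1", "col2"))
print(query)
```

Printing the query before running it is a cheap way to check column names and quoting.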




aggregate

                 rhive.aggregate(table-name,hive-FUN,...,groups)

            •      RHive aggregation function that aggregates data stored in HDFS using a Hive function




                  Example : Aggregate using SUM (Hive aggregation function)

                  result <- rhive.aggregate("emp", "SUM", "sal", groups="deptno")
                  rhive.load.table(result)
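
For intuition, the same group-and-sum can be reproduced locally on a small data frame with base R's aggregate(). The `emp` rows below are invented sample data. rhive.aggregate itself runs the aggregation in Hive, and the exact query it generates is not shown in the slides.

```r
# Invented sample of the emp table (department number and salary).
emp <- data.frame(deptno = c(10, 10, 20, 20, 20),
                  sal    = c(1000, 1500, 800, 1200, 1000))

# Local equivalent of summing sal per deptno.
local_result <- aggregate(sal ~ deptno, data = emp, FUN = sum)
print(local_result)
```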




Examples - predict flight delay
    library(RHive)
    rhive.connect()

    # Retrieve training set from large dataset stored in HDFS
    train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")

    # Native R code: clean the sample and fit a linear model
    train$arrdelay <- as.numeric(train$arrdelay)
    train$distance <- as.numeric(train$distance)
    train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]
    model <- lm(arrdelay ~ distance + dayofweek,data=train)

    # Export R object data
    rhive.assign("model", model)

    # Analyze big data using model calculated by R (HiveQuery + R code)
    predict_table <- rhive.napply("airlines",function(arg1,arg2,arg3) {
            if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
            res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
            return(as.numeric(res)) }, 'dayofweek', 'arrdelay', 'distance')
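
Before sending the function to the cluster, its per-row behaviour can be checked locally. The sketch below fits the same kind of model on a tiny synthetic training set (all values are made up, and the function name `predict_row` is hypothetical) and calls the row function once. It uses no RHive calls.

```r
# Synthetic stand-in for the sampled training set (values are made up).
train <- data.frame(dayofweek = c(1, 2, 3, 4, 5, 6, 7, 1),
                    distance  = c(100, 300, 500, 200, 800, 400, 600, 700))
train$arrdelay <- 2 + 0.01 * train$distance + 0.5 * train$dayofweek

model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Same shape as the function passed to rhive.napply: one row in, one number out.
predict_row <- function(arg1, arg2, arg3) {
  if (is.null(arg1) || is.null(arg2) || is.null(arg3)) return(0.0)
  res <- predict(model, data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3))
  as.numeric(res)
}

pred <- predict_row(3, 0, 1000)
print(pred)
```

Because the synthetic delays are an exact linear function of the inputs, the prediction should match 2 + 0.01 * 1000 + 0.5 * 3 up to floating-point error.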


DEMO

Conclusion

                    •     RHive supports HQL, not the MapReduce programming style

                    •     RHive allows analysts to do everything in the R console

                    •     RHive lets R data and HDFS data interact



                    •     Current & Future Work
                          •   Integrate Hadoop HDFS

                          •   Support Transform/Map-Reduce Scripts

                          •   Distributed Rserve

                          •   Support more R style API

                          •   Support machine learning algorithms (k-means, classifier, ...)


Cooperators


                   •      JunHo Cho

                   •      Seonghak Hong

                   •      Choonghyun Ryu




                                      YOU!
How to join RHive project


               •          Logo




               •          github (https://github.com/nexr/RHive)

               •          CRAN (http://cran.r-project.org/web/packages/RHive)

               •          You are welcome to join the RHive project




References



              •       Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)

              •       RHIPE (http://ml.stat.purdue.edu/rhipe)

              •       Hive (http://hive.apache.org)

              •       Parallel R by Q. Ethan McCallum and Stephen Weston




jun.cho@nexr.com




Appendix




RHIPE

                •         The R and Hadoop Integrated Processing Environment

                •         Must understand the MapReduce model

            map <- function() {...}
            reduce <- function() {...}
            rmr <- rhmr(map,reduce,...)

            [Figure: RHIPE architecture. Labels: R, Fork, RHMR, R Objects (map, reduce), R Conf; Mapper and Reducer, each running R with R Objects (map) / R Objects (reduce); shuffle / sort; ProtocolBuf; PersonalServer; HDFS]
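
What "understanding the MapReduce model" involves can be shown with a small in-memory walk-through in plain R. The word-count task, the input lines, and the function bodies are illustrative assumptions. This is not RHIPE code and no Hadoop is involved.

```r
# Map: one input record in, a list of key/value pairs out.
map <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1))
}
# Reduce: one key plus all of its values in, one result out.
reduce <- function(key, values) {
  list(key = key, value = sum(values))
}

input <- c("r hive r", "hive hadoop")

# Map phase: flatten the per-record outputs into one list of pairs.
pairs <- unlist(lapply(input, map), recursive = FALSE)

# Shuffle / sort: group values by key, with keys in sorted order.
keys <- vapply(pairs, function(p) p$key, character(1))
vals <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(vals, keys)

# Reduce phase: one call per key.
counts <- mapply(reduce, names(grouped), grouped, SIMPLIFY = FALSE)
result <- vapply(counts, function(r) r$value, numeric(1))
print(result)
```

The analyst has to design both functions and the key that links them, which is the burden RHive avoids by expressing the job as a query.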

RHadoop
                  •       Manipulate Hadoop data stores and HBase directly from R
                  •       Write MapReduce models in R using Hadoop Streaming
                  •       Must understand the MapReduce model


           map <- function() {...}
           reduce <- function() {...}
           mapreduce(input,output,map,reduce,...)

           [Figure: RHadoop architecture. Labels: R (rhbase, rhdfs, rmr); manipulate; store R objects as file; execute hadoop streaming; Hadoop Streaming with Mapper and Reducer, each running R; shuffle / sort; HBase; HDFS]

hive(Hadoop InteractiVE)
                •         R extension facilitating distributed computing via the MapReduce paradigm
                •         Provides an interface to Hadoop, HDFS and Hadoop Streaming
                •         Must understand the MapReduce model

           map <- function() {...}
           reduce <- function() {...}
           hive_stream(map,reduce,...)

           [Figure: hive architecture. Labels: R (hive, DFS, hive_stream); manipulate; save R script on local; execute hadoop streaming; Hadoop Streaming with Mapper and Reducer, each running R; shuffle / sort; HDFS]

segue

            •       Simple parallel processing with a fast and easy setup on Amazon Web Services.
            •       Parallel lapply function for the EMR engine using Hadoop Streaming.
            •       Supports only the Map step, not the full MapReduce model.



           data <- list(...)
           emrlapply(clusterObject,data,FUN,...)

           [Figure: segue architecture. Labels: R (emrlapply, awsFunctions); save R objects (data + FUN) on local; upload R objects; Amazon S3; EMR; bootstrap (setup R); mapper.R; Hadoop Streaming with Mapper running R]




Ricardo
                    •     Integrates R and Jaql (JSON Query Language)
                    •     Must learn Jaql, an uncommon query language
                    •     Not open-source




                                                                       Ref : Ricardo-SIGMOD10



Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Integrate Hive and R

  • 1. RHive : Integrating R and Hive Introduction JunHo Cho Data Analysis Platform Team Friday, November 11, 11
  • 2. Analysis of Data
  • 3. Analysis of Data (diagram labels: MapReduce, CF, Classifier, Decision Tree, Recommendation, Graph, Clustering)
  • 4. Related Works
      • RHIPE
      • RHadoop
      • hive (Hadoop InteractiVE)
      • segue
      Callout: Must understand MapReduce
  • 5. RHive is inspired by ...
      • Many analysts have used R for a long time
      • Many analysts can use SQL
      • R already has a lot of statistical functions
      • R needs a capability to analyze big data
      • Hive supports a SQL-like query language (HQL)
      • Hive uses MapReduce to execute HQL
      R is the best solution for familiarity; Hive is the best solution for capability
  • 6. RHive Components
      • Hadoop: store and analyze big data
      • Hive: use HQL instead of MapReduce programming
      • R: provide a friendly environment for analysts
  • 7. RHive - Architecture
      • Execute R function objects and R objects through a Hive query
      • Execute a Hive query through R
      R function: rcal <- function(arg1,arg2) { coeff * sum(arg1,arg2) }
      Hive query: SELECT R('rcal',col1,col2) from tab1
      (diagram labels: RServe, R Function, R Object, RUDF, RUDAF)
  • 8. RHive API
      • Extension R Functions
          • rhive.connect
          • rhive.query
          • rhive.assign
          • rhive.export
          • rhive.napply
          • rhive.sapply
          • rhive.aggregate
          • rhive.list.tables
          • rhive.load.table
          • rhive.desc.table
      • Extension Hive Functions
          • RUDF
          • RUDAF
          • GenericUDTFExpand
          • GenericUDTFUnFold
  • 9. RUDF - R User-defined Functions
      SELECT R('R function-name',col1,col2,...,TYPE)
      • The UDF doesn't know the return type until it calls the R function
      • TYPE : return type
      Example : R function which sums all passed columns
      sumCols <- function(arg1,...) { sum(arg1,...) }
      rhive.assign('sumCols',sumCols)
      rhive.exportAll('sumCols',hadoop-clusters)
      result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
      plot(result)
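As a rough local illustration of slide 9 (this is not RHive itself): conceptually, the R function is called once per row with that row's column values. The sketch below mimics that in base R. The data frame `tab` and its values are made up for the example, and no Hive connection is involved.

```r
# Hypothetical stand-in for the Hive table `tab`; the values are made up.
tab <- data.frame(col1 = c(1, 2), col2 = c(10, 20),
                  col3 = c(100, 200), col4 = c(1000, 2000))

# Same function as on the slide.
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

# Call sumCols once per row, the way a per-row UDF call would.
result <- mapply(sumCols, tab$col1, tab$col2, tab$col3, tab$col4)
print(result)
```

Each element of `result` corresponds to one input row, which is why the slide can pass the query result straight to `plot`.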
  • 10. RUDAF - R User-defined Aggregation Function
      SELECT RA('R function-name',col1,col2,...)
      • R cannot manipulate large datasets
      • Supports the UDAF life cycle: iterate, partial-merge, merge, terminate
      • The return type is only a string delimited by ',' - "data1,data2,data3,..."
      (diagram: FUN, FUN.partial, FUN.merge, FUN.terminate; stages labeled partial aggregation, partial aggregation, aggregation, values)
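The life cycle on slide 10 can be sketched locally in base R. This is a minimal simulation under assumptions, not RHive's implementation: the function names only follow the slide's FUN / FUN.partial / FUN.merge / FUN.terminate naming, the state layout `c(sum, count)` is chosen for the example, and the two "tasks" are plain vectors.

```r
# Aggregation state is c(sum, count). All names here are illustrative.
avg <- function(state, value) {           # iterate: fold one row into the state
  c(state[1] + value, state[2] + 1)
}
avg.partial <- function(state) state      # hand over one task's partial state
avg.merge <- function(state, partial) {   # combine two partial states
  state + partial
}
avg.terminate <- function(state) {        # final result as a ','-delimited string
  paste(state[1] / state[2], state[2], sep = ",")
}

# Simulate two tasks, each seeing only a slice of the data.
slices <- list(c(1, 2, 3), c(4, 5))
partials <- lapply(slices, function(rows) {
  avg.partial(Reduce(avg, rows, c(0, 0)))
})
final <- avg.terminate(Reduce(avg.merge, partials))
print(final)
```

Because no single call ever sees the whole dataset, only small states cross task boundaries, which is how the aggregation sidesteps R's difficulty with large data.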
  • 11. UDTF : unfold and expand
      • RUDAF only returns a string delimited by ','
      • Convert the RUDAF result to an R data.frame
      unfold(string_value,type1,type2,...,delimiter)
      expand(string_value,type,delimiter)
      RA('newcenter',...) returns "num1,num2,num3" per cluster-key
      select unfold(tb1.x,0.0,0.0,0.0,',') as (col1,col2,col3)
      from (select RA('newcenter', attr1,attr2,attr3,attr4) as x from table group by cluster-key) tb1
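To make the purpose of unfold on slide 11 concrete, here is a base R sketch of the same conversion done locally: one delimited string per cluster key becomes one typed row. The helper `unfold_local` and the sample strings are hypothetical; the real unfold runs inside Hive as a UDTF.

```r
# Hypothetical RUDAF results: one ','-delimited string per cluster key.
ra_result <- c(k1 = "1.5,2.0,3.25", k2 = "4.0,5.5,6.0")

# Split each string on the delimiter and convert the pieces to numeric,
# roughly what unfold(x, 0.0, 0.0, 0.0, ',') requests from Hive.
unfold_local <- function(strings, delimiter = ",") {
  parts <- strsplit(strings, delimiter, fixed = TRUE)
  mat <- do.call(rbind, lapply(parts, as.numeric))
  df <- as.data.frame(mat)
  names(df) <- paste0("col", seq_len(ncol(df)))
  df
}

centers <- unfold_local(ra_result)
print(centers)
```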
  • 12. napply and sapply
      rhive.napply(table-name,FUN,col1,...)
      rhive.sapply(table-name,FUN,col1,...)
      • napply : R apply function for Numeric type
      • sapply : R apply function for String type
      Example : R function which sums all passed columns
      sumCols <- function(arg1,...) { sum(arg1,...) }
      result <- rhive.napply("tab", sumCols, col1, col2, col3, col4)
      rhive.load.table(result)
  • 13. napply
      • 'napply' is similar to the R apply function
      • Stores a big result in HDFS as a Hive table
      rhive.napply <- function(tablename, FUN, col = NULL, ...) {
        if(is.null(col)) cols <- "" else cols <- paste(",",col)
        for(element in c(...)) { cols <- paste(cols,",",element) }
        exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="")
        rhive.assign(exportname,FUN)
        rhive.exportAll(exportname)
        tmptable <- paste(exportname,"_table",sep="")
        rhive.query(paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep=""))
        tmptable
      }
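The core of the function on slide 13 is string construction: it wraps the exported R function in an `R(...)` call inside a CREATE TABLE ... AS SELECT statement. The sketch below isolates that step so it can be inspected without a cluster. `build_napply_query` is a hypothetical helper written for this illustration, not part of RHive.

```r
# Hypothetical helper: builds the HiveQL string that slide 13 passes to
# rhive.query, without contacting Hive.
build_napply_query <- function(tablename, exportname, ...) {
  cols <- paste(c(...), collapse = ",")
  tmptable <- paste(exportname, "_table", sep = "")
  paste("CREATE TABLE ", tmptable, " AS SELECT R('", exportname, "',",
        cols, ",0.0) FROM ", tablename, sep = "")
}

q <- build_napply_query("tab", "tab_sapply1", "col1", "col2")
print(q)
```

The trailing `0.0` plays the role of the TYPE argument from slide 9: it tells Hive that the R function returns a numeric value.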
  • 14. aggregate
      rhive.aggregate(table-name,hive-FUN,...,groups)
      • RHive aggregation function that aggregates data stored in HDFS using a Hive function
      Example : Aggregate using SUM (Hive aggregation function)
      result <- rhive.aggregate("emp", "SUM", sal, groups="deptno")
      rhive.load.table(result)
  • 15. Examples - predict flight delay
      library(RHive)
      rhive.connect()
      - Retrieve training set from large dataset stored in HDFS
      train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")
      train$arrdelay <- as.numeric(train$arrdelay)
      train$distance <- as.numeric(train$distance)
      train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]
      - Native R code
      model <- lm(arrdelay ~ distance + dayofweek,data=train)
      - Export R object data
      rhive.assign("model", model)
      - Analyze big data using model calculated by R (HiveQuery + R code)
      predict_table <- rhive.napply("airlines",function(arg1,arg2,arg3) {
        if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
        res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
        return(as.numeric(res))
      }, 'dayofweek', 'arrdelay', 'distance')
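The fit-then-score pattern on slide 15 can be tried locally before involving Hive. The sketch below uses synthetic data generated on the spot (it is not the airline dataset, and the coefficients are arbitrary), and `score` is a hypothetical per-row function shaped like the one passed to rhive.napply. It checks for NA rather than NULL, which is an assumption about how missing values would arrive.

```r
set.seed(1)  # synthetic data only; not the airline dataset
train <- data.frame(
  dayofweek = sample(1:7, 200, replace = TRUE),
  distance  = runif(200, 100, 2500)
)
train$arrdelay <- 5 + 0.01 * train$distance + rnorm(200, sd = 10)

# Same model formula as on the slide.
model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Per-row scoring function; only the predictors are needed as inputs.
score <- function(dayofweek, distance) {
  if (is.na(dayofweek) || is.na(distance)) return(0.0)
  newrow <- data.frame(dayofweek = dayofweek, distance = distance)
  as.numeric(predict(model, newrow))
}

print(score(3, 1000))
```

Since the formula is `arrdelay ~ distance + dayofweek`, prediction only reads `distance` and `dayofweek`; the `arrdelay` column supplied on the slide is not used by `predict.lm`.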
  • 17. Conclusion
      • RHive supports HQL rather than the MapReduce programming style
      • RHive allows analysts to do everything in the R console
      • RHive lets R data and HDFS data interact
      • Future & Current Works
          • Integrate Hadoop HDFS
          • Support Transform/Map-Reduce Scripts
          • Distributed Rserve
          • Support more R style API
          • Support machine learning algorithms (k-means, classifier, ...)
  • 18. Cooperators
      • JunHo Cho
      • Seonghak Hong
      • Choonghyun Ryu
      • YOU!
  • 19. How to join RHive project
      • Logo
      • github (https://github.com/nexr/RHive)
      • CRAN (http://cran.r-project.org/web/packages/RHive)
      • Welcome to join RHive project
  • 20. References
      • Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)
      • RHIPE (http://ml.stat.purdue.edu/rhipe)
      • Hive (http://hive.apache.org)
      • Parallel R by Q. Ethan McCallum and Stephen Weston
  • 23. RHIPE
      • The R and Hadoop Integrated Processing Environment
      • Must understand the MapReduce model
      map <- function() {...}
      reduce <- function() {...}
      rmr <- rhmr(map,reduce,...)
      (diagram labels: shuffle / sort, Mapper, Reducer, ProtocolBuf, R, Fork R, RHMR, R Objects (map), R Objects (reduce), R Objects (map, reduce), PersonalServer, R Conf, HDFS)
  • 24. RHadoop
      • Manipulate Hadoop data stores and HBASE directly from R
      • Write MapReduce models in R using Hadoop Streaming
      • Must understand the MapReduce model
      map <- function() {...}
      reduce <- function() {...}
      mapreduce(input,output,map,reduce,...)
      (diagram labels: R, rhbase, rhdfs, rmr, execute hadoop streaming, Hadoop Streaming, manipulate, shuffle / sort, store R objects as file, Mapper, Reducer, HBASE, HDFS)
  • 25. hive (Hadoop InteractiVE)
      • R extension facilitating distributed computing via the MapReduce paradigm
      • Provides an interface to Hadoop, HDFS and Hadoop Streaming
      • Must understand the MapReduce model
      map <- function() {...}
      reduce <- function() {...}
      hive_stream(map,reduce,...)
      (diagram labels: R, hive, DFS, hive_stream, execute hadoop streaming, Hadoop Streaming, manipulate, shuffle / sort, save R script on local, Mapper, Reducer, HDFS)
  • 26. segue
      • Simple parallel processing with a fast and easy setup on Amazon Web Services
      • Parallel lapply function for the EMR engine using Hadoop streaming
      • Does not support the full MapReduce model, only the Map step
      data <- list(...)
      emrlapply(clusterObject,data,FUN,...)
      (diagram labels: Amazon S3, upload R objects, R, emrlapply, awsFunctions, EMR, Hadoop Streaming, R Mapper, save R objects (data + FUN) on local, bootstrap (setup R), mapper.R)
  • 27. Ricardo
      • Integrates R and Jaql (JSON Query Language)
      • Must know how to use an uncommon query language, Jaql
      • Not open-source
      Ref : Ricardo-SIGMOD10