RHive tutorials - Basic functions

Learn how to use the basic functions of RHive by reading this document.

This document was updated on 5 March 2012.

  • Updated on 5 March 2012.
    Descriptions were added for
    rhive.basic.by,
    rhive.basic.cut,
    rhive.basic.cut2,
    rhive.basic.merge,
    rhive.basic.mode,
    rhive.basic.range,
    rhive.basic.scale,
    rhive.basic.t.test,
    rhive.basic.xtabs,
    rhive.block.sample

    Thanks to contributors.

RHive tutorial - basic functions

This tutorial explains how to load the RHive library and use the basic functions of RHive.

Loading RHive

Load RHive the same way you load any other R package:

    library(RHive)

Before loading RHive, make sure that the HADOOP_HOME and HIVE_HOME environment variables are configured. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. If they are not set, you can set them temporarily before loading the library, as follows. Consult "RHive tutorial - RHive installation and setting" for details on the environment variables.

    Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
    Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
    library(RHive)

rhive.init

rhive.init initializes RHive internally. If the environment variables were set correctly before RHive was loaded, it runs automatically. If they were not configured when RHive was loaded via library(RHive), calling an RHive function results in the following error message:

    rhive.connect()
    Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl,  :
      No running JVM detected. Maybe .jinit() would help.
    Error in .jfindClass(as.character(class)) :
      No running JVM detected. Maybe .jinit() would help.
In this case, set HIVE_HOME and HADOOP_HOME as shown below and call rhive.init(), or exit R, configure the environment variables, and restart R.

    Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
    Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
    rhive.init()

Or:

    close R
    export HIVE_HOME="/service/hive-0.7.1"
    export HADOOP_HOME="/service/hadoop-0.20.203.0"
    open R

rhive.connect

RHive functions work only after a connection to the Hive server has been made. If you call other RHive functions without first establishing a connection with rhive.connect, they fail with the following error:

    Error in .jcast(hiveclient[[1]], new.class = "org/apache/hadoop/hive/service/HiveClient",  :
      cannot cast anything but Java objects

Establishing a connection with the Hive server is simple:

    rhive.connect()

rhive.connect also accepts additional arguments:

    rhiveConnection <- rhive.connect("10.1.1.1")

If the Hive server is installed on a different machine from the one where RHive is installed and you have to connect remotely, make the connection by passing arguments to the rhive.connect function.
If you have multiple Hadoop and Hive clusters, have configured RHive for each of them, and want to switch between the Hives, then, just as with a DB client such as MySQL, create the connections and pass them to the functions as arguments to select a connection explicitly.

rhive.query

If you have experience with Hive, you probably know that Hive supports SQL syntax for handling data on Map/Reduce and HDFS. rhive.query sends SQL to Hive and receives the results from Hive. Users who know SQL will find the following example familiar.

    rhive.query("SELECT * FROM usarrests")

Running the example above prints the contents of a table named 'usarrests' on the screen. Instead of only printing the returned result, you can also assign it to a data.frame object.

    resultDF <- rhive.query("SELECT * FROM usarrests")

Beware: if the data returned from rhive.query is bigger than the memory of the RHive server or your laptop, the available memory is exhausted and an error results. Do not pull data of that size into an R object. It is better to create a temporary table first and then store the results of the SQL in that table, as follows.

    rhive.query("
    CREATE TABLE new_usarrests (
      rowname  string,
      murder   double,
      assault  int,
      urbanpop int,
      rape     double
    )")
    rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")

Consult the Hive documentation for a detailed account of how to use Hive SQL.

rhive.close

When you have finished using Hive and no longer need the RHive functions, use the rhive.close function to terminate the connection.

    rhive.close()

Alternatively, you can pass a specific connection to close it.

    conn <- rhive.connect()
    rhive.close(conn)

rhive.list.tables

The rhive.list.tables function returns the list of tables in Hive.

    rhive.list.tables()
           tab_name
    1         aids2
    2 new_usarrests
    3     usarrests

This is effectively identical to this:

    rhive.query("SHOW TABLES")

rhive.desc.table

The rhive.desc.table function shows the description of the chosen table.

    rhive.desc.table("usarrests")
      col_name data_type comment
    1  rowname    string
    2   murder    double
    3  assault       int
    4 urbanpop       int
    5     rape    double

This is effectively identical to this:

    rhive.query("DESC usarrests")

rhive.load.table

The rhive.load.table function loads the contents of a Hive table as an R data.frame object.

    df1 <- rhive.load.table("usarrests")
    df1

This is effectively identical to this:

    df1 <- rhive.query("SELECT * FROM usarrests")
    df1

rhive.write.table

The rhive.write.table function is the counterpart of rhive.load.table, and it saves more work. Normally, to add data to a table in Hive, you must first create the table. rhive.write.table needs no such preparation: it creates a Hive table from an R data frame and inserts all of the data.

    library(MASS)  # UScrime ships with the MASS package
    head(UScrime)
        M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
    1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
    2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
    3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006  578
    4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
    5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
    6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995  682

    rhive.write.table(UScrime)
    [1] "UScrime"

    rhive.list.tables()
           tab_name
    1         aids2
    2 new_usarrests
    3     usarrests
    4       uscrime

    rhive.query("SELECT * FROM uscrime LIMIT 10")
       rowname   m so  ed po1 po2  lf   mf pop  nw  u1 u2 gdp ineq     prob    time    y
    1        1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
    2        2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
    3        3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006  578
    4        4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
    5        5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
    6        6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995  682
    7        7 127  1 111  82  79 519  982   4 139  97 38 620  168 0.042100 20.6993  963
    8        8 131  1 109 115 109 542  969  50 179  79 35 472  206 0.040099 24.5988 1555
    9        9 157  1  90  65  62 553  955  39 286  81 28 421  239 0.071697 29.4001  856
    10      10 140  0 118  71  68 632 1029   7  15 100 24 526  174 0.044498 19.5994  705

The rhive.write.table function raises an error and does not work if the table to be saved into Hive already exists. Hence, before saving a data frame whose name matches a table already in Hive, you must delete that table and only then call rhive.write.table.

    if (rhive.exist.table("uscrime")) {
      rhive.query("DROP TABLE uscrime")
    }

    rhive.write.table(UScrime)
RHive - alias functions

RHive's function names look as if they follow the naming rules of S3 generics, but many of them are not actually generic. The names are kept for S3 generic functions that RHive may or may not support in the future. For users who dislike the confusion caused by functions that contain "." yet are not generic, some functions have aliases with different names that serve the same roles. The alias functions are listed below.

hiveConnect
This is the same as rhive.connect.

hiveQuery
This is the same as rhive.query.

hiveClose
This is the same as rhive.close.

hiveListTables
This is the same as rhive.list.tables.

hiveDescTable
This is the same as rhive.desc.table.

hiveLoadTable
This is the same as rhive.load.table.
rhive.basic.cut

rhive.basic.cut converts one numerical column of a table into one factorized column. First, the range of the numerical column is divided into intervals; then each value in the column is factorized according to the interval it falls into.

rhive.basic.cut receives the following six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates which ends of the intervals are closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. summary = TRUE returns the total counts of values falling into each interval; with summary = FALSE, the name of a new table containing the factorized column is returned. forcedRef = TRUE forces rhive.basic.cut to return a table name; with forcedRef = FALSE a data frame is returned. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.

Example for summary = FALSE

    > table_name = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = FALSE, forcedRef = TRUE)
    > table_name
    [1] "rhive_result_1330382904"
    attr(,"result:size")
    [1] 4296
    > results = rhive.query("select * from rhive_result_1330382904")
    > head(results)
      rowname sepalwidth petallength petalwidth species sepallength
    1       1        3.5         1.4        0.2  setosa        NULL
    2       2        3.0         1.4        0.2  setosa   [4.5,5.0)
    3       3        3.2         1.3        0.2  setosa   [4.5,5.0)
    4       4        3.1         1.5        0.2  setosa   [4.5,5.0)
    5       5        3.6         1.4        0.2  setosa        NULL
    6       6        3.9         1.7        0.4  setosa        NULL

Example for summary = TRUE

    > summary = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
    > summary
         NULL [4.0,4.5) [4.5,5.0)
          128         4        18

rhive.basic.cut2

rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each numerical column is divided into intervals, and each value is factorized according to the interval it falls into.

rhive.basic.cut2 receives the following eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 give the numerical cut points for the two columns. right indicates which ends of the intervals are closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two numerical columns after the conversion; otherwise the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name; with forcedRef = FALSE a data frame is returned. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.

Example for right = TRUE and keepCol = FALSE

    > table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = TRUE, keepCol = FALSE, forcedRef = TRUE)
    > table_name
    [1] "rhive_result_1330385833"
    attr(,"result:size")
    [1] 5272
    > results = rhive.query("select * from rhive_result_1330385833")
    > head(results)
      rowname sepalwidth petalwidth species sepallength petallength rep
    1       1        3.5        0.2  setosa        NULL   (1.0,1.5]   1
    2       2        3.0        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
    3       3        3.2        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
    4       4        3.1        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
    5       5        3.6        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
    6       6        3.9        0.4  setosa        NULL   (1.5,2.0]   1

Example for right = FALSE and keepCol = TRUE

    > table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = FALSE, keepCol = TRUE, forcedRef = TRUE)
    > table_name
    [1] "rhive_result_1330315663"
    attr(,"result:size")
    [1] 6374
    > results = rhive.query("select * from rhive_result_1330315663")
    > head(results)
      rowname sepalwidth petalwidth species sepallength sepallength_cut petallength petallength_cut rep
    1       1        3.5        0.2  setosa         5.1            NULL         1.4       [1.0,1.5)   1
    2       2        3.0        0.2  setosa         4.9       [4.5,5.0)         1.4       [1.0,1.5)   1
    3       3        3.2        0.2  setosa         4.7       [4.5,5.0)         1.3       [1.0,1.5)   1
    4       4        3.1        0.2  setosa         4.6       [4.5,5.0)         1.5       [1.5,2.0)   1
    5       5        3.6        0.2  setosa         5.0            NULL         1.4       [1.0,1.5)   1

rhive.basic.xtabs

rhive.basic.xtabs builds a contingency table from cross-classifying factors. It takes a formula object and a table name as input arguments and returns a contingency table in matrix format based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two column names agegp and alcgp are the cross-classifying factors, and the observations for each combination of these factors are summed over another column, ncontrols.

Example for esoph data

    > xtab_formula = as.formula(paste("ncontrols", "~", "agegp", "+", "alcgp", sep = ""))
    > xtab_formula
    ncontrols ~ agegp + alcgp
    > table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
    > head(table_result)
           alcgp
    agegp   0-39g/day 120+ 40-79 80-119
      25-34        61    5    45      5
      35-44        89   10    80     20
      45-54        78   15    81     39
      55-64        89   26    84     43
      65-74        71    8    53     29
      75+          27    3    12      2

rhive.basic.t.test

The rhive.basic.t.test function runs Welch's t-test on two samples. The difference between the two sample means is tested against the alternative hypothesis "the difference between the two sample means is not 0". Thus, a two-sided test is performed.
The following example tests the difference between the mean sepal length and the mean petal length of the irises. Note how the function is called with the "sepallength" and "petallength" columns.

    > rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
    [1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
    $statistic
           t
    13.14223

    $parameter
          df
    211.5427

    $p.value
    [1] 0

    $estimate
    $estimate[[1]]
    mean of x
     5.843333

    $estimate[[2]]
    mean of y
        3.758

    >

The reported p-value is 0, which indicates a difference between the means of sepal length and petal length. The resulting statistics are returned as an R list object, and a string assembled from those statistics is printed to the console.

The iris data set, provided with R, has 150 observations. Running R's t.test on the same data gives a slightly different t-statistic of 13.0984. The reason is that the t.test function uses the sample variance to compute the t-statistic, while the rhive.basic.t.test function uses the population variance. With little data, as in this example, the t-statistics can deviate, but the deviation shrinks as the data grows. Because rhive.basic.t.test was written with massive data analysis in mind, it uses the population variance for speed of calculation.
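The gap between the two statistics can be checked locally in plain R. The following is a minimal sketch, not RHive's implementation: it assumes the only difference is the variance estimator, it uses R's built-in iris data frame (columns Sepal.Length and Petal.Length) rather than the Hive table, and the helper name pop_var is made up for illustration.

```r
# Compare Welch's t-statistic computed with the sample variance (as t.test does)
# against the same statistic computed with the population variance.
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Sample variance: t.test with its default var.equal = FALSE is Welch's test.
t_sample <- unname(t.test(x, y)$statistic)

# Population variance: divide by n instead of n - 1 (hypothetical helper).
pop_var <- function(v) mean((v - mean(v))^2)
t_pop <- (mean(x) - mean(y)) /
  sqrt(pop_var(x) / length(x) + pop_var(y) / length(y))

print(round(c(sample = t_sample, population = t_pop), 4))
```

Under that assumption the two values agree with the figures quoted above (about 13.0984 and 13.1422). With equal sample sizes n, the two statistics differ only by the factor sqrt(n / (n - 1)), which approaches 1 as n grows.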
rhive.block.sample

The percent argument is optional and sets the percentage of data to extract from the total data. Its default value is 0.01, which means that 0.01% of the total data is extracted. However, percent is not the ratio of the sampled row count to the total row count; it is closer to the ratio of sampled blocks to total blocks, because the rhive.block.sample function takes samples by the block. As a result, the entire data set may be returned when rhive.block.sample is used on a small Hive table. This occurs when the data is smaller than the block size set in Hive.

The seed argument specifies the random seed used when Hive executes block sampling. With identical random seeds, Hive's block sampling returns the same results. To get a different random sample on every call, assign the seed argument a value generated with R's sample function.

The subset argument is optional and specifies a condition on the data to be extracted from the target Hive table when sampling blocks. It is a character value that corresponds to the where clause in Hive HQL, so it must use syntax valid for an HQL where clause.

rhive.block.sample returns, as a character value, the name of the Hive table that contains the sampled blocks. That is, the function uses block sampling to create a temporary Hive table automatically and returns that table's name. The following example samples 0.01% of the data in the Hive table called listvirtualmachines, using R's sample function to generate the random seed for Hive's block sampling.

    seedNumber <- sample(1:2^16, 1)
    rhive.block.sample("listvirtualmachines", seed=seedNumber)
    [1] "rhive_sblk_1330404552"
In this example, a Hive table named "rhive_sblk_1330404552" has been created, holding 0.01% of the data from the Hive table "listvirtualmachines".

rhive.basic.scale

The rhive.basic.scale function standardizes numerical data to mean 0 and standard deviation 1. Pass the table name as the first argument and the column name as the second. The returned list holds, as a string, the name of a table to which a "scaled_<column name>" column has been added. This table can be accessed and edited in RHive just like other Hive tables.

    scaled <- rhive.basic.scale("iris", "sepallength")
    attr(scaled, "scaled:center")
    # [1] 5.843333
    attr(scaled, "scaled:scale")
    # [1] 0.8253013
    > rhive.desc.table(scaled[[1]])
    #             col_name data_type comment
    # 1            rowname    string
    # 2         sepalwidth    double
    # 3        petallength    double
    # 4         petalwidth    double
    # 5            species    string
    # 6        sepallength    double
    # 7 sacled_sepallength    double

rhive.basic.by

The rhive.basic.by function runs a group by on a specified column. The code below applies group by to the "species" column and returns the result of applying the sum function to "sepallength". The result shows the sum of sepallength for each species.

    rhive.basic.by("iris", "species", "sum", "sepallength")
    #      species   sum
    # 1     setosa 250.3
    # 2 versicolor 296.8
    # 3  virginica 329.4

rhive.basic.merge

rhive.basic.merge creates a new data set by merging two tables on the values their key columns have in common.

    # checking data
    rhive.query("select * from iris limit 5")
      rowname sepallength sepalwidth petallength petalwidth species
    1       1         5.1        3.5         1.4        0.2  setosa
    2       2         4.9        3.0         1.4        0.2  setosa
    3       3         4.7        3.2         1.3        0.2  setosa
    4       4         4.6        3.1         1.5        0.2  setosa
    5       5         5.0        3.6         1.4        0.2  setosa

    rhive.query("select * from usarrests limit 5")
         rowname murder assault urbanpop rape
    1    Alabama   13.2     236       58 21.2
    2     Alaska   10.0     263       48 44.5
    3    Arizona    8.1     294       80 31.0
    4   Arkansas    8.8     190       50 19.5
    5 California    9.0     276       91 40.6

    ## rhive.basic.merge
    rhive.basic.merge("iris", "usarrests", by.x = "sepallength", by.y = "murder")
      sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
    1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
    2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
    3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
    4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
    5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10

Merge is similar to a join in SQL. The following query produces the same result.

    # Use a join to select the key column and the columns that the two tables do not share.
    # Where column names overlap (rowname), only the column of the first table is printed.
    rhive.big.query("select a.sepallength,a.sepalwidth,a.petallength,a.petalwidth,a.species,b.assault,b.urbanpop,b.rape,a.rowname from iris a join usarrests b on a.sepallength = b.murder")
      sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
    1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
    2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
    3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
    4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
    5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
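For comparison, the same kind of join can be tried locally with base R's merge on the built-in iris and USArrests data frames. This is only a sketch of the merge semantics, not RHive code: it assumes the Hive tables were created from these data sets, and the column names differ in case from the Hive tables (Sepal.Length and Murder instead of sepallength and murder).

```r
# Local counterpart of the merge above: base R's merge() performs an inner join by default.
merged <- merge(iris, USArrests, by.x = "Sepal.Length", by.y = "Murder")

# Each iris row is paired with every state whose murder rate equals its sepal length.
print(head(merged[, c("Sepal.Length", "Species", "Assault", "UrbanPop", "Rape")], 5))
```

Joining on two unrelated measurements is of course only a demonstration of the mechanics; in practice the key columns would identify the same entity in both tables.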
rhive.basic.mode

rhive.basic.mode returns the mode and its frequency within a specified column of the Hive table.

    rhive.basic.mode("iris", "sepallength")
      sepallength freq
    1           5   10

rhive.basic.range

rhive.basic.range returns the lowest and greatest values within the specified numerical column of the Hive table.

    rhive.basic.range("iris", "sepallength")
    [1] 4.3 7.9
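Both results can be cross-checked on a small data set with plain R. The sketch below assumes the Hive table iris holds the same values as R's built-in iris data frame; it runs locally and does not call RHive.

```r
# Mode and range of a numeric column, computed locally on the built-in iris data.
freq <- table(iris$Sepal.Length)                        # frequency of each distinct value
mode_value <- as.numeric(names(freq)[which.max(freq)])  # most frequent value
mode_freq <- as.integer(max(freq))                      # how often it occurs
value_range <- range(iris$Sepal.Length)                 # c(min, max)

print(data.frame(sepallength = mode_value, freq = mode_freq))
print(value_range)
```

If several values tie for the highest frequency, which.max returns only the first of them, so this sketch reports a single mode.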