RHive tutorial - apply functions and map/reduce


This is a tutorial about RHive's apply functions and its map/reduce support.


RHive supports the use of R's syntax and features for massive data processing by connecting to Hive. To that end, RHive provides several functions similar to the apply family in R, and supports writing map/reduce code as R scripts in a style similar to Hadoop streaming. These functions and features will be expanded in future versions.

rhive.napply, rhive.sapply

rhive.napply and rhive.sapply are essentially the same function; their only difference is the type of the returned value. The function passed to rhive.napply must return a numeric value, and the function passed to rhive.sapply must return a character value. Both functions take a table name, an R function to run for each record, and column names. The number of column arguments after the "function to be applied" must equal the number of arguments that function itself accepts. Under the surface, these functions use RHive's UDF support, so if you already know how RHive's UDFs work, you will easily understand them as well.

The following examples use the two functions. First, make a table for testing:

```
rhive.write.table('iris')
```

The following is an example of using rhive.napply:
```
rhive.napply('iris', function(column1) {
  column1 * 10
}, 'sepallength')

[1] "iris_napply1323970435_table"

rhive.desc.table("iris_napply1323970435_table")

  col_name data_type comment
1      _c0    double
```

The following is an example of using rhive.sapply:

```
rhive.sapply('iris', function(column1) {
  as.character(column1 * 10)
}, 'sepallength')

[1] "iris_sapply1323970891_table"

rhive.desc.table("iris_sapply1323970891_table")

  col_name data_type comment
1      _c0    string
```

Note that these functions do not return a data.frame; they return the name of a temporary table created during processing. It is up to the user to read from, and eventually delete, these result tables. This is because it is generally impossible to pass massive data back through standard output or a data.frame.

rhive.mrapply, rhive.mapapply, rhive.reduceapply

These functions have names similar to the ones above, but they expose Hadoop-style map/reduce in a form resembling Hadoop streaming. You can use them to implement wordcount, the example frequently seen in Hadoop streaming tutorials. Users who wish to write code in the traditional map/reduce style will need these functions, and they are easy to use. rhive.mapapply takes a table and columns as arguments and runs the supplied function over them, but only as a map step. rhive.reduceapply performs only a reduce step.
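As a rough illustration of a map-only call, the sketch below follows the same argument shape as the rhive.mrapply wordcount later in this tutorial, with the reduce-related arguments left out. It is a hypothetical sketch, not verified code: the exact rhive.mapapply signature may differ between RHive versions, it assumes an active RHive connection, and it assumes the "iris" table written earlier has "species" and "sepallength" columns.

```r
# Hypothetical sketch: emit (species, sepallength * 10) pairs using only a map step.
# Assumes an active RHive connection and the 'iris' Hive table from above;
# the argument layout mirrors rhive.mrapply and may differ in your RHive version.
map <- function(key, value) {
  # key: the species column, value: the sepallength column (assumed layout)
  lapply(seq_along(value), function(i) put(key[i], value[i] * 10))
}

result_table <- rhive.mapapply("iris", map,
                               c("species", "sepallength"),  # input columns
                               c("species", "scaled"))       # output column aliases
```

Because this needs a live Hive cluster, treat it as pseudocode until checked against your installed RHive.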
rhive.mrapply performs both map and reduce. Use rhive.mapapply if you only need the map step, rhive.reduceapply if you only need the reduce step, and rhive.mrapply if you need both. You will probably find yourself using rhive.mrapply most often.

The following is a wordcount example using rhive.mrapply. First, let's make a dataset to run wordcount on. We will use a text web browser called lynx to fetch the "An Introduction to R" page and save it to a Hive table. First, install lynx (if you already have some other text file and would rather use it than install lynx, that is fine as well):

```
yum install lynx
```

Then, in R, save the downloaded text to a Hive table:

```
system("lynx --dump http://cran.r-project.org/doc/manuals/R-intro.html > /tmp/r-intro.txt")
rintro <- readLines("/tmp/r-intro.txt")
unlink("/tmp/r-intro.txt")
rintro <- data.frame(rintro)
colnames(rintro) <- c("rawtext")
rhive.write.table(rintro)

[1] "rintro"

rhive.desc.table("rintro")

  col_name data_type comment
1  rowname    string
2  rawtext    string
```

The RHive code that performs a wordcount on the "rintro" table is as follows:
```
map <- function(key, value) {
  if (is.null(value)) {
    put(NA, 1)
  }
  lapply(value, function(v) {
    lapply(strsplit(x = v, split = " ")[[1]], function(word) put(word, 1))
  })
}

reduce <- function(key, values) {
  put(key, sum(as.numeric(values)))
}

result <- rhive.mrapply("rintro", map, reduce,
                        c("rowname", "rawtext"), c("word", "one"),
                        by = "word",
                        c("word", "one"), c("word", "count"))

head(result)

          word count
1              26927
2            "     1
3        "%!%"     1
4         "+")     1
5          "."     1
6 ".GlobalEnv"     3
```

The above is similar to map/reduce code in Hadoop streaming style, but the rhive.mrapply call at the end is probably unfamiliar. RHive processes map/reduce through Hive, so the input must be given as a table name and columns. And because it is difficult to work out automatically what the inputs and outputs of each map and reduce step are, the user must specify them, along with the aliases to be used. Take a look at the last expression of the above example again:

```
rhive.mrapply("rintro", map, reduce, c("rowname", "rawtext"),
              c("word", "one"), by = "word", c("word", "one"),
              c("word", "count"))
```
The first argument, "rintro", is the name of the table to be processed. The map and reduce arguments that follow are the R functions that implement the map and reduce steps, respectively. The fourth argument, c("rowname", "rawtext"), names the columns of the "rintro" table passed to the map function: the first is used as the key, the second as the value. The fifth argument, c("word", "one"), describes the map output, and the sixth argument, by = "word", names the column to aggregate on among the map output. The seventh and eighth arguments are the input and output of reduce, respectively. So many arguments can be confusing, but they are all necessary for Hive-based map/reduce; this function will see many improvements in the future.

Under the surface, rhive.mrapply processes map/reduce tasks by generating Hive SQL. This means you do not have to use rhive.mrapply for the example above: you could instead work directly with the functions RHive provides and the Hive tables. But that too involves unfamiliar and difficult syntax, which is why RHive provides functions such as these.

rhive.mapapply and rhive.reduceapply are used in almost the same way as rhive.mrapply, so this tutorial omits explaining them.

If you already have a map/reduce module written with Hadoop streaming or the Hadoop library and wish to convert it to RHive, you may run into many differences during the conversion. RHive does not replace Hadoop streaming, nor does it replace the Hadoop library; it is merely a convenience to help R users approach Hadoop and Hive. If you are trying to take a high-performance map/reduce module, or a binary executable written in C/C++ or another language, and run it as map/reduce, then the Hadoop library or Hadoop streaming may yet be the better choice.
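Finally, tying back to the earlier note that the apply functions return temporary table names rather than data.frames: a typical pattern is to read only what you need from the result table and then drop it. The sketch below assumes an active RHive connection; rhive.query and rhive.drop.table are part of RHive's table-handling functions, but check their behavior against your installed version.

```r
# Sketch: consume and clean up a temporary result table (assumes a live
# RHive connection and the 'iris' Hive table created earlier).
tmp <- rhive.napply('iris', function(column1) { column1 * 10 }, 'sepallength')

# Pull back only a small preview, not the whole (possibly massive) table.
preview <- rhive.query(paste("SELECT * FROM", tmp, "LIMIT 10"))
print(preview)

# Delete the temporary table once you are done with it.
rhive.drop.table(tmp)
```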