RHive tutorial - advanced functions
RHive supports basic Functions, HDFS Functions, "apply"-like Functions for
Map/Reduce programming, and advanced Functions that support UDF, UDAF, and
UDTF. The advanced Functions are the UDF/UDAF Functions used to implement
Hive's UDF and UDAF in R.

If you use these Functions you can apply R down to the lower levels of the
stack and perform Map/Reduce programming on top of Hadoop and Hive. You can
also use R to handle a large portion of the complex algorithms and data
processing procedures found in Data Mining.

RHive - UDF and UDAF functions
Hive provides UDF (User Defined Function), UDAF (User Defined Aggregate
Function), and UDTF (User Defined Table-generating Function).
UDF refers to Functions built into Hive's SQL syntax, such as count, avg,
min, and max, which perform calculations over one column or multiple
columns.
Hive comes with support for several UDFs and allows users to add more.
Originally, however, UDF, UDAF, and UDTF must all be written in Java and
cannot be written directly in the R language.
RHive lets you write UDFs and UDAFs in the R language and make them usable
in Hive, and it ships a ready-made UDTF that you can use to split a complex
column into appropriate columns.

This tutorial explains how to implement these UDFs and UDAFs through
examples using RHive.

rhive.assign
The rhive.assign Function assigns the Functions and variables made in R so
that they may be referenced from Hive.
Such assigning is required because we need to select which variables and
Functions will be used in the distributed environment.
One thing to beware of: rhive.assign only performs the preparatory work
needed before deploying Functions, variables, and Objects to the
distributed environment; it does not itself deploy anything.

It is just a preparation step and is normally used together with
rhive.export or rhive.exportAll, which perform the actual deployment.
Deployment here means, as in Hadoop's Map/Reduce, putting the Functions and
Objects that Hive will use on standby within the job nodes, so they can be
loaded whenever needed for processing data.
The rhive.assign Function takes 2 arguments: the first is a character alias
for the symbol created in R, and the second is the symbol to be deployed.
If the symbol that will be deployed into the distributed environment is
named "newsum", then use rhive.assign to assign it like this:

newsum <- function(value) {
  value + 1
}

rhive.assign("newsum", newsum)

The syntax may seem a bit odd; this is due to structural constraints in R
and will be improved in the future.
The first argument is simply the symbol passed as the second argument,
written as a string.
Normally, using a string identical to the symbol name makes working with it
easier, but this is not mandatory. However, if you choose a different name,
keep in mind that the assigned symbol will be referenced under that new
name; to avoid confusion it is best not to change it.
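
As a hedged sketch of what a renamed assignment would look like (the alias
"mysum" and the table and column below are illustrative, not from the
original tutorial), any Hive-side reference must then use the alias, not
the original R name:

# Hypothetical sketch: assigning newsum under a different alias "mysum".
rhive.assign("mysum", newsum)
rhive.export("mysum")
# Hive-side SQL would then reference the alias, e.g.:
# rhive.query("SELECT R('mysum', somecolumn, 0.0) FROM sometable")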

Look at the following example to figure out how to use rhive.assign.

The following example makes a Function, "sum3values", and uses
rhive.assign to prepare it for deployment to the distributed environment.

sum3values <- function(a, b, c) {
  a + b + c
}

rhive.assign("sum3values", sum3values)
[1] TRUE

You can also assign objects that are not Functions.

coef1 <- 3.141593

rhive.assign("coef1", coef1)

Anything, including data frames, is viable as well.
library(MASS)
> head(cats)
  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6
> rhive.assign("cats", cats)
[1] TRUE

Objects and data frames can thus be used with rhive.assign.
But be careful: once these objects are deployed to the distributed
environment, each node holds them in its own local memory, not in shared
memory.
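
As a hedged sketch of why this matters (the Function below is illustrative,
not part of the original tutorial), an exported object is simply copied to
every job node, so a UDF that reads it sees that node's private copy:

# Hypothetical sketch: a UDF referencing the exported object coef1.
# Every job node works on its own local copy of coef1; changing it on
# one node does not affect the others.
scaleByCoef <- function(value) {
  value * coef1
}
rhive.assign("coef1", coef1)
rhive.assign("scaleByCoef", scaleByCoef)
rhive.export("coef1")
rhive.export("scaleByCoef")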

rhive.export
rhive.export performs the actual deployment of objects made in R.
rhive.export is used together with rhive.assign: the names of the objects
it receives as arguments must first be made ready for deployment via
rhive.assign.

You can easily learn how it is used from the following example.

sum3values <- function(a, b, c) {
  a + b + c
}

rhive.assign("sum3values", sum3values)

rhive.export(sum3values)

Aside from the first argument, exportname, rhive.export has several other
arguments.
These are for servers and ports that can be assigned separately so the
Function can be used in a complex setup; under normal circumstances they
are not required.
If your environment is complex enough to warrant them, consult the manual
or contact the RHive development team.

rhive.exportAll
Although rhive.exportAll is functionally similar to rhive.export, the
difference is that rhive.exportAll deploys all symbols whose names start
with the string given as its first argument.
This Function exists to deploy UDAF Functions written in R.
A UDAF written in R consists of 4 Functions that share a common name
prefix; rhive.exportAll is a convenience built on rhive.export for
deploying all of them at once.

The following example shows the making of 4 Functions and deploying them
to a distributed environment.

# runs during the map/combine step: folds one record into the accumulator
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

# emits the partial aggregate from a map task
sumAllColumns.partial <- function(values) {
  values
}

# runs during reduce: merges partial aggregates
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

# produces the final value at the end of reduce
sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

The last line is actually the same as the following:

# rhive.exportAll("sumAllColumns")

rhive.export("sumAllColumns")
rhive.export("sumAllColumns.partial")
rhive.export("sumAllColumns.merge")
rhive.export("sumAllColumns.terminate")

rhive.assign, rhive.export, and rhive.exportAll only go as far as preparing
the Functions and Objects and sending them to the Rserve instances running
in the distributed environment.
In order to actually use these R Functions and Objects, you need to write
them into the SQL syntax passed to rhive.query.
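
As a hedged preview (the table and column names below are illustrative), a
deployed Function is invoked from HiveQL through RHive's R() wrapper, which
is covered in detail in the sections that follow:

# Hypothetical sketch: calling the deployed sum3values from Hive SQL.
rhive.query("SELECT R('sum3values', a, b, c, 0.0) FROM sometable")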

In the following examples, you will learn how to run the deployed Functions.

RHive - UDF usage
Now you must prepare a table to apply UDFs to.
If there is no suitable table in Hive, create one.
This tutorial will convert the data called USArrests into a Hive table.
USArrests is a small dataset which R provides by default.

Convert USArrests into a Hive table and save it, like below.
rhive.write.table(USArrests)

rhive.query("SELECT * FROM USArrests LIMIT 10")
       rowname murder assault urbanpop rape
1      Alabama   13.2     236       58 21.2
2       Alaska   10.0     263       48 44.5
3      Arizona    8.1     294       80 31.0
4     Arkansas    8.8     190       50 19.5
5   California    9.0     276       91 40.6
6     Colorado    7.9     204       78 38.7
7  Connecticut    3.3     110       77 11.1
8     Delaware    5.9     238       72 15.8
9      Florida   15.4     335       80 31.9
10     Georgia   17.4     211       60 25.8

The next example counts the total number of records in the USArrests table.

rhive.query("SELECT	
  COUNT(*)	
  FROM	
  USArrests")	
  
	
  	
  X_c0	
  
1	
  	
  	
  50	
  


The COUNT Function used in the above example is a standard SQL Function and
is also a Hive UDF.
Users familiar with SQL will recognize it as one of the most commonly used
built-in SQL Functions.
Now you will see how to write in R, and run, a Function that works like
these built-in Functions.

First look at USArrests table’s description like below.

rhive.desc.table("USArrests")	
  
	
  	
  col_name	
  data_type	
  comment	
  
1	
  	
  rowname	
  	
  	
  	
  string	
  
2	
  	
  	
  murder	
  	
  	
  	
  double	
  
3	
  	
  assault	
  	
  	
  	
  	
  	
  	
  int	
  
4	
  urbanpop	
  	
  	
  	
  	
  	
  	
  int	
  
5	
  	
  	
  	
  	
  rape	
  	
  	
  	
  double	
  

Now we’ll make and execute a Function that gets the total sum of all the
values of the murder, assault, and rape columns.

The following is the entire code for that.

library(RHive)
rhive.connect()

sumCrimes <- function(column1, column2, column3) {
  column1 + column2 + column3
}

rhive.assign("sumCrimes", sumCrimes)
rhive.export("sumCrimes")

rhive.query("SELECT rowname, urbanpop,
             R('sumCrimes', murder, assault, rape, 0.0)
             FROM usarrests")

rhive.close()

The results are as follows.

rhive.query("SELECT	
  rowname,	
  urbanpop,	
  R('sumCrimes',	
  murder,	
  
assault,	
  rape,	
  0.0)	
  AS	
  crimes	
  FROM	
  usarrests")	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  rowname	
  urbanpop	
  crimes	
  
1	
  	
  	
  	
  	
  	
  	
  	
  	
  Alabama	
  	
  	
  	
  	
  	
  	
  58	
  	
  270.4	
  
2	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  Alaska	
  	
  	
  	
  	
  	
  	
  48	
  	
  317.5	
  
3	
  	
  	
  	
  	
  	
  	
  	
  	
  Arizona	
  	
  	
  	
  	
  	
  	
  80	
  	
  333.1	
  
...	
  
48	
  	
  West	
  Virginia	
  	
  	
  	
  	
  	
  	
  39	
  	
  	
  96.0	
  
49	
  	
  	
  	
  	
  	
  Wisconsin	
  	
  	
  	
  	
  	
  	
  66	
  	
  	
  66.4	
  
50	
  	
  	
  	
  	
  	
  	
  	
  Wyoming	
  	
  	
  	
  	
  	
  	
  60	
  	
  183.4	
  

The important thing about the example above is the Function named "R()"
written within the SQL syntax.
This Function is a Hive UDF, not an R Function.
More precisely, it is a Function RHive adds to Hive so that RHive can run R
Functions inside Hive.
R() is used like built-in Functions such as sum, avg, or min: it calls the
R Function named by its first argument, receives the returned value, and
sends it to Hive.
So, to explain the R() Function used in the SQL syntax below:

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0)
FROM usarrests

'sumCrimes' is the name of the R Function deployed by rhive.export, and the
murder, assault, and rape that follow it are column names of the USArrests
Hive table.
And 0.0, the last argument passed to the R() Function, indicates the type
of the value the R() Function will return.
Enter 0.0 if the sumCrimes Function will return a numeric value, and enter
"" if it will return a character value.
For example, if the R Function returns a value of the character type, enter
the following:

rhive.query("SELECT	
  rowname,	
  urbanpop,	
  R('sumCrime',	
  murder,	
  
assault,	
  rape,	
  "")	
  FROM	
  usarrests")	
  

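As a hedged illustration (the Function, its logic, and the query below are
hypothetical, not from the original tutorial), a genuinely
character-returning UDF would pass "" as the type marker:

# Hypothetical sketch: a UDF that returns a character value, so the
# type marker passed to R() is '' rather than 0.0.
crimeLevel <- function(murder) {
  if (murder > 10.0) "high" else "low"
}
rhive.assign("crimeLevel", crimeLevel)
rhive.export("crimeLevel")

rhive.query("SELECT rowname, R('crimeLevel', murder, '') FROM usarrests")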

Finally, the syntax below returns a result composed of 3 columns, just as
seen in the result above.

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0)
FROM usarrests

Beware: in this particular example the R() Function returns only 1 value.
That is, looking at the SQL results above, you can see that the R()
Function ends up creating one new column value per row.


Actually, the SQL syntax used above can be processed with nothing more than
plain Hive SQL.
The two statements below produce the same results.
RHive UDF SQL

rhive.query("SELECT	
  rowname,	
  urbanpop,	
  R('sumCrime',	
  murder,	
  
assault,	
  rape,	
  "")	
  FROM	
  usarrests")	
  

Hive SQL

rhive.query("SELECT	
  rowname,	
  urbanpop,	
  murder	
  +	
  assault	
  +	
  rape	
  
AS	
  crimes	
  FROM	
  usarrests")	
  

This tutorial deliberately uses a trivial calculation for the sake of an
easy-to-learn example.
Hive already supports many UDFs and arithmetic operations on columns.

If you use RHive to process massive data and Hive SQL already supports a
solution, it is recommended to use that solution.
The following URL contains relevant details.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

If Hive does not support the feature and you still need to perform complex
calculations over multiple columns, data mining, or machine learning, then
the RHive UDF feature is very useful for such tasks.
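
As a hedged sketch of such a case (the Function below is illustrative, not
part of the original tutorial), here is a UDF applying a transformation
written in R rather than in SQL:

# Hypothetical sketch: a log-scaled crime index computed in R.
crimeIndex <- function(murder, assault, rape) {
  log(1.0 + murder + assault + rape)
}
rhive.assign("crimeIndex", crimeIndex)
rhive.export("crimeIndex")

rhive.query("SELECT rowname, R('crimeIndex', murder, assault, rape, 0.0)
             FROM usarrests")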

RHive - UDAF 1
Similar to UDF, there is something called UDAF.
UDAF supports aggregation: it is a Function that processes data grouped by
column values via the GROUP BY clause in SQL syntax.
Like UDF, a UDAF normally requires writing a Java module to add it to Hive,
but RHive enables you to write one in the R language.

It was not pointed out at the time, but you have already seen 4 UDAF
Functions named after sumAllColumns.
The Functions whose names start with sumAllColumns sum all the values of
the inputted arguments.
You may find yourself asking several questions.

The first may be, "Why does a UDAF need 4 Functions at once?" and the
second, "Where can a UDAF be used?"

To best understand this, take a look at runnable code.
First, in order to make a table suitable for applying the UDAF Functions,
take the iris data which R's datasets package contains by default and
upload it to Hive.

rhive.write.table(iris)

[1] "iris"
> rhive.list.tables()
       tab_name
1         aids2
2          iris
3 new_usarrests
4     usarrests

rhive.desc.table("iris")
     col_name data_type comment
1     rowname    string
2 sepallength    double
3  sepalwidth    double
4 petallength    double
5  petalwidth    double
6     species    string

rhive.query("SELECT * FROM iris LIMIT 10")
   rowname sepallength sepalwidth petallength petalwidth species
1        1         5.1        3.5         1.4        0.2  setosa
2        2         4.9        3.0         1.4        0.2  setosa
3        3         4.7        3.2         1.3        0.2  setosa
4        4         4.6        3.1         1.5        0.2  setosa
5        5         5.0        3.6         1.4        0.2  setosa
6        6         5.4        3.9         1.7        0.4  setosa
7        7         4.6        3.4         1.4        0.3  setosa
8        8         5.0        3.4         1.5        0.2  setosa
9        9         4.4        2.9         1.4        0.2  setosa
10      10         4.9        3.1         1.5        0.1  setosa

This gives you a general view of how the iris data is composed.

Now we shall gather the rows that share the same value in the species
column and get the sum of each column's values within each group.
The entire completed code, together with its results after running, looks
like this:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query("
  SELECT species, RA('sumAllColumns', sepallength, sepalwidth,
                     petallength, petalwidth)
  FROM iris
  GROUP BY species")

print(result)
     species
1     setosa
2 versicolor
3  virginica
                                                                         X_c1
1  250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2                            296.8,138.50000000000003,212.99999999999997,66.3
3               329.3999999999999,148.7,277.59999999999997,101.29999999999998
If you look at the printed results at the end of the run, you will see 3
newly created records, each with 2 columns.

The first thing to note is the RA() Function in the SQL syntax sent via
rhive.query.
It is similar to the R() Function.

The RA() Function returns only one value, and it is always of the
character type.
Hive processes the returned results and finally sends them back to RHive.
One thing to beware of: the RA() Function is a UDAF, so you must use SQL's
GROUP BY clause along with it.
Peruse the Hive documentation for further details.

To explain the SQL syntax above in detail:

SELECT species, RA('sumAllColumns', sepallength, sepalwidth,
                   petallength, petalwidth)
FROM iris
GROUP BY species

It computes a separate aggregate for each group of rows sharing a value in
the "species" column, with the groups formed by GROUP BY.

The first argument, sumAllColumns, is the common prefix of the 4 UDAF
Functions.
The remaining arguments are the columns to be processed, and the RA()
Function returns one value computed over all of them.

By now, you probably have unanswered questions raised and perhaps new
ones as well. Those yet to be explained will be explained in the next section.

RHive - UDAF 2
Now we'll talk about the 4 UDAF R Functions used in the previous examples
and the constraints they must satisfy.


A total of 4 Functions were made for UDAF and they are:

     •    sumAllColumns
     •    sumAllColumns.partial
     •    sumAllColumns.merge
     •    sumAllColumns.terminate
UDAF Functions for RHive must share a common prefix. Of the 4 Functions,
three end with .partial, .merge, and .terminate, and one carries no suffix
at all.
This naming rule must be kept in RHive.

The reason why 4 Functions are required is that there are 4 points in the
Map/Reduce procedure where Functions can run while a Hive UDAF executes.
Suppose you made 4 UDAF Functions whose names begin with "foobar".
Each Function runs at the following points:

   •   foobar – where Map’s aggregation is done (The combine step, to be precise)
   •   foobar.partial – Where the aggregated result is sent to reduce
   •   foobar.merge – Where Reduce’s aggregation is done
   •   foobar.terminate – Where Reduce is terminated.


foobar and foobar.merge do similar things, and both must take 2 arguments.
foobar.partial and foobar.terminate are also similar, but each takes only
1 argument, as the skeleton below shows.
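
A minimal runnable skeleton of the naming rule, assuming the prefix
"foobar" and a simple numeric sum (an illustrative sketch modeled on the
tutorial's own sumAllColumns, not the tutorial's code):

foobar <- function(prev, values) {        # 2 arguments: per-map accumulation
  if (is.null(prev)) prev <- 0.0
  prev + values
}
foobar.partial <- function(values) {      # 1 argument: emit the partial aggregate
  values
}
foobar.merge <- function(prev, values) {  # 2 arguments: merge partials in reduce
  if (is.null(prev)) prev <- 0.0
  prev + values
}
foobar.terminate <- function(values) {    # 1 argument: produce the final value
  values
}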

Now you might still have questions as to how these Functions work.
To understand this, you must grasp the flow of their execution.

   •   foobar - combine aggregation (map)
   •   foobar.partial - combine return (map)
   •   foobar.merge - reduce aggregation
   •   foobar.terminate - reduce terminate


The 2 combine steps may be skipped.
Unless you control the configuration precisely, it is difficult to predict
when these Functions will be skipped and when they will run.
Thus implementing all 4 Functions is a must.
For more advanced know-how and a complete understanding, peruse the Hive
and Hadoop documentation.

If you want to forgo a complete understanding and just want to know how to
use it, remember this: to use RHive's UDAF support, you need to make 4
Functions and follow the naming rules.

RHive - UDAF 3
This section covers the workings and arguments of the 4 UDAF Functions.
From the aforementioned sumAllColumns Functions, take a look at
sumAllColumns and sumAllColumns.merge.
The two Functions have the same code.

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

These Functions process 2 arguments.
The first argument is the value returned by the previous invocation of
sumAllColumns or sumAllColumns.merge.
The second argument holds the values of the record Hive hands over.
Both arguments are in fact vectors (or lists).

Because the record is handed over as a vector, you need to remember the
order of the columns as written in the SQL syntax; the sketch below
illustrates.
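
A hedged sketch (the Function below is illustrative, not the tutorial's
code): with RA('sumAllColumns', sepallength, sepalwidth, petallength,
petalwidth), values[1] is sepallength, values[2] is sepalwidth, and so on.

# Hypothetical sketch: the incoming vector is positional, not named.
weightedSum <- function(prev, values) {
  if (is.null(prev)) {
    prev <- 0.0
  }
  # values[1] = sepallength, values[2] = sepalwidth in the call above
  prev + 2.0 * values[1] + values[2]
}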

The "prev" handed over as the first argument is the value returned from the
previous step.
So when the Function is run for the first time, it receives NULL.
Thus within the Function you can use is.null to check for that first run.

if (is.null(prev)) {
  prev <- rep(0.0, length(values))
}

The value accumulated in this way over all the Records of a group is then
passed as the argument to sumAllColumns.partial and
sumAllColumns.terminate.
The two Functions' codes are identical.

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.terminate <- function(values) {
  values
}

These two Functions do not run recursively, so they receive only one
argument. And the result of sumAllColumns.terminate is sent to Hive, where
it forms one column value.
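
As a hedged aside: since a returned vector surfaces in Hive as one
comma-joined character value (as the X_c1 output above shows), a terminate
that collapses the vector to a scalar would presumably yield a single
number per group instead. An illustrative sketch:

# Hypothetical sketch: collapsing the per-group vector to one scalar,
# so the resulting Hive column would hold a single number per group.
sumAllColumns.terminate <- function(values) {
  sum(values)
}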

In the above example you can see pairs of Functions with identical code,
but that is only because this tutorial keeps the example simple. In actual
practice the 4 Functions' codes may all differ depending on context.
Even if all 4 Functions have identical code, there must always be 4
Functions.

RHive - UDTF
This section is devoted to streamlining the code you have written so far
into something more graceful.
It is quite difficult to handle a result that comes out as a single string
value.
It needs to be split into its individual constituents, and the unfold
Function, a UDTF supported by RHive, is suited for this.

The following is the result of running the SQL syntax shown in the previous
example.

print(result)
     species
1     setosa
2 versicolor
3  virginica
                                                                         X_c1
1  250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2                            296.8,138.50000000000003,212.99999999999997,66.3
3               329.3999999999999,148.7,277.59999999999997,101.29999999999998

The 2nd column, X_c1, is a value made by the UDAF, and it is of character
type.
You can also see the values are separated by commas.
To turn this back into a numeric vector, R Functions like strsplit must be
used. That works fine for a small number of Records, but a problem arises
otherwise.
The example above has only 3 Records, but when applying the same procedure
to big tables you might encounter millions of Records.
Hence the values returned by the UDAF should be split into column values on
the Hive side.
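
For a small result like the one above, a hedged client-side sketch of the
strsplit approach (for illustration only; it does not scale):

# Hypothetical sketch: parse one group's comma-joined string in R.
as.numeric(strsplit(result$X_c1[1], ",")[[1]])
# [1] 250.3 171.4  73.1  12.3   (approximately)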

To do this you need a subquery and a UDTF; the code, altered to use the
UDTF, is as follows:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query(
  "SELECT unfold(dummytable.dummycolumn, 0.0, 0.0, 0.0, 0.0, ',')
            AS (sepallength, sepalwidth, petallength, petalwidth)
   FROM (
     SELECT RA('sumAllColumns', sepallength, sepalwidth,
               petallength, petalwidth) AS dummycolumn
     FROM iris
     GROUP BY species
   ) dummytable")

print(result)
  sepallength sepalwidth petallength petalwidth
1       250.3      171.4        73.1       12.3
2       296.8      138.5       213.0       66.3
3       329.4      148.7       277.6      101.3
The SQL syntax became a bit more complex compared to prior examples, but in
the final result you can see that the UDAF return values are all split into
columns by the "unfold" UDTF.
unfold is a UDTF supplied by RHive, so there is no need to write separate R
code for it.
Note that Hive requires a UDTF to be used alone in the SELECT clause.
This is not solvable in RHive, because it depends on Hive.

The examples so far performed Map/Reduce only once.
If you are using RHive for a very complex task, you may need multiple
Map/Reduce passes.
This is common in plain Map/Reduce implementations and Hadoop streaming
implementations as well.
If you chain such Map/Reduce passes, you may need to create temporary
tables to save intermediate results and delete them later, as sketched
below.
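
A hedged sketch of that pattern, assuming the iris table from above (the
table name and the CREATE TABLE ... AS SELECT statement are illustrative;
consult the Hive manual for the exact DDL your version supports):

# Hypothetical sketch: persist the first pass, query it, then clean up.
rhive.query("CREATE TABLE iris_sums AS
             SELECT RA('sumAllColumns', sepallength, sepalwidth,
                       petallength, petalwidth) AS dummycolumn
             FROM iris
             GROUP BY species")

# ... a second Map/Reduce pass can now read from iris_sums ...

rhive.query("DROP TABLE iris_sums")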

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

RHive - advanced UDF and UDAF functions in R for Hive

Anything, including data frames, is viable as well.
library(MASS)

> head(cats)
  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6
> rhive.assign("cats", cats)
[1] TRUE

Objects and data frames can also be used with rhive.assign. But be careful: once such objects are deployed to the distributed environment, they exist there as local memory on each node, not as shared memory.

rhive.export

rhive.export prepares the objects made in R by actually deploying them. It is used together with rhive.assign, and the objects it receives as arguments must already have been made ready for deployment via rhive.assign. You can easily learn how it is used from the following example.

sum3values <- function(a, b, c) {
  a + b + c
}

rhive.assign("sum3values", sum3values)

rhive.export(sum3values)

Aside from the first argument, exportname, rhive.export has several other arguments. These specify separately assignable servers and ports for use in a complex setup, so normal circumstances do not require them. If you need to set up an environment complex enough to warrant them, consult the manual or contact the RHive development team.
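One point this tutorial does not spell out: an R Function that calls another R Function presumably needs that helper deployed as well, since each node only loads what has been assigned and exported to it. The following is a minimal sketch of that pattern under this assumption; trim and cleanName are hypothetical names, and an open RHive connection is assumed.

# hypothetical helper that strips surrounding whitespace from a string
trim <- function(x) {
  gsub("^\\s+|\\s+$", "", x)
}

# hypothetical Function that depends on the helper above
cleanName <- function(x) {
  toupper(trim(x))
}

# assign and export both the Function and its helper,
# so the nodes can resolve trim when cleanName runs
rhive.assign("trim", trim)
rhive.assign("cleanName", cleanName)
rhive.export(trim)
rhive.export(cleanName)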
rhive.exportAll

Although rhive.exportAll is functionally similar to rhive.export, the difference is that rhive.exportAll deploys all symbols whose names start with the string given as its first argument. This Function exists to deploy UDAF Functions written in R: a UDAF written in R must consist of 4 Functions starting with the same name, and rhive.exportAll is a variant of rhive.export made to deploy these easily.

The following example makes the 4 Functions and deploys them to a distributed environment.

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

The last line is equivalent to the following:

# rhive.exportAll("sumAllColumns")

rhive.export("sumAllColumns")
rhive.export("sumAllColumns.partial")
rhive.export("sumAllColumns.merge")
rhive.export("sumAllColumns.terminate")

rhive.assign, rhive.export, and rhive.exportAll only go as far as preparing the Functions and Objects and sending them to the Rserve instances in the distributed environment. In order to actually use these R Functions and Objects, you need to write them into the SQL syntax passed to rhive.query. The following examples show how to run the deployed Functions; the short sketch below previews the workflow.
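As a preview, the whole define, assign, export, and query workflow fits into one short sketch. Everything here is hypothetical: double1 is a placeholder Function, sometable and somecolumn are placeholder names, and the R() syntax in the commented query is explained in the next section.

double1 <- function(x) {
  x * 2
}

rhive.assign("double1", double1)   # 1. make the name resolvable for deployment
rhive.export(double1)              # 2. deploy to the Rserve on each node

# 3. run it from Hive SQL; the R() Function is covered below
# rhive.query("SELECT R('double1', somecolumn, 0.0) FROM sometable")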
RHive - UDF usage

Now you must prepare a table for applying UDFs. If there is no suitable table in Hive, create one. This tutorial converts USArrests, a small dataset which R provides by default, into a Hive table. Convert and save it like below.

rhive.write.table(USArrests)

rhive.query("SELECT * FROM USArrests LIMIT 10")
       rowname murder assault urbanpop rape
1      Alabama   13.2     236       58 21.2
2       Alaska   10.0     263       48 44.5
3      Arizona    8.1     294       80 31.0
4     Arkansas    8.8     190       50 19.5
5   California    9.0     276       91 40.6
6     Colorado    7.9     204       78 38.7
7  Connecticut    3.3     110       77 11.1
8     Delaware    5.9     238       72 15.8
9      Florida   15.4     335       80 31.9
10     Georgia   17.4     211       60 25.8

The next example counts the total number of records in the USArrests table.

rhive.query("SELECT COUNT(*) FROM USArrests")
  X_c0
1   50

The COUNT Function used above is a standard SQL Function and is also a Hive UDF; users familiar with SQL will know it as one of the most used Functions in SQL. Now you will see how to write and run an R Function in the same positions where UDFs like COUNT appear.

First, look at the USArrests table's description:

rhive.desc.table("USArrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double
Now we'll make and execute a Function that gets the total sum of the values of the murder, assault, and rape columns. The following is the entire code for that.

library(RHive)
rhive.connect()

sumCrimes <- function(column1, column2, column3) {
  column1 + column2 + column3
}

rhive.assign("sumCrimes", sumCrimes)
rhive.export(sumCrimes)

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests")

rhive.close()

The results are as follows.

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes FROM usarrests")
          rowname urbanpop crimes
1         Alabama       58  270.4
2          Alaska       48  317.5
3         Arizona       80  333.1
...
48  West Virginia       39   96.0
49      Wisconsin       66   66.4
50        Wyoming       60  183.4
The important thing about the example above is the Function named R() written inside the SQL syntax. This is not an R Function but a Hive UDF; more precisely, it is a Function that RHive adds onto Hive so that R Functions can be processed inside Hive, in the same positions where UDFs like sum, avg, or min appear. The R() Function calls the R Function named by its first argument, receives the returned value, and hands it to Hive.

To explain the R() Function used in the SQL syntax below:

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests

'sumCrimes' is the name of the R Function deployed by rhive.export, and the murder, assault, and rape that follow it are column names of the usarrests Hive table. The last argument passed into R(), 0.0, declares the type of the value the R() Function will return: enter 0.0 if the sumCrimes Function returns a numeric value, and enter "" if it returns a character value. For example, if the R Function returned a value of the character type, you would write:

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, \"\") FROM usarrests")

In the end, the syntax below returns a result composed of 3 columns, just as seen in the result above.

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests

Note that the R() Function returns only 1 value in this particular example: looking at the SQL results above, you can see that R() ends up creating exactly one new column value per record.
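To make the return-type argument concrete, here is a minimal sketch of a character-returning Function. crimeLevel and its threshold are hypothetical; the usarrests table and an open connection from the examples above are assumed.

# hypothetical Function returning a character label instead of a number
crimeLevel <- function(murder) {
  if (murder > 10) "HIGH" else "LOW"
}

rhive.assign("crimeLevel", crimeLevel)
rhive.export(crimeLevel)

# the last argument is "" because crimeLevel returns a character value
rhive.query("SELECT rowname, R('crimeLevel', murder, \"\") FROM usarrests")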
Actually, the SQL syntax used above can be processed with nothing more than plain Hive SQL. The two syntaxes below produce the same results.

RHive UDF SQL:

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes FROM usarrests")

Hive SQL:

rhive.query("SELECT rowname, urbanpop, murder + assault + rape AS crimes FROM usarrests")

This tutorial deliberately uses a trivial example for the sake of easy learning. Hive supports many UDFs and column arithmetic operations out of the box, so if you use RHive to process massive data and Hive SQL already supports a solution, it is recommended to use that solution. The following URL contains the relevant details.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

If Hive does not support the feature you need and you must perform complex calculations over multiple columns, data mining, or machine learning, then the RHive UDF feature becomes very useful for such tasks, as the sketch below suggests.
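For instance, Hive SQL has no built-in equivalent of most of R's statistical functions. The following hedged sketch uses R's pnorm to attach a normal-distribution value to each record of the usarrests table; murderPercentile is a hypothetical name, and the mean and standard deviation are placeholder constants you would normally compute from the data first.

# hypothetical Function computing a statistic plain Hive SQL lacks;
# 7.8 and 4.4 are placeholder parameters, not values computed here
murderPercentile <- function(murder) {
  pnorm(murder, mean = 7.8, sd = 4.4)
}

rhive.assign("murderPercentile", murderPercentile)
rhive.export(murderPercentile)

rhive.query("SELECT rowname, R('murderPercentile', murder, 0.0) FROM usarrests")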
RHive - UDAF 1

Similar to the UDF, there is the UDAF. A UDAF exists to support aggregation: it is a Function that processes data grouped by column through the GROUP BY clause of an SQL statement. Like the UDF, adding a new UDAF to Hive normally requires writing a Java module, but RHive enables the use of the R language in creating them.

It was not pointed out at the time, but you have already seen 4 UDAF Functions named sumAllColumns. The Functions starting with that name together add up all the values of the inputted arguments. You may find yourself asking several questions, the first perhaps being "Why does a UDAF simultaneously need 4 Functions?" and the second "Where can a UDAF be used?" To understand this, it is best to look at runnable code.

First, in order to make a table suitable for applying the UDAF Functions, take the iris data which R's datasets package contains by default and upload it to Hive.

rhive.write.table(iris)
[1] "iris"

> rhive.list.tables()
       tab_name
1         aids2
2          iris
3 new_usarrests
4     usarrests

rhive.desc.table("iris")
     col_name data_type comment
1     rowname    string
2 sepallength    double
3  sepalwidth    double
4 petallength    double
5  petalwidth    double
6     species    string

rhive.query("SELECT * FROM iris LIMIT 10")
   rowname sepallength sepalwidth petallength petalwidth species
1        1         5.1        3.5         1.4        0.2  setosa
2        2         4.9        3.0         1.4        0.2  setosa
3        3         4.7        3.2         1.3        0.2  setosa
4        4         4.6        3.1         1.5        0.2  setosa
5        5         5.0        3.6         1.4        0.2  setosa
6        6         5.4        3.9         1.7        0.4  setosa
7        7         4.6        3.4         1.4        0.3  setosa
8        8         5.0        3.4         1.5        0.2  setosa
9        9         4.4        2.9         1.4        0.2  setosa
10      10         4.9        3.1         1.5        0.1  setosa

This gives a general view of how the iris data is composed. Now we shall group the records that share the same value in the species column and get the sum of each column's values within each group. The complete code, together with the result of running it, is as follows:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query("
  SELECT species, RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
  FROM iris
  GROUP BY species")

print(result)
     species                                                                         X_c1
1     setosa   250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2 versicolor                             296.8,138.50000000000003,212.99999999999997,66.3
3  virginica             329.3999999999999,148.7,277.59999999999997,101.29999999999998
If you take a look at the printed results above, you will see 3 newly created records, each with 2 columns.

The first thing to note is the RA() Function inside the SQL syntax sent via rhive.query. It is similar to the R() Function, but RA() returns only one value per group, and that value is always of the character type; Hive processes the returned results and finally sends them back to RHive. One thing to beware of: RA() is a UDAF, so you must use SQL's GROUP BY clause along with it. Peruse the Hive documentation for further details.

To explain the SQL syntax above in detail:

SELECT species, RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
FROM iris
GROUP BY species

It uses GROUP BY to gather the records into groups according to the values of the species column, and performs a separate calculation for each group. The first argument, sumAllColumns, is the common prefix of the 4 UDAF Functions; the rest are the columns to be processed. The RA Function then returns one value computed from all those arguments for each group.

By now, some of your earlier questions are probably answered, and perhaps new ones have been raised. Those yet to be explained are covered in the next section.

RHive - UDAF 2

Now we'll talk about the 4 UDAF R Functions used in the previous examples and the constraints on them. A total of 4 Functions were made for the UDAF:

• sumAllColumns
• sumAllColumns.partial
• sumAllColumns.merge
• sumAllColumns.terminate
UDAF Functions for RHive must share a common prefix: one Function has no suffix, and the other 3 end with .partial, .merge, and .terminate. This naming rule must be kept in RHive. The reason 4 Functions are required is that there are 4 points in the Map/Reduce procedure where Hive can run them while a Hive UDAF executes.

Suppose you made 4 UDAF Functions that begin with the name "foobar". Each Function will run in the following locations:

• foobar – where the Map-side aggregation is done (the combine step, to be precise)
• foobar.partial – where the aggregated result is sent to Reduce
• foobar.merge – where the Reduce-side aggregation is done
• foobar.terminate – where Reduce is terminated

foobar and foobar.merge do similar things, and both must take 2 arguments. foobar.partial and foobar.terminate are also similar, but each takes only 1 argument. To understand how these Functions work, you must grasp the flow of their execution:

• foobar – combine aggregation (Map)
• foobar.partial – combine return (Map)
• foobar.merge – reduce aggregation
• foobar.terminate – reduce terminate

The 2 combine steps may be skipped. Unless your configuration is exact, it is difficult to predict when these Functions are skipped or run; thus making all 4 Functions is a must. For a more advanced and complete understanding, peruse the Hive and Hadoop documents. If you want to forgo complete understanding and just want to know how to use the feature, remember this: in order to use RHive's UDAF support, you need to make 4 Functions and follow the rules above. The local simulation sketched below may also help make the flow concrete.
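As an aid to intuition only, the following plain R snippet simulates roughly how Hive might drive the 4 sumAllColumns Functions over a few iris-like records split across two hypothetical mappers. This is an illustration run locally in R, not code that RHive executes.

# mapper 1 accumulates two records in the combine step
m1 <- sumAllColumns(NULL, c(5.1, 3.5, 1.4, 0.2))
m1 <- sumAllColumns(m1,   c(4.9, 3.0, 1.4, 0.2))
p1 <- sumAllColumns.partial(m1)      # partial result handed to Reduce

# mapper 2 accumulates one record
m2 <- sumAllColumns(NULL, c(4.7, 3.2, 1.3, 0.2))
p2 <- sumAllColumns.partial(m2)

# the reducer merges the partial results and terminates
r <- sumAllColumns.merge(NULL, p1)
r <- sumAllColumns.merge(r, p2)
sumAllColumns.terminate(r)           # 14.7 9.7 4.1 0.6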
RHive - UDAF 3

This section covers the workings and arguments of the 4 UDAF Functions. From the aforementioned sumAllColumns Functions, take a look at sumAllColumns and sumAllColumns.merge. The two Functions have the same code.

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

These Functions process 2 arguments. The first argument is the value previously returned by sumAllColumns or sumAllColumns.merge itself; the second argument is the record values that Hive hands over. Both arguments are in fact vectors (or lists): because the values are handed over as a vector, you need to remember the order of the columns written in the SQL syntax.

The prev handed over as the first argument is the value returned from the previous step, so when the Function is first run, it receives NULL. Thus within the Function you can use is.null to check for this:

if (is.null(prev)) {
  prev <- rep(0.0, length(values))
}

The values recursively accumulated up to the last record are then sent over as the arguments of sumAllColumns.partial and sumAllColumns.terminate.
The codes of those two Functions are also identical.

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.terminate <- function(values) {
  values
}

These two Functions do not run recursively, and therefore each receives only one argument. The result of sumAllColumns.terminate is sent to Hive and forms one column value.

In the example above you see 2 pairs of Functions with identical code, but this is only because the example was made simple for the sake of the tutorial. In actual practice, the 4 Functions' codes may all differ depending on context; and even if all 4 Functions have identical code, there must always be 4 Functions. The sketch below shows what 4 differing bodies might look like.
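As a hedged illustration of 4 genuinely different bodies, here is a sketch of a UDAF that would compute per-group means by carrying the column sums plus a record count through the steps. meanAllColumns is a hypothetical name, and the sketch assumes the numeric state vector passes between the steps just as it does in the sumAllColumns example.

meanAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values) + 1)  # the last slot holds the record count
  }
  prev + c(values, 1)
}

meanAllColumns.partial <- function(state) {
  state                                    # hand the sums and the count to Reduce
}

meanAllColumns.merge <- function(prev, state) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(state))
  }
  prev + state
}

meanAllColumns.terminate <- function(state) {
  n <- state[length(state)]
  state[-length(state)] / n                # divide each column's sum by the count
}

rhive.assign("meanAllColumns", meanAllColumns)
rhive.assign("meanAllColumns.partial", meanAllColumns.partial)
rhive.assign("meanAllColumns.merge", meanAllColumns.merge)
rhive.assign("meanAllColumns.terminate", meanAllColumns.terminate)
rhive.exportAll("meanAllColumns")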
RHive - UDTF

This section is devoted to streamlining the code you've written so far into something more graceful. A result that comes out as a single string value is quite difficult to handle; it needs to be split into its individual constituents, and the unfold Function, a UDTF supported by RHive, is the proper tool for this.

Recall the result of running the SQL syntax shown in a previous example:

print(result)
     species                                                                         X_c1
1     setosa   250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2 versicolor                             296.8,138.50000000000003,212.99999999999997,66.3
3  virginica             329.3999999999999,148.7,277.59999999999997,101.29999999999998

The 2nd column, X_c1, is a value made by the UDAF; it is of the character type, with the individual values separated by ","s. To turn this back into numeric values, R Functions like strsplit would have to be used. That causes no problems while the number of records is small, but a problem occurs otherwise: the example above has only 3 records, yet applying the same procedure to big tables may mean millions of records. Hence the values returned by the UDAF should be split into separate column values within Hive itself. Doing this requires a subquery and the UDTF, and the code altered to use the UDTF is as follows:

sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}

sumAllColumns.partial <- function(values) {
  values
}

sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}

rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)

rhive.exportAll("sumAllColumns")

result <- rhive.query(
  "SELECT unfold(dummytable.dummycolumn, 0.0, 0.0, 0.0, 0.0, ',')
          AS (sepallength, sepalwidth, petallength, petalwidth)
   FROM (
       SELECT RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth) AS dummycolumn
       FROM iris
       GROUP BY species
       ) dummytable")

print(result)
  sepallength sepalwidth petallength petalwidth
1       250.3      171.4        73.1       12.3
2       296.8      138.5       213.0       66.3
3       329.4      148.7       277.6      101.3
The SQL syntax became a bit more complex compared to the prior examples, but in the final result you can see that the UDAF return values have all been split into columns by the unfold UDTF. In the call above, the first argument is the column to be split, the four 0.0 arguments declare the types of the four output columns in the same way as R()'s last argument, and ',' is the delimiter to split on.

unfold is a UDTF supported by RHive itself, so there is no need to write separate R code for it. Note also that a Hive UDTF must be used alone in the SELECT clause; this is a constraint of Hive, not something RHive can resolve.

The examples so far have all performed Map/Reduce only once. If you use RHive for a very complex task, you may need multiple Map/Reduce passes, as is often seen in plain Map/Reduce implementations or Hadoop streaming implementations. If you chain Map/Reduces like this, you may need to make temporary tables to save intermediate results and then delete them later, as sketched below.
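A minimal sketch of that pattern, assuming an open connection: tmp_iris_sums is a hypothetical table name, and CREATE TABLE ... AS SELECT is standard HiveQL.

# first pass: save the aggregated result into a temporary table
rhive.query("
  CREATE TABLE tmp_iris_sums AS
  SELECT species, RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth) AS sums
  FROM iris
  GROUP BY species")

# later passes read tmp_iris_sums like any other Hive table
result <- rhive.query("SELECT * FROM tmp_iris_sums")

# clean up when the whole chain is finished
rhive.query("DROP TABLE tmp_iris_sums")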