RHive - advanced UDF and UDAF functions in R for Hive
RHive tutorial - advanced functions
RHive supports basic functions, HDFS functions, "apply"-like functions for map/reduce, and advanced functions that support UDF, UDAF, and UDTF.
The advanced functions are the UDF/UDAF functions used to implement Hive's UDF and UDAF in R.
With these functions you can apply R down to the lower levels of complexity and perform Map/Reduce programming that uses Hadoop and Hive. You can also use R to handle a large portion of the complex algorithms and data-processing procedures in data mining.
RHive - UDF and UDAF functions
Hive provides UDFs (User Defined Functions), UDAFs (User Defined Aggregate Functions), and UDTFs (User Defined Table-generating Functions).
UDF refers to functions built into Hive's SQL syntax, such as count, avg, min, and max, which perform calculations over one or more columns.
Hive comes with support for several UDFs and allows users to add more.
Originally, though, UDFs, UDAFs, and UDTFs must all be written in Java and cannot be made directly in the R language.
RHive lets you write UDFs and UDAFs in the R language and make them usable in Hive, and it also provides a ready-made UDTF that you can use to split a complex column into the appropriate columns.
This tutorial explains how to implement these UDFs and UDAFs through examples using RHive.
rhive.assign
The rhive.assign function assigns the functions and variables made in R so that they can be referenced from Hive.
Such assignment is required because we need to select which variables and functions will be used in the distributed environment.
One thing to be aware of: rhive.assign only performs the essential preparation before the functions, variables, and objects are deployed to the distributed environment; it does not deploy them itself.
It is just preparation, and it is normally used together with rhive.export or rhive.exportAll, which perform the actual deployment.
Deployment here means, as in Hadoop's Map/Reduce, putting the functions and objects that Hive will use on standby in the job nodes, to be loaded whenever they are needed for processing data.
The rhive.assign function takes 2 arguments: the first is the character-type alias of the symbol created in R, and the other is the symbol to be deployed.
If the symbol that will be deployed to the distributed environment is named "newsum", use rhive.assign to assign it like this:

newsum <- function(value) {
  value + 1
}
rhive.assign("newsum", newsum)
The syntax may seem a bit odd, but this is due to structural constraints in R and will be improved in the future.
The first argument just needs to be the symbol passed as the second argument, written as a string.
Normally, passing a string identical to the symbol name makes things easier, but this is not an obligation. However, if you do change the name, keep in mind that the assigned symbol will go by that new name during use; to avoid confusion, it is best not to change it.
Look at the following example to see how rhive.assign is used.
It makes a function, "sum3values", and uses rhive.assign to prepare it for the distributed environment.

sum3values <- function(a, b, c) {
  a + b + c
}
rhive.assign("sum3values", sum3values)
[1] TRUE
You can also assign objects that are not functions.

coef1 <- 3.141593
rhive.assign("coef1", coef1)

Anything, including data frames, is viable as well.
library(MASS)
> head(cats)
  Sex Bwt Hwt
1   F 2.0 7.0
2   F 2.0 7.4
3   F 2.0 9.5
4   F 2.1 7.2
5   F 2.1 7.3
6   F 2.1 7.6
> rhive.assign("cats", cats)
[1] TRUE
Objects and data frames can also be used with rhive.assign.
But be careful: once deployed to the distributed environment, these objects exist there as local memory on each node, not as shared memory.
rhive.export
rhive.export actually deploys the objects made in R.
rhive.export is used along with rhive.assign, and the object names it receives as arguments must already have been made ready for deployment via rhive.assign.
You can easily learn how it is used from the following example.
sum3values <- function(a, b, c) {
  a + b + c
}
rhive.assign("sum3values", sum3values)
rhive.export(sum3values)
Aside from the first argument, exportname, rhive.export has several other arguments.
These specify servers and ports that can be assigned separately for use in complex environments, so under normal circumstances they are not required.
If you need to set up an environment complex enough to warrant them, consult the manual or contact the RHive development team.
rhive.exportAll
rhive.exportAll is functionally similar to rhive.export; the difference is that rhive.exportAll deploys all symbols whose names start with the string given as its first argument.
It mainly serves to deploy UDAF functions written in R: such UDAFs consist of 4 functions starting with the same name, and rhive.exportAll is a variant of rhive.export made to deploy them easily.
The following example defines 4 functions and deploys them to the distributed environment.
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.partial <- function(values) {
  values
}
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}
rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)
rhive.exportAll("sumAllColumns")
The last line is equivalent to the following:

# rhive.exportAll("sumAllColumns")
rhive.export("sumAllColumns")
rhive.export("sumAllColumns.partial")
rhive.export("sumAllColumns.merge")
rhive.export("sumAllColumns.terminate")
rhive.assign, rhive.export, and rhive.exportAll only go as far as preparing the functions and objects and sending them to the Rserve instances in the distributed environment.
To actually use these R functions and objects, you need to write them into the SQL syntax passed to rhive.query.
The following examples show how to run the deployed functions.
RHive - UDF usage
Now you must prepare a table to apply UDFs to.
If there is no suitable table in Hive, create one.
This tutorial converts USArrests, a small data set which R includes by default, into a Hive table.
Convert USArrests into a Hive table and save it, like below:
rhive.write.table(USArrests)
rhive.query("SELECT * FROM USArrests LIMIT 10")
       rowname murder assault urbanpop rape
1      Alabama   13.2     236       58 21.2
2       Alaska   10.0     263       48 44.5
3      Arizona    8.1     294       80 31.0
4     Arkansas    8.8     190       50 19.5
5   California    9.0     276       91 40.6
6     Colorado    7.9     204       78 38.7
7  Connecticut    3.3     110       77 11.1
8     Delaware    5.9     238       72 15.8
9      Florida   15.4     335       80 31.9
10     Georgia   17.4     211       60 25.8
The next example counts the total number of records in the USArrests table.

rhive.query("SELECT COUNT(*) FROM USArrests")
  X_c0
1   50
The COUNT function used in the above example is a standard SQL function and is also a Hive UDF.
Users familiar with SQL will know it as one of the most-used built-in SQL functions.
Now you will see how to write and run a function in R that does the same kind of work the COUNT function does.
First look at the USArrests table's description, like below.

rhive.desc.table("USArrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double
Now we'll make and execute a function that, for each record, sums the values of the murder, assault, and rape columns.
The following is the entire code for that.
library(RHive)
rhive.connect()
sumCrimes <- function(column1, column2, column3) {
  column1 + column2 + column3
}
rhive.assign("sumCrimes", sumCrimes)
rhive.export("sumCrimes")
rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests")
rhive.close()
The results are as follows.

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes FROM usarrests")
         rowname urbanpop crimes
1        Alabama       58  270.4
2         Alaska       48  317.5
3        Arizona       80  333.1
...
48 West Virginia       39   96.0
49     Wisconsin       66   66.4
50       Wyoming       60  183.4
The important thing about the example above is the function named R() written inside the SQL syntax.
This function is a Hive UDF, not an R function.
More precisely, it is a function RHive adds to Hive so that RHive can run R functions within Hive.
The R() function is used like functions such as sum, avg, or min: it calls the R function named by its first argument, receives the returned values, and sends them to Hive.
So, to explain the R() function used in the SQL syntax below:
SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests
'sumCrimes' is the name of the R function deployed by rhive.export, and the murder, assault, and rape that follow it are column names of the USArrests Hive table.
And 0.0, the last argument passed to the R() function, indicates the type of value the R function will return: enter 0.0 if sumCrimes will return a numeric value, and an empty string if it will return a character value.
For example, if the R function sumCrimes returned a value of character type, you would enter the following:
rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, '') FROM usarrests")
In the end, the syntax below returns a result composed of 3 columns, just as seen in the result above.

SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests
Note that in this particular example the R() function returns only 1 value.
That is, looking at the SQL results above, you can see the R() function ends up creating one new column value per row.
Actually, the SQL used above can be handled with nothing more than plain Hive SQL; the two queries below produce the same results.
RHive UDF SQL

rhive.query("SELECT rowname, urbanpop, R('sumCrimes', murder, assault, rape, 0.0) FROM usarrests")

Hive SQL

rhive.query("SELECT rowname, urbanpop, murder + assault + rape AS crimes FROM usarrests")
This tutorial deliberately uses a trivial example for the sake of easy learning.
Hive already supports many UDFs and column arithmetic.
If you use RHive to process massive data and Hive SQL already supports a solution, it is recommended to use that solution.
The following URL contains the relevant details.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
If Hive does not support the feature and you still need to perform complex calculations over multiple columns, data mining, or machine learning, RHive's UDF feature is very useful for such tasks.
RHive - UDAF 1
Similar to UDF, there is something called UDAF.
UDAF exists to support aggregation: it is a function that processes data grouped by column in the GROUP BY clause of SQL syntax.
Like UDFs, adding new UDAFs to Hive normally requires writing a Java module, but RHive enables the use of the R language in creating them.
It wasn't pointed out at the time, but you have already seen 4 UDAF functions called sumAllColumns: the functions starting with that name add up the values of the arguments they receive.
You may find yourself asking several questions. The first may be, "Why does a UDAF simultaneously need 4 functions?" and the second, "Where can a UDAF be used?"
To understand this best, it's better to take a look at the runnable code.
First, in order to make a table suitable for the functions made for UDAF, take the iris data, which R's datasets package contains by default, and upload it to Hive.

rhive.write.table(iris)
[1] "iris"
> rhive.list.tables()
       tab_name
1         aids2
2          iris
3 new_usarrests
4     usarrests
rhive.desc.table("iris")
     col_name data_type comment
1     rowname    string
2 sepallength    double
3  sepalwidth    double
4 petallength    double
5  petalwidth    double
6     species    string
rhive.query("SELECT * FROM iris LIMIT 10")
   rowname sepallength sepalwidth petallength petalwidth species
1        1         5.1        3.5         1.4        0.2  setosa
2        2         4.9        3.0         1.4        0.2  setosa
3        3         4.7        3.2         1.3        0.2  setosa
4        4         4.6        3.1         1.5        0.2  setosa
5        5         5.0        3.6         1.4        0.2  setosa
6        6         5.4        3.9         1.7        0.4  setosa
7        7         4.6        3.4         1.4        0.3  setosa
8        8         5.0        3.4         1.5        0.2  setosa
9        9         4.4        2.9         1.4        0.2  setosa
10      10         4.9        3.1         1.5        0.1  setosa
This gives you a general view of how the iris data is composed.
Now we shall take the rows with the same value in the species column, group them, and get the sum of each column's values within each group.
The complete code, after running, should produce this:
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.partial <- function(values) {
  values
}
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}
rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)
rhive.exportAll("sumAllColumns")
result <- rhive.query("
  SELECT species,
         RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
  FROM iris
  GROUP BY species")
print(result)
     species                                                                        X_c1
1     setosa  250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2 versicolor                            296.8,138.50000000000003,212.99999999999997,66.3
3  virginica               329.3999999999999,148.7,277.59999999999997,101.29999999999998
If you take a look at the printed results, you will see 3 newly created records, each with 2 columns.
The first thing to note is the RA() function in the SQL syntax sent via rhive.query.
It is similar to the R() function, but RA() returns only one value, and that value is always of character type; Hive processes the returned results and finally sends them to RHive.
One thing to be aware of: the RA() function is a UDAF, so you must use SQL's GROUP BY clause along with it.
Consult the Hive documentation for further details.
To explain the SQL syntax above in detail:
To explain the SQL syntax above in detail:
SELECT
species,
RA('sumAllColumns',
sepallength,
sepalwidth,
petallength,
petalwidth)
FROM
iris
GROUP
BY
species
It makes a separate calculation for each group of rows that "GROUP BY" aggregates on the "species" column.
The first argument, sumAllColumns, is the common prefix of the 4 UDAF functions; the rest are the columns to be processed, and the RA function returns a value computed from all of them.
By now, some questions may remain unanswered and new ones may have been raised. Those yet to be explained will be covered in the next section.
RHive - UDAF 2
Now we'll talk about the 4 UDAF R functions used in the previous examples and their constraints.
A total of 4 functions were made for the UDAF, and they are:
• sumAllColumns
• sumAllColumns.partial
• sumAllColumns.merge
• sumAllColumns.terminate
UDAF functions for RHive must have a common prefix: of the 4 functions, three end with .partial, .merge, and .terminate, and one carries no suffix.
This naming rule must be kept in RHive.
The reason 4 functions are required is that there are 4 points in the Map/Reduce procedure where the functions can run while a Hive UDAF is executing.
Suppose you made 4 UDAF functions that begin with the name "foobar".
Each function will run at the following point:
• foobar – where the map-side aggregation is done (the combine step, to be precise)
• foobar.partial – where the aggregated result is sent to reduce
• foobar.merge – where the reduce-side aggregation is done
• foobar.terminate – where reduce is terminated
foobar and foobar.merge do similar things: they each have to take 2 arguments.
foobar.partial and foobar.terminate are also similar, but they each have to take only 1 argument.
You might still have questions as to how these functions work.
To gain an understanding of this, you must grasp the flow of their operation:
• foobar - combine aggregation (map)
• foobar.partial - combine return (map)
• foobar.merge - reduce aggregation
• foobar.terminate - reduce terminate
The 2 combine steps may be skipped.
Unless your configuration is exact, it is difficult to figure out when these functions are skipped and when they run, so defining all 4 functions is a must.
For more advanced know-how and a complete understanding, consult the Hive and Hadoop documentation.
If you want to forgo complete understanding and just want to know how to use it, remember this: to use RHive's UDAF support, you need to make 4 functions and follow the naming rules.
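The flow can be simulated locally in plain R. The sketch below is an illustration only, not RHive code: the "map tasks" and driver loop are hypothetical stand-ins for what Hive does, and the real execution order inside Hive depends on the combine configuration. It folds rows through the 4 sumAllColumns functions the way the map and reduce stages would:

```r
# The 4 UDAF functions, as defined earlier in the tutorial.
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.partial <- function(values) values
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) values

# Hypothetical local driver: two "map tasks" each aggregate their rows
# with sumAllColumns, emit a partial, and the "reduce task" merges the
# partials and terminates.
rows <- list(c(1, 10), c(2, 20), c(3, 30), c(4, 40))
map_task <- function(chunk) {
  acc <- NULL                       # the first call receives NULL, as in Hive
  for (r in chunk) acc <- sumAllColumns(acc, r)
  sumAllColumns.partial(acc)
}
partials <- list(map_task(rows[1:2]), map_task(rows[3:4]))
acc <- NULL
for (p in partials) acc <- sumAllColumns.merge(acc, p)
result <- sumAllColumns.terminate(acc)
print(result)                       # 10 100
```

When the combine step is skipped, the partials are simply emitted per record, which is why .merge must be able to redo the same accumulation that the prefix-named function does.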
RHive - UDAF 3
This section covers the workings and arguments of the 4 UDAF functions.
From the aforementioned sumAllColumns functions, take a look at sumAllColumns and sumAllColumns.merge.
The two functions have the same code.
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
These functions process 2 arguments.
The first argument is the value previously returned by sumAllColumns or sumAllColumns.merge; the second argument is the record value that Hive hands over.
Both arguments are actually vectors (or lists). Because the values are handed over as a vector, you need to remember the order of the columns given in the SQL syntax.
The "prev" handed over as the first argument is the value returned from the previous step, so when the function is run for the first time it is passed NULL.
Thus, within the function, you can use is.null to check whether it is NULL.
if (is.null(prev)) {
  prev <- rep(0.0, length(values))
}
The values that are accumulated in this way and returned after the last record are then passed as the argument to sumAllColumns.partial and sumAllColumns.terminate.
The two functions' code is identical.
sumAllColumns.partial <- function(values) {
  values
}
sumAllColumns.terminate <- function(values) {
  values
}
These two functions are not run repeatedly over records, so they receive only one argument. The result of sumAllColumns.terminate is sent to Hive and forms one column value.
In the example above you see pairs of functions with identical code, but this is just an easy example made for the sake of a simple tutorial. In actual practice, the 4 functions' code may all differ depending on context.
Even if all 4 functions have identical code, there must always be 4 functions.
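For a case where the 4 bodies genuinely differ, consider a UDAF that averages every column instead of summing it. The avgAllColumns names and code below are an illustration of the naming rule, not functions shipped with RHive: the aggregation state carries the running per-column sums plus a row count, and only .terminate divides them.

```r
# State: a vector of column sums followed by a count of rows seen.
avgAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- c(rep(0.0, length(values)), 0)
  }
  n <- length(prev)
  prev[-n] <- prev[-n] + values   # accumulate per-column sums
  prev[n] <- prev[n] + 1          # accumulate the row count
  prev
}
avgAllColumns.partial <- function(state) state
avgAllColumns.merge <- function(prev, state) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(state))
  }
  prev + state                    # sums add up, and so do counts
}
avgAllColumns.terminate <- function(state) {
  n <- length(state)
  state[-n] / state[n]            # divide the sums by the row count
}

# Local check of the fold: three rows of two columns each.
s <- NULL
for (r in list(c(1, 2), c(3, 4), c(5, 6))) s <- avgAllColumns(s, r)
s <- avgAllColumns.merge(NULL, avgAllColumns.partial(s))
res <- avgAllColumns.terminate(s)
print(res)   # 3 4
```

To use it from Hive, the 4 functions would be registered just like the sum example: rhive.assign each of them, then rhive.exportAll("avgAllColumns").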
RHive - UDTF
This section is devoted to streamlining the code you've written so far into something more graceful.
It is quite difficult to handle a result that comes out as a single string value.
It needs to be split into its individual constituents, and the unfold function, a UDTF supported by RHive, is the proper tool for this.
Below is the result of running the SQL syntax shown in the previous example.
print(result)
     species                                                                        X_c1
1     setosa  250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995
2 versicolor                            296.8,138.50000000000003,212.99999999999997,66.3
3  virginica               329.3999999999999,148.7,277.59999999999997,101.29999999999998
The 2nd column, X_c1, is a value made by the UDAF and it consists of a single character string; you can see the values within it are separated by commas.
To make this back into a numeric vector, R functions like strsplit must be used. That causes no problems when there is a small number of records, but a problem occurs otherwise.
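For a handful of records, the client-side conversion with strsplit looks like the sketch below. The string is copied from the result above; the column names attached to it are assumptions for illustration:

```r
# One UDAF result cell: four column sums joined by commas.
x <- "250.29999999999998,171.40000000000003,73.10000000000001,12.299999999999995"

# strsplit returns a list with one character vector per input string;
# take the first element and convert it to numeric.
sums <- as.numeric(strsplit(x, ",")[[1]])
names(sums) <- c("sepallength", "sepalwidth", "petallength", "petalwidth")
print(round(sums, 1))   # named vector: 250.3 171.4 73.1 12.3
```

This works for 3 records, but doing it in R for every row of a huge result defeats the point of pushing the work into Hive.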
The example above has only 3 records, but applying the same procedure to big tables may mean encountering millions of records.
Hence the values returned by the UDAF should each be split and made into column values inside Hive itself.
To do this you need a subquery and a UDTF, and the code altered to use the UDTF is as follows:
sumAllColumns <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.partial <- function(values) {
  values
}
sumAllColumns.merge <- function(prev, values) {
  if (is.null(prev)) {
    prev <- rep(0.0, length(values))
  }
  prev + values
}
sumAllColumns.terminate <- function(values) {
  values
}
rhive.assign("sumAllColumns", sumAllColumns)
rhive.assign("sumAllColumns.partial", sumAllColumns.partial)
rhive.assign("sumAllColumns.merge", sumAllColumns.merge)
rhive.assign("sumAllColumns.terminate", sumAllColumns.terminate)
rhive.exportAll("sumAllColumns")
result <- rhive.query(
  "SELECT unfold(dummytable.dummycolumn, 0.0, 0.0, 0.0, 0.0, ',')
          AS (sepallength, sepalwidth, petallength, petalwidth)
   FROM (
     SELECT RA('sumAllColumns', sepallength, sepalwidth, petallength, petalwidth)
            AS dummycolumn
     FROM iris
     GROUP BY species
   ) dummytable")
print(result)
  sepallength sepalwidth petallength petalwidth
1       250.3      171.4        73.1       12.3
2       296.8      138.5       213.0       66.3
3       329.4      148.7       277.6      101.3
The SQL syntax became a bit more complex compared to prior examples, but in the final result you can see that the UDAF return values are all split into columns by the "unfold" UDTF.
unfold is a UDTF supplied by RHive, so there is no need to separately apply R code.
Note that a Hive UDTF must be used alone in the SELECT list; RHive cannot work around this, because it is a restriction of Hive itself.
The examples so far performed Map/Reduce only once.
If you are trying to use RHive for a very complex task, you may need multiple Map/Reduce passes, as is often seen in plain Map/Reduce or Hadoop streaming implementations.
If you combine multiple Map/Reduce passes, you may need to make temporary tables to save intermediate results and then delete them later.
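A sketch of that pattern, assuming Hive's CREATE TABLE ... AS SELECT and DROP TABLE statements are available: the table name crime_tmp and the queries below are hypothetical, and on a live cluster each string would be passed to rhive.query.

```r
# Step 1: the first Map/Reduce writes its result to a temporary table.
q_create <- "CREATE TABLE crime_tmp AS
  SELECT urbanpop, R('sumCrimes', murder, assault, rape, 0.0) AS crimes
  FROM usarrests"

# Step 2: the second Map/Reduce aggregates over the temporary table.
q_use <- "SELECT urbanpop, AVG(crimes) FROM crime_tmp GROUP BY urbanpop"

# Step 3: clean up once the chained result has been saved.
q_drop <- "DROP TABLE crime_tmp"

# Against a live cluster each step would run as, e.g.:
# rhive.query(q_create); result <- rhive.query(q_use); rhive.query(q_drop)
```

Keeping the intermediate table name in one variable makes it easier to guarantee the DROP runs even when a middle step fails.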