RHive tutorial - basic functions
This tutorial explains how to load the RHive library and how to use RHive's basic functions.
Loading RHive
Load RHive the same way you load any other R package:
library(RHive)
But before loading RHive, do not forget to configure the HADOOP_HOME and HIVE_HOME environment variables. If they are not set, you can set them temporarily before loading the library, as follows. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. Consult the tutorial "RHive installation and setting" for details on the environment variables.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
rhive.init
rhive.init is a procedure that performs internal initialization. If the environment variables were configured correctly before loading RHive, it runs automatically. But if they were not configured when RHive was loaded via library(RHive), the following error message results:
rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl, :
  No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
  No running JVM detected. Maybe .jinit() would help.
In this case, set HIVE_HOME and HADOOP_HOME as shown below, or exit R, configure the environment variables, and restart R.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
rhive.init()
Or, close R, then:

export HIVE_HOME="/service/hive-0.7.1"
export HADOOP_HOME="/service/hadoop-0.20.203.0"

and open R again.
rhive.connect
All RHive functions work only after a connection to the Hive server has been made. If you have not established a connection with the rhive.connect function before using other RHive functions, they will fail with the following error:
Error in .jcast(hiveclient[[1]], new.class =
  "org/apache/hadoop/hive/service/HiveClient", :
  cannot cast anything but Java objects
Establishing a connection with the Hive server to use RHive is simple:
rhive.connect()
The rhive.connect function can additionally take a few more arguments.
rhiveConnection <- rhive.connect("10.1.1.1")
If the Hive server is installed on a different server from the one where RHive is installed and you have to connect remotely, a connection can be made by passing arguments to the rhive.connect function.
If you have multiple Hadoop and Hive clusters configured for RHive and want to switch between the Hives, then, just as with a DB client for a database such as MySQL, you make the connections and pass one to the functions via an argument to explicitly select a connection.
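For example, two connections might be kept and selected per call, roughly as sketched below. The hiveclient argument name is an assumption here; check the documentation of your RHive version for the exact parameter used to pass a connection.

```r
# Sketch only: the `hiveclient` argument name is an assumption,
# and the IP addresses are placeholders for two Hive clusters.
connA <- rhive.connect("10.1.1.1")   # first Hive cluster
connB <- rhive.connect("10.1.1.2")   # second Hive cluster

rhive.query("SHOW TABLES", hiveclient = connA)  # runs against cluster A
rhive.query("SHOW TABLES", hiveclient = connB)  # runs against cluster B

rhive.close(connB)
rhive.close(connA)
```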
rhive.query
If you have used Hive, you probably know that Hive supports SQL syntax for handling data on Map/Reduce and HDFS. rhive.query sends SQL to Hive and receives the results. Users who know SQL will find examples like the following familiar.
rhive.query("SELECT * FROM usarrests")
Running the example above prints the contents of the table named 'usarrests' to the screen. Instead of merely printing the returned result to the screen, you can also assign it to a data.frame object.
resultDF <- rhive.query("SELECT * FROM usarrests")
One thing to beware of: if the data returned by rhive.query is bigger than the memory of the machine running R, exhausting the available memory will cause an error. So you must not fetch data of such size into an object. It is better to first create a temporary table and insert the results of the SQL into that temporary table. You can do it as follows.
rhive.query("
  CREATE TABLE new_usarrests (
    rowname  string,
    murder   double,
    assault  int,
    urbanpop int,
    rape     double
  )")
rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")
Consult the Hive documentation for a detailed account of how to use Hive SQL.
rhive.close
If you have finished using Hive and no longer wish to use RHive functions, you can use the rhive.close function to terminate the connection.
rhive.close()
Alternatively, you can assign a specific connection to close it.
conn <- rhive.connect()
rhive.close(conn)
rhive.list.tables
The rhive.list.tables function returns the list of tables in Hive.
rhive.list.tables()
       tab_name
1         aids2
2 new_usarrests
3     usarrests
This is effectively identical to this:
rhive.query("SHOW TABLES")
rhive.desc.table
The rhive.desc.table function shows the description of the chosen table.
rhive.desc.table("usarrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double
This is effectively identical to this:
rhive.query("DESC usarrests")
rhive.load.table
The rhive.load.table function loads a Hive table's contents into an R data.frame object.
df1 <- rhive.load.table("usarrests")
df1
This is effectively identical to this:
df1 <- rhive.query("SELECT * FROM usarrests")
df1
rhive.write.table
The rhive.write.table function is the inverse of rhive.load.table, but it is more convenient. Normally, to add data to a table in Hive you must first create the table. rhive.write.table requires no such additional work: it creates a Hive table from an R data.frame and inserts all the data into it.
head(UScrime)
    M So Ed Po1 Po2  LF M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time   y
1 151  1 91  58  56 510 950  33 301 108 41 394  261 0.084602 26.2011 791
...
The rhive.write.table function encounters an error and does not work if the table to be created already exists in Hive. Hence, if you attempt to save a data.frame whose name matches a table already in Hive, you must delete that table before using rhive.write.table.
if (rhive.exist.table("uscrime")) {
  rhive.query("DROP TABLE uscrime")
}
rhive.write.table(UScrime)
RHive - alias functions
RHive's function names look similar to S3 generic naming rules, but many are actually not generic. This leaves the S3 generic names free for functions that RHive may or may not support in the future. For users who dislike the confusion wrought by functions that contain "." yet do not count as generic, there are functions with different names that serve the same roles. These alias functions are described below.
hiveConnect
This is the same as rhive.connect.
hiveQuery
This is the same as rhive.query.
hiveClose
This is the same as rhive.close.
hiveListTables
This is the same as rhive.list.tables.
hiveDescTable
This is the same as rhive.desc.table.
hiveLoadTable
This is the same as rhive.load.table.
rhive.basic.cut
rhive.basic.cut converts one numerical column of a table into one factorized column. First, the range of the numerical column is divided into intervals, and the values in the column are factorized according to which interval they fall into. rhive.basic.cut takes the following six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. summary = TRUE returns the total counts of values falling into each interval; if FALSE, the name of a new table containing the factorized column is returned. forcedRef = TRUE forces rhive.basic.cut to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for summary = FALSE
>
table_name
=
rhive.basic.cut(tablename
=
"iris",
col
=
"sepallength",
breaks
=
seq(0,
5,
0.5),
right
=
FALSE,
summary
=
FALSE,
forcedRef
=
TRUE)
> table_name
[1] "rhive_result_1330382904"
attr(,"result:size")
[1] 4296
> results = rhive.query("select * from rhive_result_1330382904")
> head(results)
  rowname sepalwidth petallength petalwidth species sepallength
1       1        3.5         1.4        0.2  setosa        NULL
2       2        3.0         1.4        0.2  setosa   [4.5,5.0)
3       3        3.2         1.3        0.2  setosa   [4.5,5.0)
4       4        3.1         1.5        0.2  setosa   [4.5,5.0)
5       5        3.6         1.4        0.2  setosa        NULL
6       6        3.9         1.7        0.4  setosa        NULL
Example for summary = TRUE
> summary = rhive.basic.cut(tablename = "iris", col = "sepallength",
    breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
> summary
     NULL [4.0,4.5) [4.5,5.0)
      128         4        18
rhive.basic.cut2
rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each numerical column is divided into intervals, and the values in each column are factorized according to which interval they fall into. rhive.basic.cut2 takes the following eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 are the numerical cut points for the two columns. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two numerical columns even after the conversion; otherwise the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for right = TRUE and keepCol = FALSE
> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2
= "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right =
TRUE, keepCol = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330385833"
attr(,"result:size")
[1] 5272
> results = rhive.query("select * from rhive_result_1330385833")
> head(results)
5 5 3.6 0.2 setosa 5.0 NULL 1.4 [1.0,1.5)
...
rhive.basic.xtabs
rhive.basic.xtabs makes a contingency table from cross-classifying factors. A formula object and a table name are used as input arguments, and a contingency table in matrix format is returned based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two column names agegp and alcgp from the table are the cross-classifying factors, and the observations for each combination of the factors are summed over another column, ncontrols.
Example for esoph data
> xtab_formula = as.formula(paste("ncontrols", "~", "agegp", "+", "alcgp", sep = ""))
> xtab_formula
ncontrols ~ agegp + alcgp
> table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
> head(table_result)
       alcgp
agegp   0-39g/day 120+ 40-79 80-119
  25-34        61    5    45      5
  35-44        89   10    80     20
  45-54        78   15    81     39
  55-64        89   26    84     43
  65-74        71    8    53     29
  75+          27    3    12      2
rhive.basic.t.test
The rhive.basic.t.test function runs Welch's t-test on two samples. The difference between the two samples' means is tested against the alternative hypothesis that the difference is not 0; that is, a two-sided test is performed.
The following is an example of testing the mean difference between the irises' sepal lengths and petal lengths. Pay attention to how the function is called with the "sepallength" and "petallength" columns.
> rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
       t
13.14223

$parameter
      df
211.5427

$p.value
[1] 0

$estimate
$estimate[[1]]
mean of x
 5.843333

$estimate[[2]]
mean of y
    3.758

>
Interpreting the results: the p-value is 0, revealing a difference between the means of sepal length and petal length. The resulting statistics are converted into an R list object, and a string assembled from the statistics is printed to the console.
The iris data consists of 150 observations provided with R. Running R's t.test on the same data yields a slightly different t-statistic of 13.0984. This is because the variance used by the t.test function is the sample variance, while rhive.basic.t.test uses the population variance. With little data, as in this example, the t-statistics may deviate, but the deviation dwindles as the data grows. Since rhive.basic.t.test is designed with massive data analysis in mind, it uses the population variance for speedy calculation.
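This difference can be reproduced in plain R, without a Hive connection. The sketch below computes the Welch t-statistic twice on the built-in iris data: once with the sample variance, as t.test does, and once with the population variance, as rhive.basic.t.test does.

```r
# Welch t-statistic on iris, with sample vs. population variance.
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)
m <- length(y)

# Sample variance (what t.test uses): var() divides by n - 1.
t_sample <- (mean(x) - mean(y)) / sqrt(var(x) / n + var(y) / m)

# Population variance (what rhive.basic.t.test uses): divide by n instead.
pvar <- function(v) var(v) * (length(v) - 1) / length(v)
t_pop <- (mean(x) - mean(y)) / sqrt(pvar(x) / n + pvar(y) / m)

round(t_sample, 4)  # 13.0984, matching t.test
round(t_pop, 4)     # 13.1422, matching rhive.basic.t.test
```

With 150 observations per sample the two statistics already agree to three significant digits, which is why the distinction only matters for small data.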
rhive.block.sample
The rhive.block.sample function samples a Hive table by block and returns the sample as a new table. Its percent argument is optional and sets the percentage of data to extract from the total; the default value is 0.01, meaning 0.01% of the total data is extracted. Note that percent is not the ratio of the sampled row count to the total row count, but rather the ratio of sampled blocks to total blocks: rhive.block.sample samples by block. For this reason, the entire data set may be returned when using rhive.block.sample on Hive tables of small size; this occurs when the data is smaller than the block size configured in Hive.
The seed argument specifies the random seed used when Hive executes block sampling. If the random seeds are identical, Hive's block sampling returns the same results. To guarantee random samples on every call, it is best to set the seed argument of rhive.block.sample using R's sample function.
The subset argument is optional and specifies a condition on the data to be extracted from the target Hive table. It takes a character value corresponding to the 'where' clause of Hive's HQL, so it must use syntax valid in an HQL where clause.
The return value of rhive.block.sample is a character value: the name of the Hive table containing the sampled blocks. That is, rhive.block.sample automatically creates a temporary Hive table holding the sampled blocks and returns that table's name. The following example samples 0.01% of a Hive table named listvirtualmachines, using R's sample function to pick the random seed for Hive's block sampling.
seedNumber <- sample(1:2^16, 1)
rhive.block.sample("listvirtualmachines", seed=seedNumber)
[1] "rhive_sblk_1330404552"
As per this example, a Hive table named "rhive_sblk_1330404552", holding 0.01% of the data from the Hive table listvirtualmachines, has been created.
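Combining the optional arguments, a call that restricts what gets sampled might look like the sketch below. The table weblogs and its column country are hypothetical, and the call needs a live Hive connection.

```r
# Sketch only: "weblogs" and its "country" column are hypothetical names.
seedNumber <- sample(1:2^16, 1)            # a fresh random seed, as above
sampled <- rhive.block.sample("weblogs",
                              percent = 0.1,                # 0.1% of blocks
                              seed    = seedNumber,
                              subset  = "country = 'KR'")   # HQL where clause
# The returned table name can then be queried like any Hive table.
rhive.query(paste("select count(*) from", sampled))
```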
rhive.basic.scale
The rhive.basic.scale function standardizes a numerical column to mean 0 and standard deviation 1. The first argument is the table name, and the second is the column name. The returned list contains the name of a result table to which a "scaled_<column name>" column has been added; like any other Hive table, it can be accessed and manipulated through RHive.
scaled <- rhive.basic.scale("iris", "sepallength")
attr(scaled, "scaled:center")
# [1] 5.843333
attr(scaled, "scaled:scale")
# [1] 0.8253013
> rhive.desc.table(scaled[[1]])
#             col_name data_type comment
# 1            rowname    string
# 2         sepalwidth    double
# 3        petallength    double
# 4         petalwidth    double
# 5            species    string
# 6        sepallength    double
# 7 scaled_sepallength    double
rhive.basic.by
The rhive.basic.by function runs a group-by on a specified column. The code below applies group by to the "species" column and returns the result of applying the sum function to "sepallength". In the results you will find the sum of sepallength for each species.
rhive.basic.by("iris", "species", "sum", "sepallength")
#      species   sum
# 1     setosa 250.3
# 2 versicolor 296.8
# 3  virginica 329.4
rhive.basic.merge
rhive.basic.merge creates a new data set by merging two tables on their common columns.
# checking data
rhive.query('select * from iris limit 5')
  rowname sepallength sepalwidth petallength petalwidth species
1       1         5.1        3.5         1.4        0.2  setosa
2       2         4.9        3.0         1.4        0.2  setosa
3       3         4.7        3.2         1.3        0.2  setosa
4       4         4.6        3.1         1.5        0.2  setosa
5       5         5.0        3.6         1.4        0.2  setosa
rhive.query('select * from usarrests limit 5')
     rowname murder assault urbanpop rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6
## rhive.basic.merge
rhive.basic.merge('iris', 'usarrests', by.x='sepallength', by.y='murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
Merging in this way is similar to a 'join' in SQL. The following is equivalent:
# Use join to extract and print the names of all rows not found to be common
# after merging. Should row names overlap, only print out the name of the
# former row.
rhive.big.query('select a.sepallength, a.sepalwidth, a.petallength,
    a.petalwidth, a.species, b.assault, b.urbanpop, b.rape, a.rowname
    from iris a join usarrests b on a.sepallength = b.murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
rhive.basic.mode
rhive.basic.mode returns the mode and its frequency within a specified column of a Hive table.
rhive.basic.mode('iris', 'sepallength')
  sepallength freq
1           5   10
rhive.basic.range
rhive.basic.range returns the smallest and largest values in the specified numerical column of a Hive table.
rhive.basic.range('iris', 'sepallength')
[1] 4.3 7.9