2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010)

An Efficient Data Mining Framework on Hadoop using Java Persistence API

Yang Lai
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
Graduate University of Chinese Academy of Sciences, Beijing 100039, China.
e-mail: yanglai@ics.ict.ac.cn

Shi ZhongZhi
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
e-mail: shizz@ics.ict.ac.cn

Abstract—Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project using the MapReduce framework in Java, has become of significant interest in distributed data mining. A feasible distributed data indexing algorithm is proposed for Hadoop data mining, based on ZSCORE binning, inverted indexing and the Hadoop SequenceFile format. A data mining framework on Hadoop using the Java Persistence API (JPA) and MySQL Cluster is also proposed, and is elaborated in the implementation of a decision tree algorithm on Hadoop. We compare the data indexing algorithm with Hadoop MapFile indexing, which performs a binary search, in a modest cloud environment; the results show the algorithm is more efficient than naïve MapFile indexing. We also compare JDBC and JPA implementations of the data mining framework. The performance shows the framework is efficient for data mining on Hadoop.

Keywords—Data Mining, Distributed applications, JPA, ORM, Distributed file systems, Cloud computing

I. INTRODUCTION

Many approaches have been proposed for handling high-dimensional and large-scale data, in which query processing is the bottleneck. "Algorithms for knowledge discovery tasks are often based on range searches or nearest neighbor search in multidimensional feature spaces" [1]. Business intelligence and data warehouses can hold a terabyte or more of data, and cloud computing has emerged to meet the increasing demands of data mining. MapReduce is a programming framework and an associated implementation designed for large data sets. The details of partitioning, scheduling, failure handling and communication are hidden by MapReduce; users simply define map functions that create intermediate <key, value> tuples, and reduce functions that merge the tuples for further processing [2].

A concise indexing Hadoop implementation of MapReduce is presented in McCreadie's work [3]. Ralf proposes a basic program skeleton to underlie MapReduce computations [4]. Moretti presents an abstraction for scalable data mining, in which data and computation are distributed in a computing cloud with minimal user effort [5]. Gillick uses Hadoop to implement query-based learning [6].

Most data mining algorithms are based on object-oriented programming and run in memory. Han elaborates many of these methods [7]. A link-based structure, like the X-tree [8], may be suitable for indexing high-dimensional data sets.

However, the following features of the MapReduce framework are unsuitable for data mining. First, regarding globality, map tasks are independent of one another, as are reduce tasks; data mining requires that all of the training data be converted into a global model, such as a decision tree or clustering tree, whereas each task in the MapReduce framework handles only its own partition of the data set and writes its results into the Hadoop Distributed File System (HDFS). Second, random-write operations are disallowed by HDFS, which rules out link-based data models in Hadoop such as linked lists, trees and graphs. Finally, map and reduce tasks are scan-based and end when their partition of the training dataset has been processed, while data mining requires a persistent model for the subsequent testing phase.

A database is an ideal persistent repository for objects generated by data mining in Hadoop tasks. To mine high-dimensional and large-scale data on Hadoop, we employ Object-Relational Mapping (ORM), which stores objects whose size may surpass memory limits in a relational database. The Java Persistence API (JPA) provides a persistence model for ORM [9]; it can store all the objects generated from data mining models in a relational database. A distributed database is a suitable solution to ensure robustness in distributed handling.
MySQL Cluster is designed to withstand any single point of failure [10], which makes it consistent with Hadoop.

We performed the same work that McCreadie performed [3] and now propose a novel Hadoop indexing implementation for continuous values. Using JPA and MySQL Cluster on Hadoop, we propose an efficient data mining framework, which is elaborated through a decision tree implementation. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically. We also compare a naïve JDBC implementation with our JPA implementation on Hadoop.

The rest of the paper is organized as follows. Section II explains the index structures, flowchart and algorithms, and proposes the data mining framework. Section III describes our experimental setting and results. Section IV offers conclusions and suggests possible directions for future work.
II. METHOD

A. Naïve Data Indexing

Naïve data indexing is necessary for querying data in classification or clustering algorithms, and it can dramatically decrease the complexity of querying. As in McCreadie's work [3], we create an inverted index for continuous values.

1) Definition
An (n × m) dataset is an (n × m) matrix containing n rows of data, each with m columns of float values. Let min(i), max(i), sum(i), avg(i), std(i) and cnt(i) (i ∈ [1..m]) denote the minimum, maximum, sum, average, standard deviation and count of the i-th column, respectively, i.e., the essential statistical measures of the i-th column of the dataset. Let sum2(i) be the dot product of the i-th column with itself, which is used for std(i). From the formula ZSCORE(x) = (x - avg()) / std() [7], a new measure is defined:

    YSCORE(x) = round((x - avg()) × iStep / std())

where iStep is an experimentally chosen integer parameter.
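For illustration only (this helper is not part of the original implementation), the binning rule can be written in Java as follows, assuming avg, std and iStep for the column have already been computed:

// Hypothetical helper illustrating YSCORE binning for a single column value.
// avg and std are the column's precomputed mean and standard deviation;
// iStep is the experimental integer parameter described above.
public final class YScore {
    public static int bin(double x, double avg, double std, int iStep) {
        // ZSCORE(x) = (x - avg) / std; YSCORE scales it by iStep and rounds.
        return (int) Math.round((x - avg) * iStep / std);
    }

    public static void main(String[] args) {
        // Example: with avg = 0, std = 1 and iStep = 8, the value 0.524 falls into bin 4.
        System.out.println(bin(0.524, 0.0, 1.0, 8));
    }
}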
2) Simple Design of the Inverted Index
In general, high-dimensional datasets are stored as tables in a sophisticated DBMS for most data mining tasks. Table 1 shows the format of a dataset in a DBMS; for data mining, CSV is the common data interchange file format.

Table 1 A normal data table.
RecordsNO   Field01   Field02   …
Rec01       Value02   Value01   …
Rec02       Value05   Value01   …
Rec03       Value02   Value04   …
…           …         …         …

Figure 1 shows the flowchart in which the whole procedure for indexing and similarity querying is illustrated.

Discretization calculates the essential statistical measures (Figure 2), then performs YSCORE binning on the FloatData.TXT file (CSV format) and outputs the IntData.TXT file, in which each line is labeled consecutively with a unique number (LineNo).

The encoder transforms the text file IntData.TXT into the Hadoop SequenceFile format IntData.SEQ and generates IntData.DAT. To locate a record for a query rapidly, the records must be of equal length. IntData.DAT contains an (n × m) matrix, i.e., n rows of records containing m columns of bin IDs, as does IntData.SEQ. The structure of IntData.DAT is shown in Table 2.

With the indexer, the index files Val2NO.DAT and Col2Val.DAT are output with MapReduce from the IntData.SEQ file. To obtain the list of LineNo's for a specific bin, the list is stored in a structural file (Val2NO.DAT), shown in Table 4; the address and the length of the list are stored in another structural file (Col2Val.DAT), shown in Table 3. The two structural files are easy to access by offset.

Table 2 Structure of IntData.DAT in BNF.
IntData.DAT ::= <Record>
<Record> ::= <Bins>
<Bins> ::= Integer (32-bit binary)

Table 3 Structure of Col2Val.DAT in BNF.
Col2Val.DAT ::= <Field>
<Field> ::= <Bins, Address, LengthLineNo>
<Bins> ::= Integer (32-bit binary)
<Address> ::= Integer (64-bit binary)
<LengthLineNo> ::= Integer (64-bit binary)

Table 4 Structure of Val2NO.DAT in BNF.
Val2NO.DAT ::= <LineNo>
<LineNo> ::= Integer (64-bit binary)

Figure 1 Flowchart for searching the Hadoop index: FloatData.TXT → Discretization → IntData.TXT → Encoder → IntData.SEQ and IntData.DAT → Indexer → Col2Val.DAT and Val2NO.DAT → Searcher → InterSection → Query Result.

The searcher, given a query (a vector of m float values), performs YSCORE binning with the same statistical measures and yields a vector of m integers for searching; it then outputs a list of matching record numbers. The search algorithm is shown in Figure 5.

In intersection, the positions in Val2NO.DAT of each integer of a particular query are extracted from Col2Val.DAT, and the lists of LineNo's for each integer can be found. The result is computed by intersecting the lists. The algorithm itself is omitted.
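Although the intersection algorithm is omitted above, a plain merge-based intersection of two sorted LineNo lists could look like the following Java sketch; the class and method names are illustrative assumptions, not code from the implementation.

// Hypothetical merge-based intersection of two ascending LineNo lists,
// applied pairwise to the lists fetched from Val2NO.DAT.
public final class PostingIntersection {
    public static long[] intersect(long[] a, long[] b) {
        long[] out = new long[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out[k++] = a[i]; i++; j++; }
            else if (a[i] < b[j]) { i++; }
            else { j++; }
        }
        return java.util.Arrays.copyOf(out, k);
    }
}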
3) Statistics phase
In Hadoop, the data set is scanned once to calculate the four distributive measures min(i), max(i), sum(i) and sum2(i), as well as cnt(i), for the i-th column (Figure 2); the algebraic measures of average and deviation are then computed from them. Therefore, a (5 × m) array is used for the keys in the MapReduce process: the first row contains the keys for the minimum of a column; the second row contains the keys for the maximum; the third contains the keys for the sum; the fourth contains the keys for the square of the sum; and the fifth contains the keys for the row count.

FUNCTION map (key, value, output)
1.  f[1..m] ← parse value into a float array;
2.  iW ← f.length;
3.  FOR i = 1 THRU m
    a. IF min[i] > f[i] THEN
       1) min[i] ← f[i];
       2) output.collect(iW*0 + i, min[i]);
    b. IF max[i] < f[i] THEN
       1) max[i] ← f[i];
       2) output.collect(iW*1 + i, max[i]);
    c. output.collect(iW*2 + i, f[i]);          /* sum term */
    d. output.collect(iW*3 + i, f[i] × f[i]);   /* sum2 term */
4.  output.collect(iW*4, 1);                    /* row count */

FUNCTION reduce (key, values, output)
5.  iCategory ← (int) key.get() / LenFields;
6.  SWITCH (iCategory)
7.  CASE 0: /* min() */
    a. WHILE (values.hasNext())
       1) fTmp ← values.next();
       2) IF (min > fTmp) THEN min ← fTmp;
    b. output.collect(key, min);
    c. BREAK;
8.  CASE 1: /* max() */
    a. WHILE (values.hasNext())
       1) fTmp ← values.next();
       2) IF (max < fTmp) THEN max ← fTmp;
    b. output.collect(key, max);
    c. BREAK;
9.  CASE 2: /* sum() */
    a. WHILE (values.hasNext())
       1) sum ← sum + values.next();
    b. output.collect(key, sum);
    c. BREAK;
10. CASE 3: /* sum2() */
    a. WHILE (values.hasNext())
       1) sum2 ← sum2 + values.next();
    b. output.collect(key, sum2);
    c. BREAK;
11. CASE 4: /* cnt() */
    a. WHILE (values.hasNext())
       1) cnt ← cnt + values.next();
    b. output.collect(key, cnt);
    c. BREAK;

FUNCTION statistics (input)
12. faStat[][] ← 0; key ← 0; value ← 0; cnt ← 0;
13. WHILE input.hasNext()
    a. <key, value> ← parse the next line of input;
    b. iCategory ← key / LenFields;
    c. iField ← key MOD LenFields;
    d. IF (key == LenFields * 4) THEN
       1) cnt ← value;
    e. ELSE
       1) faStat[iCategory][iField] ← value;
14. avg[] ← 0; std[] ← 0;
15. FOR i = 1 THRU LenFields
    a. avg(i) ← faStat[2][i] / cnt;
    b. std(i) ← sqrt((faStat[3][i] - faStat[2][i] × faStat[2][i] / cnt) / (cnt - 1));
16. RETURN min(i), max(i), sum(i), avg(i), std(i), cnt;
END OF ALGORITHM STATISTICS

Figure 2 Algorithm Statistics for continuous values in Hadoop.
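As a rough Java rendering of the map step in Figure 2 (a simplified variant that emits one tuple per column per measure and leaves all aggregation to the reducer; the Hadoop 0.19 "mapred" API and comma-separated input are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified statistics mapper: for every column i of a CSV record it emits
// <category*m + i, payload>, so the reducer can compute min, max, sum, sum2
// and cnt per column. Class name and key layout are illustrative.
public class StatisticsMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, DoubleWritable> out, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        int m = fields.length;
        for (int i = 0; i < m; i++) {
            double v = Double.parseDouble(fields[i]);
            out.collect(new LongWritable(0L * m + i), new DoubleWritable(v));      // min candidate
            out.collect(new LongWritable(1L * m + i), new DoubleWritable(v));      // max candidate
            out.collect(new LongWritable(2L * m + i), new DoubleWritable(v));      // sum term
            out.collect(new LongWritable(3L * m + i), new DoubleWritable(v * v));  // sum2 term
        }
        out.collect(new LongWritable(4L * m), new DoubleWritable(1.0));            // row count
    }
}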
The algorithm runs in O(N/mapper) time, where N is the number of records in the data set and mapper is the number of map tasks. After step 4, every partition has been condensed into the five distributive measures; the reduce function then merges them into global measures. Keys for the map function are the positions of the next line, and values are the next line of text.

After step 11 in Figure 2, Hadoop creates result files named "part-xxxxx", which contain the five distributive measures for each field; the number of these result files depends on the number of reducers. From step 15, the algebraic measures avg() and std() are calculated from the five distributive measures.

4) YSCORE phase
Figure 3 shows YSCORE binning of 1,000,000 values drawn from a normal distribution with mean 0 and standard deviation 1. With an iStep of 1, the upper-left panel shows exactly the three-sigma rule: there are only 9 bins and the maximum count in a bin is near 400,000. When the iStep is changed to 8, the upper-right panel shows that the number of bins increases to 36 and the histogram fits the normal distribution curve better; importantly, the maximum count in a bin drops to about 50,000. When the iStep is set to 16 or 32, the maximum drops to 25,000 or 10,000, respectively. These lower counts benefit inverted indexing.

Figure 3 YSCORE binning of 1,000,000 values from a normal distribution with mean 0 and standard deviation 1; the four panels show binning with an iStep of 1, 8, 16 and 32, respectively (x-axis: YSCORE bin, y-axis: count).
According to [11], "the central limit theorem (CLT) states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed." Even if the distribution of a field does not follow a normal distribution, in large-scale data sets the values of the field can be put into bins using its mean and deviation when the size approaches a terabyte.

These interval values are put into bins using YSCORE binning [7]. The inverted index then takes the format of Table 5, following Table 1: the first column holds all fields of the original dataset; the second holds the <bin, RecNoArray> tuples.

Table 5 The structure of the inverted index for interval values.
Fields    Inverted Index
Field01   <Bin02, RecNoArray>, <Bin05, RecNoArray>, …
Field02   <Bin01, RecNoArray>, <Bin04, RecNoArray>, …
…         …

The YSCORE binning phase does not need a reduce function, thus eliminating the cost of sorting and transferring; the result files are precisely the partitions on which the map functions work. Keys in the map function are the positions of the next line, and values are the next line of text. The input file is FloatData.TXT and the output file is IntData.TXT, as shown in Figure 1.

The cumulative distribution function of the standard normal distribution is 50% at 0, 60% at 0.253, 70% at 0.524, 80% at 0.842, and 90% at 1.282. If the distribution is approximately close to the standard normal distribution, YSCORE therefore determines a balanced binning from the ZSCORE value.

After the statistics phase and the YSCORE binning phase, Hadoop can handle the discretized data as in the commonly used MapReduce word-count example.

B. Improved Data Indexing
Naïve data indexing has a flaw in that it uses only one reducer to keep the global id for the keys, which causes a performance problem that is not scalable. A tractable <LongWritable, LongWritable> pair is therefore employed to transform the three variables (the global id, the attribute and the bin id) into the intermediate key-value pairs of the JobIndex (Figure 4). In general, neither the number of attributes nor the number of bins in an attribute will exceed 2^32. A LongWritable variable is 64 bits, so an attribute can be put in the upper 32 bits of a LongWritable and the bin id in the lower 32 bits. These attribute-bin LongWritables are used as keys for the sorting phase in Hadoop.

The number of ReducerIndex tasks is set equal to the number of attributes. The method setPartitionerClass() in Hadoop decides to which ReducerIndex task an intermediate key-value pair should be sent, using the upper 32 bits of the keys. In a ReducerIndex task, the attribute-bin LongWritables are grouped by the bin id in the lower 32 bits of the keys, and then the inverted index <Bin, RecNoArray> (Table 5) can be output easily. Each reduce task in Hadoop generates its own result; therefore, each attribute has its own index file.

Figure 4 The data flow diagram for improved indexing of continuous values in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → JobDiscret (mapperDiscrete 1..m) → DiscretedData ::= <LongWritable, IntArrayWritable> → JobIndex (mapperIndex 1..m → IntermediateData ::= <LongWritable, LongWritable> → ReducerIndex 1..r) → IndexData ::= <IntWritable, LongArrayWritable>.
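To make the attribute-bin key layout concrete, the following Java sketch packs the attribute index into the upper 32 bits of a long and the bin id into the lower 32 bits, and partitions intermediate pairs by attribute as described above; the class name is hypothetical and the old "mapred" Partitioner interface is an assumption.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Packs <attribute, bin> into one 64-bit key: attribute in the upper 32 bits,
// bin id in the lower 32 bits, matching the JobIndex intermediate keys.
public class AttrBinPartitioner implements Partitioner<LongWritable, LongWritable> {

    public static long pack(int attribute, int bin) {
        return ((long) attribute << 32) | (bin & 0xFFFFFFFFL);
    }

    public static int attributeOf(long key) { return (int) (key >>> 32); }

    public static int binOf(long key) { return (int) key; }

    // Route each key to the reducer responsible for its attribute, so that
    // one ReducerIndex task writes one attribute's inverted index file.
    public int getPartition(LongWritable key, LongWritable value, int numPartitions) {
        return attributeOf(key.get()) % numPartitions;
    }

    public void configure(JobConf job) {
        // no configuration needed for this sketch
    }
}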
C. Hadoop MapFile Indexing
The MapFile format is a directory consisting of two files, a data file and an index file; both are of the SequenceFile type. The data file is a normal SequenceFile; the index file consists of a few keys at intervals and their corresponding addresses in the data file. To query a particular key for its value, a binary search is performed in the index file and a sequential search is performed in the data file within the range defined by the corresponding addresses [12]. The algorithm for searching by MapFile is shown in Figure 5.

FUNCTION SearchMapFile (float[] faQuery)
17. laResult ← null; laTmp ← null; isFirst ← TRUE;
18. FOR iField = 1 THRU faQuery.length
    a. iValue ← YSCORE(faQuery[iField]);
    b. Indexer ← new MapFile.Reader(index file of iField);
    c. Indexer.get(iValue, laTmp);
    d. IF (isFirst) THEN
       1) laResult ← laTmp; isFirst ← FALSE;
    e. ELSE
       1) laResult ← getIntersection(laResult, laTmp);
19. RETURN laResult;
END OF ALGORITHM SearchMapFile

Figure 5 Algorithm of searching by MapFile.

HBase, a Bigtable-like storage for the Hadoop distributed computing environment, is based on MapFile [13]. It does not perform well, owing to the complexity of MapFile.
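For reference, the per-attribute lookup of Figure 5 corresponds to MapFile.Reader calls such as the following sketch; Text is used here only as a stand-in for the custom array-of-LineNo Writable that the real index would store.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Hypothetical lookup of one YSCORE bin id in a per-attribute MapFile index.
public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String indexDir = args[0];               // MapFile directory for one attribute
        int bin = Integer.parseInt(args[1]);     // bin id to look up

        MapFile.Reader reader = new MapFile.Reader(fs, indexDir, conf);
        Text postingList = new Text();           // stand-in value type
        if (reader.get(new IntWritable(bin), postingList) != null) {
            System.out.println("LineNo list for bin " + bin + ": " + postingList);
        }
        reader.close();
    }
}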
D. Data mining framework on Hadoop
Although this improved data indexing performs better than MapFile indexing, the overhead of range querying by intersection algorithms is still too high. To apply common object-oriented data mining algorithms in Hadoop, the three problems posed in the introduction (globality, random writes and duration) should be handled. For this, we propose a data mining framework on Hadoop using JPA.

1) MySQL Cluster
After many failures, we found that, for data mining with Hadoop, a database is an ideal persistent repository for handling the three problems mentioned. To ensure computing robustness, a distributed database is the suitable solution for distributed handling of high-dimensional and large-scale data. MySQL Cluster is designed to eliminate any single point of failure [10] and is based on open source code; both characteristics make it consistent with data mining on Hadoop.

2) ORM
JPA and naïve JDBC are used to implement the ORM for the objects produced by data mining in a database (Figure 6). To improve the data indexing method, a KD-tree or X-tree may be suitable; however, the decision tree is considered a classical algorithm and better elaborates the framework.

Figure 6 The data flow diagram of the data mining framework on Hadoop: TrainingData and TestingData are scanned by Hadoop data mining tasks, whose objects are persisted through JPA or JDBC into ndbcluster (MySQL Cluster).

In the data mining framework, a large-scale dataset is scanned by Hadoop. The necessary information is extracted using classical data mining modules, which are embedded in the Hadoop map or reduce tasks. The extracted objects, such as lists, trees and graphs, are sent automatically into a database, in this case MySQL Cluster. As a test case, the test dataset is scanned and handled by each data mining task.
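As an illustration of how a mined object might be mapped by JPA (the entity classes are not listed in the paper, so the class, fields and persistence-unit name below are assumptions), a binary-tree node could be declared and persisted as follows:

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToOne;
import javax.persistence.Persistence;

// Hypothetical JPA entity for one decision-tree node stored in MySQL Cluster.
@Entity
public class TreeNode {
    @Id @GeneratedValue
    private Long id;

    private int attribute;   // which field of a record is tested at this node
    private float value;     // split value: <= value goes left, > value goes right
    private boolean leaf;
    private int category;    // class label, meaningful only for leaf nodes

    @OneToOne private TreeNode childLeft;
    @OneToOne private TreeNode childRight;

    // Getters and setters omitted for brevity.

    // Minimal usage sketch; "miningPU" is an assumed persistence-unit name.
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("miningPU");
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();
        em.persist(new TreeNode());
        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}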
3) Stored procedure
After the objects obtained from data mining are stored in the database, all the details can be finished using stored procedures, such as finding a path between two nodes, returning all the children of a node and returning the category of a test case. A data mining task should only send demands and receive results; this decreases network overhead and latency.

The following code returns all the children of a node in a binary tree (Figure 7).

CREATE PROCEDURE getChild (IN rootid INT)
BEGIN
  DROP TABLE IF EXISTS chd;
  CREATE TEMPORARY TABLE chd (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS par;
  CREATE TEMPORARY TABLE par (id INT NOT NULL DEFAULT -1) ENGINE = MEMORY;
  DROP TABLE IF EXISTS result;
  CREATE TABLE result (id INT NOT NULL DEFAULT -1);
  INSERT INTO chd SELECT A.ID FROM btree A WHERE rootid = A.PID;
  WHILE ROW_COUNT() > 0 DO
    TRUNCATE par;
    INSERT INTO par SELECT * FROM chd;
    INSERT INTO result SELECT * FROM chd;
    TRUNCATE chd;
    INSERT INTO chd SELECT A.ID FROM btree A, par B WHERE B.ID = A.PID;
  END WHILE;
END;

Figure 7 Stored procedure for all children of a node in MySQL.

4) Decision tree implementation
As shown in the textbook [7], a decision tree for continuous values builds a binary tree. Let D be a dataset. The entropy of D is given by

    Info(D) = -Σ_{i=1..m} p_i log2(p_i)    (1)

where m is the number of categories and p_i is the probability of category i. Let A be an attribute in D. The entropy of D based on a binary partitioning by A is given by

    Info_A(D) = (|D1| / |D|) Info(D1) + (|D2| / |D|) Info(D2)    (2)

where Info() is from formula (1), D1 is the part of D that belongs to the left child node, and D2 is the part of D that belongs to the right child node.

Each node in the tree contains two members, an ATTRIBUTE and a VALUE. Given a test record (an array of continuous values), the former tells which field should be checked, and the latter tells which child node should be selected: if the test record's value in the member ATTRIBUTE is less than or equal to the member VALUE, the left child node is selected; otherwise the right child node is selected.
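A compact Java sketch of formulas (1) and (2), assuming per-class record counts are already available for D, D1 and D2 (the class is illustrative, not code from the implementation):

// Computes Info(D) from class counts (formula 1) and the weighted entropy
// Info_A(D) of a binary split into D1 and D2 (formula 2).
public final class Entropy {
    public static double info(long[] classCounts) {
        long total = 0;
        for (long c : classCounts) total += c;
        double info = 0.0;
        for (long c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            info -= p * (Math.log(p) / Math.log(2));
        }
        return info;
    }

    public static double infoA(long[] leftCounts, long[] rightCounts) {
        long nLeft = 0, nRight = 0;
        for (long c : leftCounts) nLeft += c;
        for (long c : rightCounts) nRight += c;
        double n = nLeft + nRight;
        return (nLeft / n) * info(leftCounts) + (nRight / n) * info(rightCounts);
    }
}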
In Hadoop, the JobSelect has to select the member ATTRIBUTE and the member VALUE using the entropy criterion (Figure 8). A training data record consists of multiple values and a category. In the mapperSelect tasks, a record is decomposed into AVC_Writable tuples <attribute, value, category>, where the attribute indicates the position of the value in the record and the category indicates the class of the record. For example, a record <3, 5, 7, 9, -1> outputs the tuples <1, 3, -1>, <2, 5, -1>, <3, 7, -1>, <4, 9, -1>. These AVC_Writable tuples become the keys of the intermediate key-value pairs. The values are VC_Writable tuples, i.e., the <value, category> part of the AVC_Writable tuples.

Hadoop's method setPartitionerClass() decides to which reducer a key-value pair should be sent, using the attribute in the AVC_Writable tuple. Hadoop's method setOutputValueGroupingComparator() groups the values (VC_Writable tuples) of the same attribute in the ReducerSelect tasks. By default, the values are sorted based on the keys, so determining the best split value for an attribute requires one pass through the VC_Writable tuples using formula (2).
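One plausible shape for the AVC_Writable key type is sketched below; the paper names the class but does not list its code, so the fields and ordering here are assumptions (attribute first, then value, so that a ReducerSelect task can stream each attribute's values in sorted order):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical <attribute, value, category> tuple used as an intermediate key.
public class AVC_Writable implements WritableComparable<AVC_Writable> {
    private int attribute;
    private float value;
    private int category;

    public void write(DataOutput out) throws IOException {
        out.writeInt(attribute);
        out.writeFloat(value);
        out.writeInt(category);
    }

    public void readFields(DataInput in) throws IOException {
        attribute = in.readInt();
        value = in.readFloat();
        category = in.readInt();
    }

    public int compareTo(AVC_Writable o) {
        if (attribute != o.attribute) return attribute < o.attribute ? -1 : 1;
        return Float.compare(value, o.value);
    }
}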
Figure 8 The data flow diagram of the selection step of the decision tree in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → mapperSelect 1..m → IntermediateData ::= <AVC_Writable, VC_Writable> (JobSelect) → ReducerSelect 1..r → SelectData ::= <EntropyWritable, AV_Writable>.

After that, the best split values for each attribute are calculated by the ReducerSelect tasks, and the key-value pair <EntropyWritable, AV_Writable> is output. The minimum expected entropy is selected. Let A0 and V0 be the best attribute-value pair; D1 is the subset of D satisfying A0 <= V0, and D2 is the subset of D satisfying A0 > V0.

Before the split, a node is created in JPA corresponding to D, with the best attribute A0 and the best splitting value V0.

JobSplit contains only mapperSplit tasks, which do not use the method output.collect (Figure 9); instead, each task writes directly to two sequence files in HDFS. This decreases unwanted I/O overhead. The SourceDataSet is D, the ChildDataLeft is D1 and the ChildDataRight is D2; all of these are in fact directories in HDFS, which contain the outputs of each mapperSplit task.

Figure 9 The data flow diagram of the split step of the decision tree in Hadoop: SourceDataSet ::= <LongWritable, FloatArrayWritable> → mapperSplit 1..m (JobSplit) → ChildDataLeft and ChildDataRight.

The above processing handles just one node. To build a whole decision tree, the processing is repeated recursively. The complete decision tree Hadoop algorithm is shown in Figure 10.

1.  nodeD.Path ← D; JPA.persist(nodeD);
2.  BuildTree(nodeD);
FUNCTION BuildTree (Path D)
3.  IF (all categories in D are the same C) THEN
    JPA.begin(); set nodeD as a leaf; set nodeD.category to C; JPA.commit(); RETURN nodeD;
4.  IF (all attributes have been checked) THEN
    JPA.begin(); nodeD.Leaf ← true; nodeD.category ← max(C){C∈D}; JPA.commit(); RETURN nodeD;
5.  JPA.begin(); nodeD.attribute, nodeD.value ← JobSelect(D); JPA.commit();
6.  JPA.begin(); D1, D2 ← JobSplit(D); JPA.commit();
7.  nodeLeft ← BuildTree(D1);
8.  nodeRight ← BuildTree(D2);
9.  JPA.persist(nodeLeft); JPA.persist(nodeRight);
10. JPA.begin(); nodeD.ChildLeft ← nodeLeft; nodeD.ChildRight ← nodeRight; JPA.commit();
11. RETURN nodeD;

Figure 10 Algorithm for building the decision tree on Hadoop.

The JPA.persist() function in Figure 10 means a new node is persisted by JPA. Assignments enclosed by JPA.begin() and JPA.commit() are monitored by JPA. Any node in the cluster shares the objects persisted by JPA.

III. EXPERIMENTAL SETUP

We evaluated several synthetic datasets and an open dataset.

A. Computing Environment
Experiments were conducted using Linux CentOS 5.0, Hadoop 0.19.1, JDK 1.6, a 1 Gbps switched network, and 4 ASUS RS100-X5 servers (Table 6).

Table 6 Devices for Hadoop computing.
Environment   Parameter
Model name    ASUS RS100-X5
CPU           Dual Core 2 GHz
Cache size    1024 KB
Memory        3.4 GB
NIC           1 Gbps
Hard disk     1 TB
B. Inverted indexing vs MapFile Indexing
Two different-sized experiments are shown in Table 7. A (1,000,000 × 200) synthetic data set was generated at random for Test1, and a (102,294 × 117) data set was taken from KDD Cup 2008 [14] for Test2. The procedure followed the design in Section II.A.2. In "Index Method", M means MapFile and I means inverted index. In both tests, the latter takes fewer seconds for searching than the former. The SearcherTime is the average for a single query.

Table 7 Comparison of MapFile (M) and inverted indexing (I).
Item                     Test1                 Test2
FloatData (MB)           1,614                 204
DiscretizationTime (s)   180                   59
EncoderTime (s)          482                   42
IndexerTime (s)          M: 5,835   I: 8,473   M: 307    I: 397
SearcherTime (s)         M: 8.25    I: 1.07    M: 5.07   I: 0.68

For a single query, the searching time of the inverted indexing shows a great advantage.

C. Data mining framework
The comparison of the JDBC and JPA decision tree implementations uses the same synthetic dataset (Table 8). The times listed are cumulative over all procedures. To limit the size of the tree, the depth of any tree is no more than 20.

Table 8 Comparison of JDBC and JPA on MySQL Cluster.
Item                        JDBC                JPA
Size of synthetic data      5,897,367,552 bytes (5,000,000 records with 117 fields)
Number of nodes             501,723 nodes
Accumulated inserting time  2,695.307 seconds   2,981.279 seconds
Time for getChild           257.93 seconds      324.51 seconds

The times in Table 8 can be decreased dramatically if robustness is disregarded; i.e., if the MyISAM engine is used on a single node instead of the NDBCLUSTER engine on 4 nodes, database performance improves by approximately a factor of 25.

IV. CONCLUSION

A feasible data indexing algorithm is proposed for high-dimensional, large-scale data sets. It consists of YSCORE binning, Hadoop processing and improved intersectioning. Experimental results show that the algorithm is technically feasible.

An efficient data mining framework is proposed for Hadoop; it employs JPA and MySQL Cluster to resolve the problems of globality, random writes and duration in Hadoop. The distributed computing in Hadoop and the centralized data collection using JPA are combined organically.

We will consider replacing MapFile with our inverted indexing for HBase in future work, in an effort to enhance the performance of HBase in larger-scale data mining. A convenient tool for the data mining framework will also be a focus of future efforts.

ACKNOWLEDGEMENTS

This work is supported by the National Basic Research Priorities Programme (No. 2007CB311004), the National Science Foundation of China (No. 60775035, 60903141, 60933004, 60970088), and the National Science and Technology Support Plan (No. 2006BAC08B06).
REFERENCES

[1] C. Bohm, S. Berchtold, H. P. Kriegel, and U. Michel, "Multidimensional index structures in relational databases," in 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK 99), Florence, Italy, 1999, pp. 51-70.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in 6th Symposium on Operating Systems Design and Implementation (OSDI 04), San Francisco, CA, 2004, pp. 137-149.
[3] R. M. C. McCreadie, C. Macdonald, and I. Ounis, On Single-Pass Indexing with MapReduce. New York: Association for Computing Machinery, 2009.
[4] R. Lämmel, "Google's MapReduce programming model - Revisited," Science of Computer Programming, vol. 70, pp. 1-30, Jan 2008.
[5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, "Scaling Up Classifiers to Cloud Computers," in IEEE International Conference on Data Mining, Pisa, Italy, 2008.
[6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," 2006, http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition. San Francisco: Morgan Kaufmann, 2006.
[8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: An Index Structure for High-Dimensional Data," Readings in Multimedia Computing and Networking, 2001.
[9] R. Biswas and E. Ort, "Java Persistence API - A Simpler Programming Model for Entity Persistence," http://java.sun.com/developer/technicalArticles/J2EE/jpa/, 2009.
[10] S. Hinz, P. DuBois, J. Stephens, M. M. Brown, and A. Bedford, "MySQL Cluster," http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overview.html, 2009.
[11] Wikipedia, "Central limit theorem," 2009, http://en.wikipedia.org/wiki/Central_limit_theorem.
[12] Apache Software Foundation, "MapFile (Hadoop 0.19.0 API)," 2008, http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html.
[13] J. Gray, "HBase: Bigtable-like structured storage for Hadoop HDFS," 2008, http://wiki.apache.org/hadoop/Hbase.
[14] S. M. S. USA, "KDD Cup data," 2008, http://www.kddcup2008.com/KDDsite/Data.htm.