• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
An efficient data mining framework on hadoop using java persistence api

An efficient data mining framework on hadoop using java persistence api






Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    An efficient data mining framework on hadoop using java persistence api An efficient data mining framework on hadoop using java persistence api Document Transcript

    • 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010) An Efficient Data Mining Framework on Hadoop using Java Persistence API Yang Lai Shi ZhongZhi The Key Laboratory of Intelligent Information Process- The Key Laboratory of Intelligent Information Process- ing, Institute of Computing Technology, Chinese ing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China. Academy of Sciences, Beijing, 100190, China. Graduate University of Chinese Academy of Sciences, e-mail: shizz@ics.ict.ac.cn Beijing 100039, China. e-mail: yanglai@ics.ict.ac.cn Abstract—Data indexing is common in data mining when working X-tree [8], may be suitable for indexing high-dimensional with high-dimensional, large-scale data sets. Hadoop, a cloud data sets. computing project using the MapReduce framework in Java, has However, the following features in the MapReduce become of significant interest in distributed data mining. A feasi- framework are unsuitable for data mining. First, in globality, ble distributed data indexing algorithm is proposed for Hadoop map tasks are irrelevant to each other, as are reducing tasks. data mining, based on ZSCORE binning and inverted indexing and Data mining requires that all of the training data be con- on the Hadoop SequenceFile format. A data mining framework on verted into a global model, such as a decision tree or clus- Hadoop using the Java Persistence API (JPA) and MySQL Cluster tering tree. The tasks in the MapReduce framework only is proposed. The framework is elaborated in the implementation of handle its partition of the entire data set and output its results a decision tree algorithm on Hadoop. We compare the data index- ing algorithm with Hadoop MapFile indexing, which performs a into the Hadoop distributed file system (HDFS). Second, binary search, in a modest cloud environment. The results show random-write operations are disallowed by the HDFS, thus the algorithm is more efficient than naïve MapFile indexing. We disabling link-based data models in Hadoop, such as compare the JDBC and JPA implementations of the data mining linked-lists, trees, and graphs. Finally, the duration of both framework. The performance shows the framework is efficient for map and reduce tasks are based on scanning processing, and data mining on Hadoop. will end when the partitioning of the training dataset is fin- ished. Data mining requires a persistent model for following Keywords—Data Mining, Distributed applications, JPA, testing processing. ORM, Distributed file systems, Cloud computing A database is an ideal persistent repository for objects generated by data mining using Hadoop tasks. To mine I. INTRODUCTION high-dimensional and large-scale data on Hadoop, we em- Many approaches have been proposed for handling ploy Object-Relation Mapping (ORM), which stores objects high-dimensional and large-scale data, in which query proc- whose size may surpass memory limits in a relational data- essing is the bottleneck. “Algorithms for knowledge discov- base. The Java Persistence API (JPA) provides a persistence ery tasks are often based on range searches or nearest model for ORM [9]. It can store all the objects generated neighbor search in multidimensional feature spaces” [1]. from data mining models in a relational database. A distrib- Business intelligence and data warehouses can hold a uted database is a suitable solution to ensure robustness in Terabyte or more of data. Cloud computing has emerged for distributed handling. MySQL Cluster is designed to with- the subsequently increasing demands of data mining. Ma- stand any single point of failure [10], which is consistent pReduce is a programming framework and an associated with Hadoop. implementation designed for large data sets. The details of We performed the same work that McCreadie’s per- partitioning, scheduling, failure handling and communica- formed [3] and now propose a novel indexing Hadoop im- tion are hidden by MapReduce. Users simply define map plementation for continuous values. Using JPA and MySQL functions to create intermediate <key, value> tuples, and Cluster on Hadoop, we propose an efficient data mining then reduce functions to merge the tuples for special proc- framework, which is elaborated by a decision tree imple- essing [2]. mentation. The distributed computing in Hadoop and cen- A concise indexing Hadoop implementation of MapRe- tralized data collection using JPA are combined together duce is presented in McCreadie’s work [3]. Ralf proposes a organically. We also employ the naïve JDBC implementa- basic program skeleton to underlie MapReduce computa- tion in comparison with our JPA implementation on Ha- tions [4]. Moretti presents an abstraction for scalable data doop. mining, in which data and computation are distributed in a The rest of the paper is organized as follows. In Section computing cloud with minimal user efforts [5]. Gillick uses 2 we explain the index structures, flowchart and algorithms, Hadoop to implement query-based learning [6]. and propose the data mining framework. Section 3 provides Most data mining algorithms are based on object descriptions of our experimental setting and results. In Sec- -oriented programming, which runs in memory. Han elabo- tion 4, we offer conclusions and suggest possible directions rates many of these methods [7]. A link-based structure, like for future work.978-0-7695-4108-2/10 $26.00 © 2010 IEEE 203DOI 10.1109/CIT.2010.71
    • II. METHOD for searching; it then outputs a list of the number of records. The search algorithm is shown in Figure 5.A. Naïve Data Indexing In intersection, the positions of each integer in Naïve data indexing is necessary for querying data in Val2NO.DAT for a particular query will be extracted fromclassification or clustering algorithms. Data indexing can Col2Val.DAT , and the lists of LineNo’s for each integerdramatically decrease the complexity of querying. As in can be found. The result will be computed by the intersec-McCreadie’s work [3], we create an inverted index for con- tion of the lists. The algorithm is omitted.tinuous values. 1) Definition Table 2 Structure of IntData.DAT in BNF. A (n × m) dataset is an (n × m) matrix containing n rows IntData.DAT ::= <Record>of data, which contain m columns of float numbers. Let <Record> ::= <Bins>min(i), max(i), sum(i), avg(i), std(i) , cnt(i) (i∈[1..m]) equal <Bins> ::= Integer(32bits binary)the minimum, maximum, sum, average, standard deviationand count of the i-column, respectively—i.e., the essential Table 3 Structure of Col2Val.DAT in BNF.statistical measures of the i-column of the dataset. Let Col2Val.DAT ::= <Field>sum2(i) be the dot product of the i-column, which is used for <Field> ::= <Bins, Address, LengthLineNo> x-avg() <Bins> ::= Integer(32bits binary)std(i). From the formula ZSCORE(x)= std() [7], the for- <Address> ::= Integer(64bits binary)mula YSCORE(x)=round⎛ (x - avg())×iStep⎞ ⎝ std() ⎠, a new measure <LengthLineNo> ::= Integer(64bits binary)is defined. The iStep is an experimental integral parameter. Table 4 Structure of Val2NO.DAT in BNF. 2) Simple Design of the Inverted Index Val2NO.DAT ::= <LineNo> In general, high-dimensional datasets are always stored <LineNo> ::= Integer(64bits binary)as tables in a sophisticated DBMS for most data miningtasks. Table 1 shows the format of a dataset in DBMS. Fordata mining, the CSV format is common data interchange FloatData.TXT Query Resultfile format. Table 1 A normal data table. RecordsNO Field01 Field02 … Rec01 Value02 Value01 … Discretization InterSection Rec02 Value05 Value01 … Rec03 Value02 Value04 … … … … … IntData.TXT List of No Figure 1 shows the flowchart in which the whole proce-dure is illustrated for indexing and similarity querying: Discretization calculates the essential statistical measures(Figure 2), then performs YSCORE binning on the Float- Encoder SearcherData.TXT file (CSV format) and outputs the IntData.TXTfile in which each line is labeled with a unique number(LineNo) consecutively. IntData.DAT The encoder transforms the text format file IntData.TXT IntData.SEQinto the Hadoop Sequence File format IntData.SEQ and ge-nerates IntData.DAT. To locate a record for a query rapidly,the length of the record has to be equivalent. The Int-Data.DAT contains an (n × m) matrix, i.e., n rows of records Indexer Col2Val.DATcontaining m columns of IDs of bins, as does IntData.SEQ.The structure of IntData.DAT is shown in Table 2. With the indexer, the index files Val2NO.DAT andCol2Val.DAT are outputted with MapReduce from the Int- Val2No.DATData.SEQ file. To obtain the list of LineNo’s for a specific Figure 1 Flowchart for searching the Hadoop Indexbin, it is needed to store the list in a structural file(Val2NO.DAT), shown in Table 4; the address and the 3) Statistics phaselength of the list should be stored in another structural file In Hadoop, data sets are scanned once to calculate the(Col2Val.DAT), shown in Table 3. The two structural files four distributive measures, min(i), max(i), sum(i) and sum2(i)are easy to access by offset. as well as cnt(i) for the i-column (Figure 2); it then com- The searcher, given a query—which should be a vector putes the algebraic measures of average and deviation.with m float values— performs YSCORE binning with the Therefore, a (5×m) array is used for the keys in the MapRe-same statistical measures and yields a vector with m integers duce process: the first row contains the keys of the minimum 204
    • of a column; the second row contains the keys for the e. ELSEmaximum; the third contains the keys for the sum; the fourth 1) Cnt ← value;contains the keys for the square of the sum; and the fifth 14. avg[] ←0; std[] ←0;contains the keys for the row count. 15. FOR i = 1 THRU LenFieldsFUNCTION map (key, values, output) a. avg(i) ← faStat[2][i] / cnt;1. f[1..m]← parse the values into float array; b. std(i)←sqrt ((faStat[3][i]-faStat[2][i]×faStat[2][i]/cnt)2. iW ←f.length; /(cnt-1));3. FOR i = 1 THRU m 16. RETURN min(i),max(i),sum(i),avg(i),std(i),cnt; a. IF min > f[i] THEN END OF ALGORITHM STATISTICS 1) min ← f[i]; Figure 2: Algorithm Statistics for continuous values in Hadoop 2) output.collect(iW*0+i, min); b. IF max < f[i] THEN The algorithm performs O(N/mapper), in which N 1) max ← f[i]; represents the number of records in the data sets, and map- 2) output.collect(iW*1+i, max); per is the number of the task of map functions. After step 4, c. output.collect(iW*2+i, sum); all partitions will have been calculated into the five distribu- tive measures. The reduce function merges them into global d. output.collect(iW*3+i, sum2); measures. Keys for the map function are the positions of the4. output.collect(iW*4, 1); next line, and values are the next line of text.FUNCTION reduce (key, values, output) After step 11 in Figure 2, Hadoop will create some result5. iCategory ← (int) key.get() / LenFields files named “part-xxxxx”, which contain the five distributive6. SWITCH (iCategory) measures for each field; the number of these result files de-7. CASE 0 : /** min()*/ pends on the number of reducers. From step 15, the alge- a. WHILE (values.hasNext()) braic measures avg() and std() can be calculated from the 1) fTmp ← values.next(); five distributive measures. 2) IF (min > fTmp) THEN min ← fTmp; 4) YSCORE phase b. output.collect(key, min); Figure 3 shows YSCORE binning of 1,000,000 values c. BREAK; from a normal distribution with a mean of 0 and standard8. CASE 1 : /** max() */ deviation of 1. If the iStep of 1 is used, the upper left figure a. WHILE (values.hasNext()){ shows exactly the Three-Sigma Rule, there are only 9 bins 1) fTmp ← values.next(); and the maximum number in the bins is near 400,000. When the iStep is changed to 8, the upper right shows that the 2) IF (max < fTmp) THEN max ← fTmp; number of bins increases to 36, and the figure fits the normal b. output.collect(key, max); distribution curve better. The important thing is that the c. BREAK; maximum in the bins reduces to about 50,000 at the iStep of9. CASE 2 : /** sum() */ 8. When the iStep is set to 16 or 32, the maximum reduces to a. WHILE (values.hasNext()) 25,000 or 10,000. The lowered numbers administer to in- 1) sum ← sum + values.next(); verted indexing. 2) output.collect(key, sum); 5 4 x 10 x 10 b. BREAK;10.CASE 3 : /** sum2() */ 3 4 a. WHILE (values.hasNext()) Counting Counting 3 1) sum2← sum2+ values.next(); 2 2 2) output.collect(key, sum2); 1 1 b. BREAK; 0 011.CASE 4 : /** cnt() */ −4 −3 −2 −1 0 1 2 3 4 −20 0 20 a. WHILE (values.hasNext()) 4 Yscore bins, iStep=1 Yscore bins, iStep=8 x 10 1) cnt ← cnt + values.next(); 2) output.collect(key, cnt); 2 10000 b. BREAK; Counting Counting 1.5FUNCTION statistics (input) 1 500012.faStat[][]←0; key←0; value←0; cnt←0; 0.513.WHILE input.hasNext() 0 0 a. <key,value> ← Parse the line of input.Next(); −50 0 50 −100 0 100 Yscore bins, iStep=16 Yscore bins, iStep=32 b. iCategory← key / LenFields; c. iField← key MOD LenFields; Figure 3 YSCORE binning of 1,000,000 values from a normal distribution with mean 0 and standard deviation 1. The above figures show YSCORE d. IF (iCategory == LenFields * 4) THEN binning using an iStep of 1, 8, 16, and 32, respectively. 1) faStat[iCategory][ iField]← value; 205
    • According to [11], “the central limit theorem (CLT) reduce task in Hadoop will generate its own result; therefore,states conditions under which the mean of a sufficiently each attribute has its own index file.large number of independent random variables, each withfinite mean and variance, will be approximately normally SourceDataSet::=distributed.” Even if the distribution of a field does not <LongWritable, FloatArrayWritable> JobDiscretfollow normal distribution, in large-scale data sets the valuesof the field can be put in bins using its mean and deviationwhen the size approaches a Terabyte. mapperDiscrete mapperDiscrete These interval values are put into bins using YSCORE 1 ... mbinning [7]. Then, the inverted index will be of the format inTable 5, following Table 1. The first column holds all fields DiscretedData::=in the original dataset; the second holds the <bin, RecNoAr- <LongWritable, IntArrayWritable>ray> tuples. Table 5. The structure of inverted index for interval values. Fields Inverted Index mapperIndex mapperIndex 1 ... mField01 <Bin02,RecNoArray>,<Bin05, RecNoArray >…Field02 <Bin01,RecNoArray>,<Bin04, RecNoArray >… IntermediateData::= JobIndex… … <LongWritable, LongWritable> The YSCORE binning phase does not need a reducefunction, thus eliminating the cost of sorting and transferring; ReducerIndex ReducerIndexthe result files are precisely the partitions on which map 1 ... rfunctions work. Keys in the map function are the positionsof the next line, and values are the next line of text. The in- IndexData::=put file will be FloatDate.TXT, and the output file will be <IntWritable, LongArrayWritable>IntData.TXT, as shown in Figure 3. The cumulative distribution function values for the stan-dard normal distribution will be 50% at 0, 60% at 0.253, Figure 4 The Data Flow Diagram for improved Indexing of continuous values in Hadoop70% at 0.524, 80% at 0.842, and 90% at 1.282. If the distri-bution is approximately close to the standard normal distri-bution, YSCORE determines a balanced binning from its C. Hadoop MapFile IndexingZSCORE value. The MapFile format is a directory consisting of two files, After the statistics’ phase and the YSCORE binning a data file and an index file. Both files are, in fact, two filesphase, Hadoop can handle discrete data as in the com- of the SequenceFile type. The data file is a normal Se-monly-used MapReduce word count example. quenceFile; the index file consists of a few keys at intervalsB. Improved Data Indexing and their corresponding addresses in the data file. To query a Naïve data indexing has a flaw in that it uses only one special key for its value, a binary search will be performedreducer to keep the global id for the keys. This causes a per- in the index file and a sequential search will be performed informance problem that is not scalable. A tractable <Long- the data file within the range defined by the correspondingWritable, LongWritable> pair is employed to transform the addresses. [12] The algorithm for searching by MapFile isthree variables of the global id, the attribute and the id of bin shown in Figure 5.into the intermediate key-value pairs in the JobIndex, FUNCTION SearchMapFile (float[] faQuery)(Figure 4). In general, the number of attributes and the 17. laResult ← null; laTmp ← null; isFirst ← TRUEnumber of bins in an attribute will not be more than 232. A 18. FOR iField =1 THRU faQuery.lengthLongWritable variable is 64 bits, so an attribute can be put a. iValue ←YSCORE(faQuery[i]);in the upper 32 bits in a LongWritable, and the number of b. Indexer ←new MapFile.Reader(index file of iField);bins can be put in the lower 32 bits. These attribute-bin c. Indexer.get(iValue, laTmp);LongWritables are used as keys for the sorting phase in Ha- d. IF (isFirst) THENdoop. 1) laResult ← laTmp; isFirst ← FALSE The number of the ReducerIndex task is set equal to the e. ELSEnumber of attributes. The method setPartitionerClass() in 1) laResult ← getIntersection(laResult, laTmp);Hadoop decides to which ReducerIndex task an intermediate 19. RETURN laResultkey-value pair should be sent, using the upper 32 bits in the END OF ALGORITHM SearchMapFilekeys. In the ReducerIndex task, the attribute-bin LongWri- Figure 5 Algorithm of searching by MapFile.tables will be grouped by the number of bins in the lower 32bits in the keys, and then the inverted index HBase, a Bigtable-like storage for the Hadoop Distrib-<Bin,RecNoArray> (Table 5) can be outputted easily. Each uted Computing Environment, is based on MapFile [13]. It does not perform well for reasons of complexity in MapFile. 206
    • D. Data mining framework on Hadoop The following codes are for returning all the children of Although this improved data indexing performs better a node in a binary tree (Figure 7).than does MapFile indexing, the overhead for range query-ing by intersection algorithms is still too high. To apply 1. CREATE PROCEDURE getChild (IN rootid INT)common object-oriented data mining algorithms in Hadoop, 2. BEGINthe three problems concerning globalization, random-write a. DROP TABLE IF EXISTS chd;and duration, posed in the introduction, should be handled. b. CREATE TEMPORARY TABLE chd (id INTFor this, we propose a data mining framework on Hadoop NOT NULL DEFAULT -1)ENGINE = MEMORY;using JPA. c. DROP TABLE IF EXISTS par; 1) MySQL Cluster d. CREATE TEMPORARY TABLE par (id INT After many failures, we found that, for data mining with NOT NULL DEFAULT -1)ENGINE = MEMORY;Hadoop, a database was an ideal persistent repository for e. DROP TABLE IF EXISTS result;handling the three problems mentioned. To ensure comput- f. CREATE TABLE result (id INT NOT NULLing robustness, a distributed database is the suitable solution DEFAULT -1);for distributed high-dimensional and large-scale data han- g. INSERT chd SELECT A.ID FROM btree Adling. MySQL Cluster is designed to eliminate any single WHERE rootid=A.PID;point of failure [10] and is based on open source code; both h. WHILE ROW_COUNT()>0 DOcharacteristics make it consistent with data mining with Ha- 1) TRUNCATE par;doop. 2) INSERT par SELECT * FROM chd; 2) ORM 3) INSERT result SELECT * FROM chd; The JPA and naïve JDBC are used to implement the 4) TRUNCATE chd;ORM for objects of data mining in a database (Figure 6). To 5) INSERT chd SELECT A.ID FROM btree A, parimprove the data indexing method, a KD-tree or X-tree maybe suitable. However, the decision tree is considered a clas- B WHERE B.ID=A.PID;sical algorithm, and better elaborates the framework. i. END WHILE; 3. END; Figure 7 Stored procedure for all children of a node in MySQL TrainingData TestingData 4) Decision tree implementation As shown in the textbook [7], a decision tree for con- hadoop tinuous values can build a binary tree. Let D be a dataset. The entropy of D is given by Data Mining m Info ( D ) = −∑ pi × log 2 ( pi ) (1) i =1 JPA JDBC Where m is the number of categories and p is the prob- ability of a category. Let A be an attribute in D. The entropy of D based on the partitioning of A is given by ndbclusterFigure 6 The Data Flow Diagram of Data Mining Framework on Hadoop D1 D2 (2) Info A ( D) = × Info( D1 ) + × Info( D2 ) D D In the data mining framework, a large-scale dataset isscanned by Hadoop. The necessary information is extracted Where entropy Info() is from formula (1). D1 is the partusing classical data mining modules, which will be embed- of D, which belongs to the left child node, and the D2 is theded in the Hadoop map or reduce tasks. The extracted ob- part of D, which belongs to the right child node.jects, such as lists, trees, and graphs, will be sent automati- Each node in the tree has to contain two members, ancally into a database—in this case, MySQL Cluster. As a test ATTRIBUTE and a VALUE. Given a test record (an arraycase, the test dataset will be scanned and handled by each of continuous values), the former tells which field should bedata mining task. checked, and the latter tells which child node should be se- 3) Stored procedure lected. If the test record is less than or equal to the member After objects obtained from data mining are stored into a VALUE in the member ATTRIBUTE, then the left childdatabase, all the details can be finished using a stored pro- node is selected, otherwise the right child node is selected.cedure, such as finding a path between two nodes, returning In Hadoop, the JobSelect has to select the memberall the children of a node and returning the category of a test ATTRIBUTE and the member VALUE using the entropycase. A data mining task should send demands and receive criterion (Figure 8). A training data record should consist ofresults. This decreases network overhead and latency. multiple values and a category. In the mapperSelect tasks, a record will be decomposed into AVC_Writable tuples <at- 207
    • tribute, value, category>, where the attribute indicates the use the method output.collect (Figure 9); instead, it writesposition of the value in the record and the category indicates directly to two sequence files in the HDFS. This decreasesthe class of the record. For example, a record <3, 5, 7, 9, -1> unwanted I/O overheads. The SourceDataSet is D; thewill output the tuples <1, 3, -1>, <2, 5, -1>, <3, 7, -1>, <4, 9, ChildDataLeft is D1; the ChildDataRight is D2; and all of-1>. These AVC_Writable tuples will be the keys in the in- these are in fact directories in HDFS, which contain thetermediate key-value pairs. The values will be VC_Writable outputs from each mapperSplit task.tuples, or the part <value, category> in the AVC_Writable The above processing is just for one node. To build atuples. whole decision tree, the processing needs to be repeated Hadoop’s method setPartitionerClass() decides to which recursively. A complete decision tree Hadoop algorithm isreducer a key-value pair should be sent, using the attribute in shown in Figure 10.the AVC_Writable tuple. Hadoop’s method setOutputVal-ueGroupingComparator() will group the values 1. nodeD.Path ← D, JPA.persist(nodeD);(VC_Writable tuples) by the same attribute in ReducerSelect 2. BuildTree(nodeD);tasks. By default, the values are sorted based on the keys, so FUNCTION BuildTree (Path D)determining the best split value for the attribute requires one 3. IF (all categories in D are the same C) THENpass through the VC_Writable tuples by the formula (2). JPA.begin(); set nodeD as a leaf; set nodeD.category as C; JPA.commit(); RETURN nodeD; 4. IF (all attribute checked) THEN SourceDataSet::= <LongWritable, FloatArrayWritable> JPA.begin();nodeD.Leaf ← true; nodeD.category ← max(C){C∈D}; JPA.commit(); RETURN nodeD; 5. JPA.begin(); nodeD.attribute, nodeD.value ← JobSe- mapperSelect mapperSelect 1 ... m lect(D); JPA.commit(); 6. JPA.begin(); D1, D2 ← JobSplit(D); JPA.commit(), 7. nodeLeft ← BuildTree(D1); IntermediateData::= JobSelect <AVC_Writable, VC_Writable> 8. nodeRight ← BuildTree(D2); 9. JPA.persist(nodeLeft); JPA.persist(nodeRight); ReducerSelect ReducerSelect 10. JPA.begin(); nodeD.ChildLeft ← nodeLeft; 1 ... r nodeD. ChildRight ← nodeRight; JPA.commit(); 11. RETURN nodeD SelectData::= Figure 10 Algorithm for building decision tree on Hadoop <EntropyWritable, AV_Writable> The JPA.persist() function (Figure 10) means a new nodeFigure 8 The Data Flow Diagram of Selection of Decision Tree in Hadoop is persisted by JPA. The assignments enclosed by JPA.begin() and JPA.commit() mean these operations are After the best split, values for each attribute are calcu- monitored by JPA. Any node in the cluster will share objectslated by the ReduceSelect tasks, and the key-value pair <En- persisted by JPA.tropyWritable, AV_Writable> is outputted. The minimum III. EXPERIMENTAL SETUPexpected entropy is selected. Let A0 and V0 be the best at-tribute-value pair. D1 is the subset of D satisfying A0 <=V0, We evaluated several synthetic datasets and an open da-and D2 is the subset of D satisfying A0 > V0. taset. Before the split, a node in JPA will be created corre- A. Computing Environmentsponding to D, with the best attribute A0 and best splittingvalue V0. Experiments were conducted using Linux CentOS 5.0, JobSplit contains only mapperSplit tasks, which do not Hadoop 0.19.1, JDK 1.6, a 1Gbps switch network, and 4 ASUS RS100-X5 servers (Table 10). SourceDataSet::= Table 6 Devices for Hadoop computing. <LongWritable, FloatArrayWritable> Environment Parameter Model name ASUS RS100-X5 mapperSplit mapperSplit CPU MHz Dual Core 2GHz 1 ... m Cache size 1024 KB JobSplit Memory 3.4GB NIC 1Gbps ChildDataLeft ChildDataRight Hard Disk 1TB Figure 9 The Data Flow Diagram of Split of Decision Tree in Hadoop 208
    • B. Inverted indexing vs MapFile Indexing convenient tool for the data mining framework will be a Two different-sized experiments are shown in Table 7. A focus of future efforts.(1,000,000 × 200) synthetic data set was generated at ran- ACKNOWLEDGEMENTSdom for test1. A (102,294 × 117) data set was taken fromKDDCUP 2008[14] for test2. The procedure was followed This work is supported by the National Basic Researchaccording to the design (see II.A.2). In “Index Method,” M Priorities Programme (No. 2007CB311004) and Nationalmeans MapFile, and I means “Inverted Index”. Of both tests, Science Foundation of China (No.60775035, 60903141,the latter takes fewer seconds for searching than the former. 60933004, 60970088), National Science and TechnologyThe SearcherTime is the average for a single query. Support Plan (No.2006BAC08B06). Table 7 Comparison of MapFile and Inverted indexing REFERENCES Item Test1 Test2 [1] C. Bohm, S. Berchtold, H. P. Kriegel, and U. Michel, Index Method M I M I "Multidimensional index structures in relational databases," in FloatData(MB) 1,614 204 1st International Conference on Data Warehousing and DiscretizationTime(s) 180 59 Knowledge Discovery (DaWak 99), Florence, Italy, 1999, pp. EncoderTime(s) 482 42 51-70. [2] J. Dean, S. Ghemawat, and Usenix, "MapReduce: Simplified IndexerTime(s) 5835 8473 307 397 data processing on large clusters," in 6th Symposium on SearcherTime(s) 8.25 1.07 5.07 0.68 Operating Systems Design and Implementation (OSDI 04), For a single query, the searching time of the inverted in- San Francisco, CA, 2004, pp. 137-149.dexing shows great advantage. [3] R. M. C. McCreadie, C. Macdonald, and I. Ounis, On Single-Pass Indexing with MapReduce. New York: AssocC. Data mining framework Computing Machinery, 2009. The comparison of JDBC and JPA as decision trees uses [4] R. Lammel, "Googles MapReduce programming model -the same synthetic dataset (Table 8). The times listed are the Revisited," Science of Computer Programming, vol. 70, pp.cumulative of all procedures. To limit the size of the tree, the 1-30, Jan 2008.depth of any tree is no more than 20. [5] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, Table 8 Comparison of JDBC and JPA on MySQL cluster "Scaling Up Classifiers to Cloud Computers," in IEEE Item JDBC JPA International Conference on Data Mining Pisa, Italy:Size of synthetic 5,897,367,552 bytes http://icdm08.isti.cnr.it/Paper-Submissions/32/accepted-paperdata 5,000,000 records, with 117 fields. s, 2008. [6] D. Gillick, A. Faria, and J. DeNero, "MapReduce: DistributedNumber of 501,723 nodes Computing for Machine Learning," 2006, p.nodes http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262Accumulated 2,695.307 seconds 2,981.279 second a_proj.pdf.inserting time [7] J. Han and M. Kamber, Data Mining: Concepts andTime for get- 257.93 seconds 324.51 seconds Techniques, Second Edition, second ed. San Francisco: Child Morgan Kaufmann, 2006. The time in Table 8 can be decreased dramatically re- [8] S. Berchtold, D. A. Keim, and H. P. Kriegel, "The X-tree: Angardless of robustness; i.e., if the MyISAM engine is used Index Structure for High-Dimensional Data," Readings in Multimedia Computing and Networking, 2001.on a single node instead of the NDBcluster engine on 4 [9] R. Biswas and E. Ort, "Java Persistence API - A Simplernodes, performance in database will be multiply approxi- Programming Model for Entity Persistence,"mately 25 times. http://java.sun.com/developer/technicalArticles/J2EE/jpa/, 2009. IV. CONCLUSION [10] S. Hinz, P. DuBois, J. Stephens, M. M. Brown, and A. A feasible data indexing algorithm is proposed for Bedford, "MySQL Cluster," inhigh-dimensional, large-scale data sets. It consists of http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-overviYSCORE binning, Hadoop processing, and improved inter- ew.html, 2009.sectioning. Experimental results show that the algorithm is [11] Wikipedia, "Central limit theorem," 2009, p.technically feasible. http://en.wikipedia.org/wiki/Central_limit_theorem. An efficient data mining framework is proposed for Ha- [12] A. S. F. ASF, "MapFile (Hadoop 0.19.0 API)," 2008, p.doop; it employs the JPA and MySQL Cluster to resolve http://hadoop.apache.org/core/docs/current/api/org/apache/haproblems of globalization, random-write and duration in doop/io/MapFile.html.Hadoop. The distributed computing in Hadoop and central- [13] J. Gray, "HBase: Bigtable-like structured storage for Hadoopized data collection using JPA are combined together or- HDFS." vol. 2009, 2008, p. http://wiki.apache.org/hadoop/Hbase.ganically. [14] S. M. S. USA, "kddcup data," 2008, p. We will consider replacing MapFile with our inverted http://www.kddcup2008.com/KDDsite/Data.htm.indexing for HBase in future work in an effort to enhancethe performance of HBase in larger-scale data mining. A 209