
PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON CLASSIFICATION ALGORITHM IN STREAM DATA MINING

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April 2013, pp. 459-470, © IAEME (www.iaeme.com/ijcet.asp)

Ms. Madhu S. Shukla*, Dr. K. H. Wandra**, Mr. Kirit R. Rathod***

* PG-CE Student, Department of Computer Engineering, C.U. Shah College of Engineering and Technology, Gujarat, India
** Principal, Department of Computer Engineering, C.U. Shah College of Engineering and Technology, Gujarat, India
*** Assistant Professor, Department of Computer Engineering, C.U. Shah College of Engineering and Technology, Gujarat, India

ABSTRACT

Stream data mining is a newly emerging research topic. Today, many applications generate massive amounts of stream data; examples of such systems include sensor networks, real-time surveillance systems, and telecommunication systems. There is therefore a need for intelligent processing of this type of data, to support proper analysis and the reuse of the data in other tasks. Mining stream data is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Classification in stream data mining is based on generating a decision tree, which makes the decision process easy. Given the characteristics of stream data, it becomes essential to handle large amounts of continuous and changing data accurately. Attribute selection at each non-leaf decision node thus becomes a critical analytic point. Performance parameters such as classification speed, accuracy, and CPU utilization time can all be improved if the split criterion is implemented precisely. This paper presents an implementation of different attribute selection criteria and compares them with an alternative method.

Keywords: Stream, Stream Data Mining, Performance Parameters, MOA (Massive Online Analysis), Split Criterion.
1. INTRODUCTION

The characteristics of stream data also act as challenges. Due to its huge size, its continuous nature, and the speed with which it changes, stream data requires a real-time response produced after analysis of the data. Because the data is huge in size, any algorithm that accesses it is restricted to a single scan of the data.

Data mining makes use of different algorithms for different mining tasks, such as classification, clustering, and pattern recognition. In the same way, stream data mining uses different algorithms for its mining tasks. Algorithms for the classification of stream data include the Hoeffding tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept-adapting Very Fast Decision Tree). These classification algorithms are based on the Hoeffding bound for decision tree generation: the bound is used to gather an optimal amount of data so that classification can be done accurately. CVFDT is additionally able to detect concept drift, which is another challenge in stream data mining. Because stream data is extremely large, a method is required for improving the split criterion at the nodes of the decision tree, so that tree generation is faster, accuracy is improved, and CPU utilization time is reduced. Two different split criteria are evaluated for stream data classification in this paper, and an improvement to the algorithm based on them is made as part of this research work.

As said earlier, stream data is huge in size, so in order to perform analysis we need to take samples of the data so that processing can be done with ease. The samples should be chosen so that whatever data falls in the sample is worth analyzing or processing, meaning that maximum knowledge is extracted from the sampled data. The sampling technique used in this paper is an adaptive sliding window in a Hoeffding-bound-based tree algorithm.
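For reference, the Hoeffding bound mentioned above is commonly stated as follows (this is the standard formulation used by Hoeffding tree learners, not reproduced from this paper): if a random variable with range $R$ is observed $n$ times, then with probability $1 - \delta$ its true mean is within $\epsilon$ of the observed mean, where

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}.$$

A Hoeffding tree splits a leaf once the difference in the split criterion between the best and second-best attribute exceeds $\epsilon$, which guarantees with confidence $1 - \delta$ that the attribute chosen on the sample is the same one that would be chosen on infinite data.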
2. RELATED WORK

Implementing an algorithm for stream data classification demands improvements in resource utilization as well as in accuracy as the classification process proceeds. Here we consider an improvement to an algorithm based on concept drift detection performed while classifying the data. Drift detection is done using a windowing technique.

Sliding window: This technique performs detailed analysis over the most recent data items and over summarized versions of the older ones. The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data; detailed analysis is therefore done over the most recent items, while older items are kept in summarized form. This idea has been adopted by many techniques in comprehensive data stream mining systems.

3. CLASSIFICATION PROCESS

Many data mining algorithms exist in practice. They can be categorized into three types:
1. Classification
2. Clustering
3. Association

A standard classification system normally has three phases:
1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabelled data.
The three phases are carried out in sequence; see Figure 3.1.

Fig 3.1: Phases of standard classification systems

3.1. STREAM DATA MINING

Ordinary classification is usually considered in three phases. In the first phase, a model is built using data, called the training data, for which the property of interest (the class) is already known (labeled data). In the second phase, the model is used to predict the class of data (test data) for which the property of interest is known but which the model has not previously seen. In the third phase, the model is deployed and used to predict the property of interest for unlabelled data.

In stream classification there is only a single stream of data, with labeled and unlabelled records occurring together in the stream. The training/test and deployment phases therefore interleave. Stream classification of unlabelled records could be required from the beginning of the stream, after some sufficiently long initial sequence of labeled records, at specific moments in time, or for a specific block of records selected by an external analyst.
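This interleaving is often realized as prequential ("interleaved test-then-train") evaluation, which is also the scheme MOA uses for the experiments reported later in this paper. A minimal Python sketch, with a hypothetical incremental classifier interface (the predict/learn method names are assumptions, not from the paper):

```python
def prequential_evaluation(stream, classifier):
    """Interleaved test-then-train over a stream of (x, y) records.

    Labeled records (y is not None) are first used to test the current
    model and then immediately used to train it; unlabelled records are
    simply classified, as in the deployment phase.
    """
    correct = total = 0
    for x, y in stream:
        if y is not None:
            total += 1
            if classifier.predict(x) == y:   # test first ...
                correct += 1
            classifier.learn(x, y)           # ... then train on the same record
        else:
            classifier.predict(x)            # deployment-style prediction only
    return correct / total if total else 0.0
```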
4. ATTRIBUTE SELECTION CRITERION IN DECISION TREE

Selection of an appropriate splitting criterion helps improve the performance measurement dimensions. In data stream mining there are three main performance measurement dimensions:

- Accuracy
- The amount of space necessary, i.e. computer memory (model cost, or RAM-hours)
- The time required to learn from training examples and to predict (evaluation time)

These properties may be interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less, and thus having less data to process. The more time an algorithm has, the more likely it is that accuracy can be increased.

There are two major attribute selection criteria: information gain and the Gini index; the latter is also known as the binary split criterion. During the late 1970s and 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) [1]. ID3 uses information gain for attribute selection. Information gain Gain(A) is given as

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D).$$

We have developed a new algorithm to calculate information gain, and methodology-wise this algorithm is promising. We divided it into two parts: the first part calculates Info(D) and the second part calculates Gain(A).

4.1. Information Gain Calculation

Information gain = (information before split) - (information after split).

A common way to measure impurity is entropy:

$$\mathrm{Entropy} = -\sum_i p_i \log_2 p_i$$

where $p_i$ is the probability of class $i$, computed as the proportion of class $i$ in the set. Entropy comes from information theory: the higher the entropy, the higher the information content. For continuous attributes, candidate split values are taken as midpoints of adjacent values, $(a_i + a_{i+1})/2$.

Information gain is then entropy(parent) minus the weighted average entropy of the children. Worked example (reconstructed from Figure 4.1): the entire population contains 30 instances with a 14/16 class split, divided into two children of 17 and 13 instances.

- Parent entropy: $-\frac{14}{30}\log_2\frac{14}{30} - \frac{16}{30}\log_2\frac{16}{30} = 0.996$
- Child 1 (17 instances, 13/4 class split): $-\frac{13}{17}\log_2\frac{13}{17} - \frac{4}{17}\log_2\frac{4}{17} = 0.787$
- Child 2 (13 instances, 1/12 class split): $-\frac{1}{13}\log_2\frac{1}{13} - \frac{12}{13}\log_2\frac{12}{13} = 0.391$
- Weighted average entropy of the children: $\frac{17}{30}\cdot 0.787 + \frac{13}{30}\cdot 0.391 = 0.615$
- Information gain: $0.996 - 0.615 = 0.38$

Figure 4.1: Worked example of the information gain calculation

4.2. Calculating Gini Index

If a data set $T$ contains examples from $n$ classes, the Gini index $gini(T)$ is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^{2}$$

where $p_j$ is the relative frequency of class $j$ in $T$. $gini(T)$ is minimized if the classes in $T$ are skewed.

After splitting $T$ into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$, the Gini index of the split data is defined as

$$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2).$$

The attribute providing the smallest $gini_{split}(T)$ is chosen to split the node.
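To make the two criteria concrete, the following is a minimal, self-contained Python sketch (not the paper's implementation) that computes entropy, information gain, and the Gini split index from per-class counts; it reproduces the worked example of Section 4.1.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children_counts):
    """Size-weighted Gini index of a candidate split; smaller is better."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

# Worked example from Section 4.1: 30 instances (14/16 class split),
# divided into children with class counts (13, 4) and (1, 12).
print(information_gain([14, 16], [[13, 4], [1, 12]]))  # ~0.38
print(gini_split([[13, 4], [1, 12]]))                  # Gini of the same split
```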
5. METHODOLOGY AND PROPOSED ALGORITHM

CVFDT (Concept-adapting Very Fast Decision Tree) is an extended version of VFDT that provides the same speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process. Systems built around CVFDT use a sliding window over the dataset to keep the model consistent. Most systems need to learn a new model from scratch after new data arrives; CVFDT instead continuously monitors the quality of new data and adjusts those parts of the model that are no longer correct. Whenever new data arrives, CVFDT increments counts for the new data and decrements counts for the oldest data in the window. If the concept is stationary, there is no statistical effect. If the concept is changing, however, some splits that previously appeared best will no longer appear best, because the new data gives more gain to a different attribute. Whenever this occurs, CVFDT starts growing an alternative subtree with the new best attribute at its root. When an alternative subtree becomes more accurate on new data than the old subtree, the new tree replaces the old one.

5.1 CVFDT ALGORITHM (based on the Hoeffding tree)

1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5. Add (x, y) to the sliding window of examples.
6. Remove and forget the effect of the oldest examples if the sliding window overflows.
7. CVFDTGrow.
8. Check split validity if f examples have been seen since the last check of alternate trees.
9. Return HT.

Fig 5.1: Flow of the CVFDT algorithm
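The sliding-window count maintenance at the heart of CVFDT can be sketched as follows. This is a simplified Python illustration, not the authors' implementation (the experiments use MOA); the class name, the flat count table, and the attribute encoding are all hypothetical, and only the increment/decrement bookkeeping plus the Hoeffding-bound split test are shown.

```python
import math
from collections import deque

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the observed mean of n samples of a
    variable with the given range is within this epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

class WindowedCounts:
    """CVFDT-style sliding-window statistics: counts are incremented for
    arriving examples and decremented for examples leaving the window."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = {}  # (attribute_index, attribute_value, label) -> count

    def add(self, x, y):
        self.window.append((x, y))
        self._update(x, y, +1)
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()   # forget the oldest example
            self._update(old_x, old_y, -1)

    def _update(self, x, y, delta):
        for i, value in enumerate(x):
            key = (i, value, y)
            self.counts[key] = self.counts.get(key, 0) + delta

# Split validity test (step 8 of the algorithm, in spirit): a split, or the
# start of an alternate subtree, is justified only when the gap between the
# best and second-best attribute's criterion exceeds the Hoeffding bound.
def split_is_valid(best_gain, second_best_gain, n, delta=1e-7, value_range=1.0):
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)
```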
6. EXPERIMENTAL ANALYSIS WITH OBSERVATIONS

Different datasets were taken, and the CVFDT algorithm was run after importing those datasets into MOA. The performance of the various split criteria used in the decision tree approach was also tested with the aim of improving the accuracy of the algorithm. The datasets used here are in ARFF format (a sketch of loading such a file appears at the end of this section); some were taken from the repository of the University of California, and some from projects in Spain working on stream data.

The datasets taken were as follows:
1) Sensor
2) SEA
3) Random tree generator

The readings reported here are for the sensor data. It contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a two-month period (one reading every 1-3 minutes). The sensor ID is used as the class label, so the learning task is to correctly identify the sensor (1 out of 54) purely from the sensor readings and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream: for example, lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g., in a conference room) may regularly rise during meetings.

Fig 6.1: MIT Computer Science and Artificial Intelligence Lab data repository

As discussed above, an attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates the given data. The two common methods compared here are:
1) The entropy-based method (i.e., information gain)
2) The Gini index
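As referenced above, here is a minimal sketch of inspecting such an ARFF file outside MOA, using Python's scipy (the file name sensor.arff is hypothetical; MOA itself reads ARFF files directly):

```python
from scipy.io import arff
import pandas as pd

# Load an ARFF dataset (hypothetical file name) into a pandas DataFrame.
data, meta = arff.loadarff("sensor.arff")
df = pd.DataFrame(data)

print(meta)        # attribute names and types from the ARFF header
print(df.head())   # first few instances of the stream
```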
6.1 RANDOM TREE GENERATOR DATA SET RESULTS

Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000       92.6                           81.7
200000       93                             83
300000       94.7                           80.1
400000       96.3                           82.2
500000       94.8                           80.9
600000       96.9                           81.9
700000       96.9                           82.6
800000       96.7                           82.1
900000       98.7                           84
1000000      97.4                           77.9

Table I: Comparison of accuracy for the random tree generator data

6.2 SEA DATA SET RESULTS

Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000       89.8                           89.3
200000       92.1                           91.6
300000       89.6                           89.3
400000       89.1                           88.9
500000       88.5                           88.5
600000       88.8                           88.1
700000       90.6                           90.6
800000       89.5                           89.3
900000       89.1                           89
1000000      89.9                           89.9

Table II: Comparison of accuracy for the SEA data
6.3 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (CPU UTILIZATION)

Learning evaluation    Evaluation time (CPU     Evaluation time (CPU
instances              seconds), Info Gain      seconds), Gini Index
100000                 6.676843                 8.704856
200000                 13.46289                 18.67332
300000                 20.23333                 29.40619
400000                 26.97257                 39.87386
500000                 33.68062                 49.63952
600000                 40.40426                 59.06198
700000                 47.0499                  67.70443
800000                 53.74234                 78.0941
900000                 59.93558                 88.14057
1000000                66.79963                 98.48343
1100000                73.27367                 107.1727
1200000                79.27971                 116.9851
1300000                85.53535                 127.016
1400000                91.99379                 136.6257
1500000                98.40543                 145.2993
1600000                104.3803                 152.9278
1700000                110.3083                 160.0102
1800000                116.4859                 168.1223
1900000                121.9928                 174.8459

Table III: Comparison of CPU utilization time for the sensor data
6.4 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (ACCURACY)

Learning evaluation    Classifications correct    Classifications correct
instances              (percent), Info Gain       (percent), Gini Index
100000                 96.3                       98.4
200000                 68.3                       69.7
300000                 18                         64.4
400000                 43.2                       67.4
500000                 62.8                       72.9
600000                 92                         71
700000                 97.9                       72.5
800000                 97.4                       73.9
900000                 96.8                       73.7
1000000                80.6                       68.5
1100000                53.6                       71.2
1200000                71                         90.3
1300000                84.1                       73.1
1400000                78.5                       83.9
1500000                96.3                       84.9
1600000                50.9                       84.9
1700000                24                         79
1800000                74.3                       87.6
1900000                98                         97.8

Table IV: Comparison of accuracy for the sensor data
6.5 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (TREE SIZE)

Learning evaluation    Tree size (nodes),    Tree size (nodes),
instances              Info Gain             Gini Index
100000                 14                    126
200000                 30                    270
300000                 44                    396
400000                 60                    530
500000                 76                    666
600000                 88                    800
700000                 102                   938
800000                 122                   1076
900000                 136                   1214
1000000                150                   1346
1100000                172                   1466
1200000                196                   1602
1300000                216                   1742
1400000                226                   1868
1500000                240                   1998
1600000                262                   2122
1700000                282                   2238
1800000                292                   2352
1900000                312                   2474

Table V: Comparison of tree size for the sensor data
6.6 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (LEAVES)

Learning evaluation    Tree size (leaves),    Tree size (leaves),
instances              Info Gain              Gini Index
100000                 7                      63
200000                 15                     135
300000                 22                     198
400000                 30                     265
500000                 38                     333
600000                 44                     400
700000                 51                     469
800000                 61                     538
900000                 68                     607
1000000                75                     673
1100000                86                     733
1200000                98                     801
1300000                108                    871
1400000                113                    934
1500000                120                    999
1600000                131                    1061
1700000                141                    1119
1800000                146                    1176
1900000                156                    1237

Table VI: Comparison of leaves for the sensor data

6.7 COMPARISON OF ALL PERFORMANCE DIMENSIONS TOGETHER FOR SENSOR DATA

Fig 6.2: Comparison of performance for sensor data across all dimensions together
7. CONCLUSION

In this paper we discussed theoretical aspects and practical results of stream data mining classification algorithms with different split criteria. The comparisons on different datasets show the resulting analysis: Hoeffding trees with the windowing technique spend the least amount of time learning and yield higher accuracy with information gain than with the Gini index. Memory utilization, accuracy, and CPU utilization, which are crucial factors for stream data, are discussed here practically with observations. Classification generates a decision tree, and the tables show that the tree generated with information gain as the split criterion is also smaller, alongside marked improvements in accuracy and CPU utilization.

REFERENCES

[1] Elena Ikonomovska, Suzana Loskovska and Dejan Gjorgjevik, "A Survey of Stream Data Mining", Eighth National Conference with International Participation - ETAI, 2007.
[2] S. Muthukrishnan, "Data Streams: Algorithms and Applications", Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, "Mining Data Streams: A Review", Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, "A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering", Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, Morgan Kaufmann, 2001.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa and D. Handy, "VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring", Proceedings of the SIAM International Conference on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, "Adaptive Parameter-free Learning from Evolving Data Streams", Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, "Mining Stream with Concept Drift", Master's thesis, Poznan University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.