SEMESTRAL WORK FOR THE COURSE 336VD (DATA MINING), CZECH TECHNICAL UNIVERSITY IN PRAGUE, 2008/2009

Classification of newborn's sleeping phases from their EEG
Dominik Franěk

Abstract— Correct classification of a newborn's sleeping phases from their EEG can help to predict brain problems or other mental defects. The goal of this semestral work is to find the optimal k for a nearest neighbor classifier. The choice of kNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. The best k for the nearest neighbor classifier was found to be 3, with an accuracy of 83.69%. This means that whenever a newborn's EEG is given, the algorithm can classify the newborn's sleeping phase by choosing the 3 nearest EEG records.

I. ASSIGNMENT

Use the method of k Nearest Neighbors for classification of the target attribute of the chosen dataset. Choose one of the classes as the target (positive) class. Find the best classifier which has a False Positive rate (FPr) < 0.3. Compute the accuracy and the True Positive rate (TPr) of this classifier.

II. INTRODUCTION

The problem is to find the optimal k for a Nearest Neighbor classifier (further written as NN) for the given dataset. The algorithm can be briefly summarized as follows: in the training phase, it computes the similarity measures from all rows in the training set and combines them into a global similarity measure using the XValidation method. In the testing phase, for rows with "unknown" classes, it chooses their k nearest neighbors in the training set according to the trained similarity measure and then uses a customized voting scheme to generate a list of predictions with confidence scores [4].

The dataset is in *.arff format and each row has 55 attributes. The attribute called "class" has 4 nominal values (0, 1, 2, 3) and represents the classified newborn's sleeping phases. I did not find anywhere what the values mean exactly, but from my observations I expect that the kind of sleeping phase can be computed from the given attributes (EEG c1 alpha, ...). [5]

The given dataset is already preprocessed a little: there are no rows with zero attributes and all attributes are numerical values. Fig. 1 shows all attributes of the dataset and their values. These values are not normalized, so the range of the attribute values is from −5 to 543. The normalized dataset is shown in Fig. 2, where all values lie in the range from 0.0 to 1.0 and each class (0, 1, 2, 3) has a different color.

[Fig. 1. Graph showing the original values of the attributes; x-axis: attributes, y-axis: attribute values (−5 to 543).]

[Fig. 2. Graph showing the normalized values of the attributes; x-axis: attributes, y-axis: attribute values (0.0 to 1.0).]

The dataset certainly has to be preprocessed before starting the experiments. First the Normalization operator is used (shown in Fig. 5), which normalizes all numerical values to the range from 0.0 to 1.0. No special treatment of extreme values is done, because in the next preprocessing step only about 7% of all rows (2942 rows) are chosen, so extreme values are effectively "eliminated". For choosing this subset the Stratified Sampling method is used, with the attribute named "class" set as the label. Of the 2942 chosen rows, 2210 are labeled as class 0 and 732 as class 1 (Tab. I). Class 0 is merged from the original classes 1, 2 and 3; class 1 is renamed from the original class 0. The normalized data subset is shown in Fig. 3.

[Fig. 3. Graph showing the normalized data subset with positive class = 1; x-axis: attributes, y-axis: attribute values (0.0 to 1.0).]
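The preprocessing described above was carried out with RapidMiner operators. As a minimal sketch of the same idea in Python (not the RapidMiner process itself), the following code performs min-max normalization to [0.0, 1.0], remaps the classes (original class 0 becomes positive class 1, original classes 1–3 become negative class 0), and draws a stratified subsample of roughly 7% of the rows. The file name "newborn_eeg.arff" and the use of pandas/scipy/scikit-learn are my own assumptions for illustration; the fixed seed 2001 mirrors the Root operator seed mentioned later in the configuration.

# Illustrative sketch only; the actual work used RapidMiner operators.
import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split

# Load the ARFF dataset (file name is an assumption).
raw, meta = arff.loadarff("newborn_eeg.arff")
data = pd.DataFrame(raw)

# Nominal ARFF values are read as bytes; convert the class label to int.
data["class"] = data["class"].str.decode("utf-8").astype(int)

# Relabel: original class 0 -> positive class 1; classes 1, 2, 3 -> negative class 0.
data["class"] = (data["class"] == 0).astype(int)

# Min-max normalization of all regular attributes to [0.0, 1.0].
features = data.columns.drop("class")
data[features] = (data[features] - data[features].min()) / (
    data[features].max() - data[features].min()
)

# Stratified subsample of roughly 7% of the rows (about 2942 of 42027),
# preserving the class distribution.
subset, _ = train_test_split(
    data, train_size=0.07, stratify=data["class"], random_state=2001
)
print(subset["class"].value_counts())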
The full dataset is too big to be processed at once, because it consists of 42027 rows, each with 55 regular attributes; this is why only the subset described above is used.

III. EXPERIMENTS

The chosen positive class of the original data is class 0 (renamed to class 1 in the normalized subset). The other classes (1, 2, 3) are set as negative classes.

After attribute normalization the model training phase begins. As shown in Fig. 5 (right side), the normalized subset is divided into 2 parts: 1/5 of the subset goes to the training phase and 4/5 are used for testing.

In the training phase the Parameter Iteration operator is used to iterate the k of NN. k is iterated from 1 to 15, increasing by +1. To avoid overfitting of the NN method, k-fold cross-validation (CV) is used. For each iteration of k, CV is run 10 times: CV divides the training set 10 times into 2 parts, trains kNN on the first part and validates it on the second part. After the 10 iterations of k-fold cross-validation, the average accuracy of kNN over these 10 runs is computed. Once k has been iterated from 1 to 15, the k with the highest average accuracy is selected and used in the testing phase. A graph with the average accuracy for each k is shown in Fig. 4; a code sketch of this selection loop follows below.

[Fig. 4. Average accuracy of kNN for each k; x-axis: k; y-axis: accuracy.]
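The following Python sketch, continuing from the preprocessing sketch above, expresses the same model selection outside RapidMiner: it splits the subset 1/5 training versus 4/5 testing, iterates k from 1 to 15, scores each value with 10-fold cross-validation on the training part, and keeps the k with the highest average accuracy. The use of scikit-learn is an assumption for illustration only.

# Illustrative sketch of the k selection loop (not the RapidMiner process itself).
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# `subset` is the normalized, stratified data subset from the previous sketch.
X = subset.drop(columns="class").to_numpy()
y = subset["class"].to_numpy()

# Split ratio 0.2: 1/5 of the subset for training, 4/5 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=2001
)

# Try k = 1..15 and keep the k with the best average 10-fold CV accuracy.
mean_accuracy = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="accuracy")
    mean_accuracy[k] = scores.mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, mean_accuracy[best_k])   # the report found best_k = 3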
TABLE I
STATISTICS OF THE ATTRIBUTES OF THE NORMALIZED SUBSET

Attr. name          Statistics (avg +/- deviation)   Range
class (label)       0.0 (2210), 1.0 (732)
PNG                 0.427 +/- 0.111                  [0.000 ; 0.919]
PNG filtered        0.359 +/- 0.166                  [0.000 ; 1.000]
EMG std             0.114 +/- 0.090                  [0.033 ; 0.766]
EMG std filtered    0.126 +/- 0.138                  [0.004 ; 0.874]
ECG beat            0.427 +/- 0.135                  [0.212 ; 0.993]
ECG beat filtered   0.444 +/- 0.138                  [0.225 ; 0.987]
EEG fp1 delta       0.216 +/- 0.065                  [0.081 ; 0.964]
EEG fp2 delta       0.218 +/- 0.067                  [0.071 ; 0.958]
EEG t3 delta        0.202 +/- 0.074                  [0.064 ; 0.906]
EEG t4 delta        0.232 +/- 0.089                  [0.062 ; 0.956]
EEG c3 delta        0.243 +/- 0.072                  [0.091 ; 0.961]
EEG c4 delta        0.244 +/- 0.070                  [0.089 ; 0.968]
EEG o1 delta        0.212 +/- 0.077                  [0.066 ; 0.958]
EEG o2 delta        0.211 +/- 0.083                  [0.046 ; 0.933]
EEG fp1 theta       0.188 +/- 0.072                  [0.068 ; 0.976]
EEG fp2 theta       0.216 +/- 0.075                  [0.090 ; 0.972]
EEG t3 theta        0.222 +/- 0.065                  [0.077 ; 0.970]
EEG t4 theta        0.264 +/- 0.079                  [0.082 ; 0.938]
EEG c3 theta        0.308 +/- 0.061                  [0.101 ; 0.962]
EEG c4 theta        0.299 +/- 0.060                  [0.098 ; 0.960]
EEG o1 theta        0.219 +/- 0.067                  [0.080 ; 0.922]
EEG o2 theta        0.271 +/- 0.079                  [0.080 ; 0.931]
EEG fp1 alpha       0.112 +/- 0.077                  [0.043 ; 0.981]
EEG fp2 alpha       0.124 +/- 0.081                  [0.046 ; 0.956]
EEG t3 alpha        0.158 +/- 0.080                  [0.055 ; 0.946]
EEG t4 alpha        0.181 +/- 0.082                  [0.055 ; 0.928]
EEG c3 alpha        0.249 +/- 0.070                  [0.088 ; 0.943]
EEG c4 alpha        0.246 +/- 0.069                  [0.085 ; 0.957]
EEG o1 alpha        0.116 +/- 0.066                  [0.039 ; 0.910]
EEG o2 alpha        0.151 +/- 0.066                  [0.048 ; 0.935]
EEG fp1 beta1       0.114 +/- 0.079                  [0.043 ; 0.985]
EEG fp2 beta1       0.123 +/- 0.083                  [0.046 ; 0.943]
EEG t3 beta1        0.152 +/- 0.084                  [0.045 ; 0.957]
EEG t4 beta1        0.168 +/- 0.087                  [0.053 ; 0.930]
EEG c3 beta1        0.234 +/- 0.077                  [0.092 ; 0.942]
EEG c4 beta1        0.226 +/- 0.074                  [0.079 ; 0.949]
EEG o1 beta1        0.091 +/- 0.070                  [0.028 ; 0.916]
EEG o2 beta1        0.129 +/- 0.070                  [0.041 ; 0.970]
EEG fp1 beta2       0.217 +/- 0.081                  [0.086 ; 0.990]
EEG fp2 beta2       0.211 +/- 0.076                  [0.083 ; 0.958]
EEG t3 beta2        0.189 +/- 0.070                  [0.063 ; 0.927]
EEG t4 beta2        0.226 +/- 0.083                  [0.065 ; 0.922]
EEG c3 beta2        0.248 +/- 0.066                  [0.092 ; 0.960]
EEG c4 beta2        0.246 +/- 0.065                  [0.090 ; 0.966]
EEG o1 beta2        0.230 +/- 0.085                  [0.076 ; 0.958]
EEG o2 beta2        0.220 +/- 0.080                  [0.055 ; 0.932]
EEG fp1 gama        0.154 +/- 0.073                  [0.058 ; 0.976]
EEG fp2 gama        0.172 +/- 0.076                  [0.075 ; 0.956]
EEG t3 gama         0.196 +/- 0.069                  [0.067 ; 0.958]
EEG t4 gama         0.227 +/- 0.078                  [0.071 ; 0.897]
EEG c3 gama         0.289 +/- 0.063                  [0.097 ; 0.959]
EEG c4 gama         0.281 +/- 0.061                  [0.095 ; 0.959]
EEG o1 gama         0.168 +/- 0.065                  [0.062 ; 0.915]
EEG o2 gama         0.237 +/- 0.077                  [0.072 ; 0.912]

IV. METHODOLOGY

A. Used tool

The tool used is RapidMiner (v4.0) [1]. RapidMiner allows the user to carry out all phases of data mining in a single tool, so it is enough to become familiar with only one environment. All operators used in this work are accessible in the basic version of RapidMiner.

B. Configuration

The project is built by combining many operators in RapidMiner. The complete tree view of the operators used to find the best k in Nearest Neighbor classification is shown in Fig. 5.

[Fig. 5. "Box view" of the complete project in RapidMiner.]

• All operators have the local random seed set to -1; only the Root operator has the value 2001, so that the random operations generate the same values on every run. If an operator has a sampling type, it is set to stratified sampling.
• The SplitChain operator has the split ratio set to 0.2.
• The XValidation operator has the number of validations set to 10 and the measure set to Euclidean Distance.
• The "NearestNeighbor trying k" operator has k set to 15, but this parameter is driven by the "Iterating k - training" operator.
• The ClassificationPerformance (1) operator has accuracy checked.
• The "NearestNeighbor defined k" operator has k set to 3 and the measure set to Euclidean Distance.
• The ClassificationPerformance (2) operator has accuracy checked.
• The BinominalClassificationPerformance operator has fallout checked.
• The ProcessLog operator logs the accuracy from ClassificationPerformance.

C. Experiments setup

The Nearest Neighbor classification uses the Euclidean distance to compute the k nearest neighbors. In plain words: "find the points x_i closest to x_j". The Euclidean distance between two points x_i and x_j with n attributes is defined as

d(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}    (1)

The NN algorithm can be built as follows [6] (a minimal sketch is given after the list):
• Training phase: build the set of training examples T.
• Testing phase:
  – a query instance x_q to be classified is given;
  – let x_1, ..., x_k denote the k instances from T that are nearest to x_q;
  – return \hat{F}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)), where f(x_i) is the class of x_i, V is the set of classes, and \delta(a, b) = 1 if a = b and 0 otherwise.
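The following short Python sketch illustrates this procedure directly, combining the Euclidean distance of equation (1) with the majority-vote rule above. It is a from-scratch illustration rather than the RapidMiner NearestNeighbor operator, and the function and variable names are my own.

# Minimal from-scratch kNN, illustrating equation (1) and the voting rule above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query instance by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training example, eq. (1).
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training examples.
    nearest = np.argsort(distances)[:k]
    # argmax over classes v of sum_i delta(v, f(x_i)) == majority vote.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Example usage with the split from the earlier sketch:
# y_pred = np.array([knn_predict(X_train, y_train, x, k=3) for x in X_test])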
The best k for the Nearest Neighbor classifier is found by iterating k from 1 to 15; the upper value of 15 was judged to be sufficient. In each iteration the ClassificationPerformance operator computes the accuracy for the given k. The ProcessLog operator records the results of ClassificationPerformance and generates the report (Fig. 4). From the report it is evident that the best k is 3.

TABLE II
NN CLASSIFICATION FOR k := 3; accuracy = 83.69%, FPr = 8.99%, TPr = 61.60%

           True 0   True 1
Pred 0     1609     225
Pred 1     159      361

The positive class is the class with value 1. For k = 3 the accuracy was 83.69%, as shown in Tab. II. The False Positive rate (FPr) of this classifier is 8.99%:

159 / (159 + 1609) = 0.0899

The True Positive rate (TPr) is 61.60%, because there are 586 examples with class = 1 and only 361 of them were classified correctly:

361 / (361 + 225) = 0.616
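For reference, the same quantities can be recomputed directly from the confusion matrix in Tab. II. The short Python sketch below does so; the variable names are mine, and the layout assumes rows are the predicted class and columns the true class, as in the table.

# Recompute accuracy, FPr and TPr from the confusion matrix of Tab. II.
# Rows: predicted class (0, 1); columns: true class (0, 1).
confusion = [
    [1609, 225],   # predicted 0: 1609 true negatives, 225 false negatives
    [159,  361],   # predicted 1: 159 false positives, 361 true positives
]
tn, fn = confusion[0]
fp, tp = confusion[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 1970 / 2354 = 0.8369 -> 83.69%
fpr = fp / (fp + tn)                         # 159 / 1768  = 0.0899
tpr = tp / (tp + fn)                         # 361 / 586   = 0.6160

print(f"accuracy = {accuracy:.4f}, FPr = {fpr:.4f}, TPr = {tpr:.4f}")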
V. DISCUSSION

The False Positive rate seems to be very good. It may appear surprisingly low, but that is probably due to the large subset of training data. What remains to be discussed is whether the split ratio of 0.2 between the training and testing parts was set correctly. With a faster computer the opposite ratio of 0.8 could be used. In my opinion the 584 training examples were enough, and the FPr suggests the ratio was not chosen badly. On the other hand, the TPr is 61.60%, which is not much, and it can easily be pushed higher or lower at the cost of influencing the FPr.

The next question is whether the algorithm should use weighted kNN instead. I tried to find dependencies between the 55 attributes, but I was not successful, so I do not think that putting weights on the attributes would be helpful.

VI. CONCLUSION

In my opinion I found a very good classifier for the subset of the dataset. Some improvements could still be made so that the algorithm works better. I think they would be worthwhile if such a classifier were to be used in practice, but for a school work they are not so important.

The hardest part of the work was exploring the operators in RapidMiner and finding the right ones I needed. I know some of them could still be replaced by better operators, but this solution worked and, what is more, it gave good results. Most of the time was spent waiting for RapidMiner to process all the operators on the given dataset. Unfortunately the program is written in Java, which is not a language for scientific computing, and I had to restart Java quite often because it ran out of memory. The most interesting part for me was generating the graphs and writing this report. I am very satisfied that I finished the work, and I can say that I learnt a lot about data mining and about classifying a dataset. I am afraid anybody can feel from this work that my future specialization will be Software Engineering and such scientific work is not my cup of tea.

REFERENCES

[1] CENTRAL QUEENSLAND UNIVERSITY. RapidMiner GUI Manual [online]. May 29, 2007 [cited 2008-02-08]. Available from: <http://os.cqu.edu.au/oswins/datamining/rapidminer/rapidminer-4.0beta-guimanual.pdf>.
[2] FARKASOVA, Blanka; KRCAL, Martin. Project Bibliographic Citations [online]. c2004-2008 [cited 2008-05-08]. Available from: <http://www.citace.com/>.
[3] LAURIKKALA, Jorma. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Report, Department of Computer and Information Sciences, University of Tampere, 2001. 14 p. ISBN 951-44-5093-0. Available from: <http://www.cs.uta.fi/reports/pdf/A-2001-2.pdf>.
[4] TEKNOMO, Kardi. K-Nearest Neighbors Tutorial [online]. c2006 [cited 2008-05-08]. Available from: <http://people.revoledu.com/kardi/tutorial/KNN/>.
[5] POBLANO, Adrian; GUTIERREZ, Roberto. Correlation between the neonatal EEG and the neurological examination in the first year of life in infants with bacterial meningitis. Arq. Neuro-Psiquiatr. [online]. 2007, vol. 65, no. 3a [cited 2008-05-10], pp. 576-580. ISSN 0004-282X. doi:10.1590/S0004-282X2007000400005. Available from: <http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0004-282X2007000400005&lng=en&nrm=iso>.
[6] SOLOMATINE, D. P. Instance-Based Learning and k-Nearest Neighbor Algorithm [online]. c1988-2003 [cited 2008-05-10]. Available from: <http://www.xs4all.nl/~dpsol/data-machine/nmtutorial/instancebasedlearningandknearestneighboralgorithm.htm>.
[7] VAYATIS, Nicolas; CLEMENCON, Stephan. Advanced Machine Learning Course [online]. 2008 [cited 2008-05-08]. Available from: <http://www.cmla.ens-cachan.fr/Membres/vayatis/teaching/cours-de-machine-learning-ecp.html>.
[8] ZHU, Xiaojin. K-Nearest-Neighbor: An Introduction to Machine Learning. CS 540: Introduction to Artificial Intelligence [online]. 2005 [cited 2008-05-08]. Available from: <http://pages.cs.wisc.edu/~jerryzhu/cs540/knn.pdf>.
[9] VAN DEN BOSCH, Antal. Video: K-Nearest Neighbor Classification [online]. Tilburg University, c2007 [cited 2008-05-10]. Available from: <http://videolectures.net/aaai07_bosch_knnc/>.