SEMESTRAL WORK FOR THE COURSE 336VD (DATA MINING), CZECH TECHNICAL UNIVERSITY IN PRAGUE, 2008/2009 1
Classification of newborn’s sleeping phases from
Dominik Franˇ k
Abstract: Correct classification of a newborn’s sleep phases from its EEG can help to predict brain problems or other mental defects. The goal of this semestral work is to find the optimal k for a k-nearest-neighbor (kNN) classifier. The choice of kNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. The best value found was k = 3, with an accuracy of 83.69%. This means that whenever a newborn’s EEG record is given, the algorithm classifies the sleep phase of this newborn from the 3 nearest EEG records.
I. ASSIGNMENT
Use the method of k Nearest Neighbors to classify the target attribute of the chosen dataset. Choose one of the classes as the target (positive) class. Find the best classifier that has a False Positive rate (FPr) < 0.3. Compute the accuracy and the True Positive rate (TPr) of this classifier.

Fig. 1. Graph showing original values of attributes; x-axis: attributes, y-axis: values of attributes (−5 to 543)
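The three quantities named in the assignment can all be computed from the four cells of a binary confusion matrix. The following is a minimal sketch in Python, not part of the RapidMiner process; the function name `binary_rates` is my own, and the example counts are the ones reported in Tab. II for k = 3 (positive class = 1).

```python
def binary_rates(tp, fp, fn, tn):
    """Return (accuracy, TPr, FPr) from the cells of a binary confusion matrix.

    tp: positives predicted as positive, fn: positives predicted as negative,
    fp: negatives predicted as positive, tn: negatives predicted as negative.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of all examples classified correctly
    tpr = tp / (tp + fn)           # True Positive rate
    fpr = fp / (fp + tn)           # False Positive rate
    return accuracy, tpr, fpr


# Counts taken from Tab. II (k = 3, positive class = 1).
accuracy, tpr, fpr = binary_rates(tp=361, fp=159, fn=225, tn=1609)

# The assignment requires FPr < 0.3.
meets_requirement = fpr < 0.3

print(f"accuracy={accuracy:.4f} TPr={tpr:.4f} FPr={fpr:.4f} ok={meets_requirement}")
```

With these counts the sketch reproduces the reported accuracy of 83.69%, TPr of 61.60% and FPr of 8.99%.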
II. INTRODUCTION
The problem is to find the optimal k for the Nearest Neighbor classifier (abbreviated as NN from here on) for the given dataset.
The algorithm can be briefly summarized as follows: In the training phase, it computes the similarity measures from all rows in the training set and combines them in a global similarity measure using the XValidation method. In the testing phase, for rows with “unknown” classes, it chooses their k nearest neighbors in the training set according to the trained similarity measure and then uses a customized voting scheme to generate a list of predictions with confidence scores.
The dataset is in *.arff format and each row has 55 attributes. The attribute called “class” has 4 nominal values (0, 1, 2, 3) and represents the classified newborn’s sleep phases. I did not find a description of what exactly these values mean, but from my observations I expect that the kind of sleep can be computed from the given attributes (EEG c1 alpha, ...).
The given dataset is already slightly preprocessed. There are no rows with zero attributes and all attributes are numerical. Fig. 1 shows all attributes of the dataset and their values. These values are not normalized, so the attribute values range from −5 to 543. The normalized dataset is shown in Fig. 2, where all values are in the range from 0.0 to 1.0. Each class (0, 1, 2, 3) has a different color. The dataset is too big to process at once, because it consists of 42027 rows, each with 55 regular attributes.

Fig. 2. Graph showing normalized values of attributes; x-axis: attributes, y-axis: values of attributes (0.0 to 1.0)

III. EXPERIMENTS
The chosen positive class of the original data is class 0 (renamed to class 1 in the normalized subset). The other classes (1, 2, 3) are set as negative classes.
The dataset has to be preprocessed before the experiments start. First, the Normalization operator is used (shown in Fig. 5), which normalizes all numerical values to the range from 0.0 to 1.0. Extreme values are not treated separately, because the next preprocessing step selects just 7% of all rows (2942 rows), so the extreme values will be “eliminated”. This subset is chosen by Stratified Sampling, with the attribute named “class” set as the label attribute. Of the 2942 chosen rows, 2210 are labeled as class 0 and 732 as class 1 (Tab. I). Class 0 is merged from the original classes 1, 2 and 3. Class 1 is renamed from the original class 0. The normalized data subset is shown in Fig. 3.
After the attributes are normalized, the phase of training the model begins. As shown in Fig. 5 (right side), the normalized subset is divided into 2 parts: 1/5 of the subset goes to the training phase and 4/5 are used for testing.
In the training phase, the Parameter Iteration operator is used to iterate the k of NN. k is iterated from 1 to 15 in steps of 1.
To avoid overfitting of the NN classifier, a method called K-fold cross
validation (CV) is used. For each value of k, CV is run 10 times. CV divides the training set 10 times into 2 parts, trains kNN on the first part and validates it on the second part. After the 10 iterations of K-fold cross validation, the average accuracy of kNN over these 10 runs is computed. After k has been iterated from 1 to 15, the k with the highest average accuracy is selected and used in the testing phase. The average accuracies for each k are plotted in Fig. 4.

Fig. 3. Graph showing normalized data subset with positive class = 1; x-axis: attributes, y-axis: values of attributes (0.0 to 1.0)

Fig. 4. Average accuracy for kNN; x-axis: k of NN; y-axis: accuracy

TABLE I. STATISTICS OF ATTRIBUTES OF NORMALIZED SUBSET
Attr. name          Statistics         Range
class               label              0.0 (2210), 1.0 (732)
PNG                 0.427 +/- 0.111    [0.000 ; 0.919]
PNG filtered        0.359 +/- 0.166    [0.000 ; 1.000]
EMG std             0.114 +/- 0.090    [0.033 ; 0.766]
EMG std filtered    0.126 +/- 0.138    [0.004 ; 0.874]
ECG beat            0.427 +/- 0.135    [0.212 ; 0.993]
ECG beat filtered   0.444 +/- 0.138    [0.225 ; 0.987]
EEG fp1 delta       0.216 +/- 0.065    [0.081 ; 0.964]
EEG fp2 delta       0.218 +/- 0.067    [0.071 ; 0.958]
EEG t3 delta        0.202 +/- 0.074    [0.064 ; 0.906]
EEG t4 delta        0.232 +/- 0.089    [0.062 ; 0.956]
EEG c3 delta        0.243 +/- 0.072    [0.091 ; 0.961]
EEG c4 delta        0.244 +/- 0.070    [0.089 ; 0.968]
EEG o1 delta        0.212 +/- 0.077    [0.066 ; 0.958]
EEG o2 delta        0.211 +/- 0.083    [0.046 ; 0.933]
EEG fp1 theta       0.188 +/- 0.072    [0.068 ; 0.976]
EEG fp2 theta       0.216 +/- 0.075    [0.090 ; 0.972]
EEG t3 theta        0.222 +/- 0.065    [0.077 ; 0.970]
EEG t4 theta        0.264 +/- 0.079    [0.082 ; 0.938]
EEG c3 theta        0.308 +/- 0.061    [0.101 ; 0.962]
EEG c4 theta        0.299 +/- 0.060    [0.098 ; 0.960]
EEG o1 theta        0.219 +/- 0.067    [0.080 ; 0.922]
EEG o2 theta        0.271 +/- 0.079    [0.080 ; 0.931]
EEG fp1 alpha       0.112 +/- 0.077    [0.043 ; 0.981]
EEG fp2 alpha       0.124 +/- 0.081    [0.046 ; 0.956]
EEG t3 alpha        0.158 +/- 0.080    [0.055 ; 0.946]
EEG t4 alpha        0.181 +/- 0.082    [0.055 ; 0.928]
EEG c3 alpha        0.249 +/- 0.070    [0.088 ; 0.943]
EEG c4 alpha        0.246 +/- 0.069    [0.085 ; 0.957]
EEG o1 alpha        0.116 +/- 0.066    [0.039 ; 0.910]
EEG o2 alpha        0.151 +/- 0.066    [0.048 ; 0.935]
EEG fp1 beta1       0.114 +/- 0.079    [0.043 ; 0.985]
EEG fp2 beta1       0.123 +/- 0.083    [0.046 ; 0.943]
EEG t3 beta1        0.152 +/- 0.084    [0.045 ; 0.957]
EEG t4 beta1        0.168 +/- 0.087    [0.053 ; 0.930]
EEG c3 beta1        0.234 +/- 0.077    [0.092 ; 0.942]
EEG c4 beta1        0.226 +/- 0.074    [0.079 ; 0.949]
EEG o1 beta1        0.091 +/- 0.070    [0.028 ; 0.916]
EEG o2 beta1        0.129 +/- 0.070    [0.041 ; 0.970]
EEG fp1 beta2       0.217 +/- 0.081    [0.086 ; 0.990]
EEG fp2 beta2       0.211 +/- 0.076    [0.083 ; 0.958]
EEG t3 beta2        0.189 +/- 0.070    [0.063 ; 0.927]
EEG t4 beta2        0.226 +/- 0.083    [0.065 ; 0.922]
EEG c3 beta2        0.248 +/- 0.066    [0.092 ; 0.960]
EEG c4 beta2        0.246 +/- 0.065    [0.090 ; 0.966]
EEG o1 beta2        0.230 +/- 0.085    [0.076 ; 0.958]
EEG o2 beta2        0.220 +/- 0.080    [0.055 ; 0.932]
EEG fp1 gama        0.154 +/- 0.073    [0.058 ; 0.976]
EEG fp2 gama        0.172 +/- 0.076    [0.075 ; 0.956]
EEG t3 gama         0.196 +/- 0.069    [0.067 ; 0.958]
EEG t4 gama         0.227 +/- 0.078    [0.071 ; 0.897]
EEG c3 gama         0.289 +/- 0.063    [0.097 ; 0.959]
EEG c4 gama         0.281 +/- 0.061    [0.095 ; 0.959]
EEG o1 gama         0.168 +/- 0.065    [0.062 ; 0.915]
EEG o2 gama         0.237 +/- 0.077    [0.072 ; 0.912]

IV. METHODOLOGY
A. Used tool
The tool used is RapidMiner (v4.0). RapidMiner allows the user to carry out all phases of data mining in a single tool, so only one environment has to be learned. All operators used in this work are accessible from the basic version of RapidMiner.

B. Configuration
The project is built by combining many operators in RapidMiner. The complete tree view of the operators used to find the best k for Nearest Neighbor classification is shown in Fig. 5.
• All operators have the local random seed set to -1. Only the Root operator has the value 2001, so that the random operations generate the same values on every run. If an operator has a sampling type, it is set to stratified sampling.
• The SplitChain operator has the split ratio set to 0.2.
• The XValidation operator has the number of validations set to 10 and the measure set to Euclidean Distance.
• The “NearestNeighbor trying k” operator has k set to 15, but this parameter is overridden by the “Iterating k - training” operator.
• The ClassificationPerformance (1) operator has accuracy checked.
• The “NearestNeighbor defined k” operator has k set to 3 and the measure set to Euclidean Distance.
• The ClassificationPerformance (2) operator has accuracy checked.
• The BinominalClassificationPerformance operator has ...
• The ProcessLog operator logs the accuracy from ClassificationPerformance.

Fig. 5. “Box view” of the complete project from RapidMiner

C. Experiments setup
The Nearest Neighbor classification uses the Euclidean distance to find the k nearest neighbors. In plain words it can be read as “find the point x_i closest to x_j”. The Euclidean distance between x_i and x_j (j = 1, 2, ..., n) is defined as:

$$d(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{in} - x_{jn})^2}$$

The NN algorithm can be built as follows:
• Training phase: Build the set of training examples T.
• Testing phase:
  – A query instance x_q to be classified is given.
  – Let x_1 ... x_k denote the k instances from T that are nearest to x_q.
  – Return the class with the most votes among these neighbors:

$$F(x_q) = \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where V is the set of class values, f(x_i) is the class of training instance x_i, and δ(a, b) = 1 if a = b and 0 otherwise.

The best k for the Nearest Neighbor classifier is found by iterating k from 1 to 15; the upper bound of 15 was considered sufficient. In each iteration, the ClassificationPerformance operator computes the accuracy for the given k. The ProcessLog operator records the results of ClassificationPerformance and generates the report (Fig. 4). The report shows that the best k is 3.

TABLE II. NN CLASSIFICATION FOR k = 3; accuracy = 83.69%, FPr = 8.99%, TPr = 61.60%
          True 0    True 1
Pred 0    1609      225
Pred 1    159       361

The positive class is the class with value = 1. For k = 3 the accuracy was 83.69%, as shown in Tab. II. The False Positive rate (FPr) of this classifier is 8.99%:
159/(159 + 1609) = 0.0899
The True Positive rate (TPr) is 61.60%, because there are 586 examples with class = 1 and only 361 of them were classified correctly:
361/(361 + 225) = 0.616

V. DISCUSSION
The False Positive rate seems to be very good. It may seem very low, but that is probably due to the big subset of training examples.
It is open to discussion whether the ratio = 0.2 for dividing the data subset into the training and testing parts is set correctly. With a faster computer, the opposite ratio = 0.8 could be used. In my opinion, 584 training examples were enough, and the FPr indicates that the ratio was not chosen so badly. On the other hand, the TPr is 61.60%, which is not much, and it could easily be raised or lowered at the cost of influencing the FPr.
Another question is whether the algorithm should use weighted kNN. I tried to find dependencies between the 55 attributes, but I was not successful, so I do not think that setting weights on attributes would be helpful.
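To make the idea of attribute weights concrete, here is a minimal sketch in Python of kNN with a per-attribute weighted Euclidean distance and a majority vote. It is not the RapidMiner process used in this work: the function names, the toy rows and the weight values are made up for illustration only. With all weights equal to 1 it reduces to the plain Euclidean kNN described in Section IV-C.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Euclidean distance with one non-negative weight per attribute."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def knn_predict(train_rows, train_labels, query, k=3, weights=None):
    """Majority vote among the k training rows nearest to `query`."""
    if weights is None:
        weights = [1.0] * len(query)  # assumption: no weights means plain kNN
    # Sort training row indices by their distance to the query instance.
    order = sorted(
        range(len(train_rows)),
        key=lambda i: weighted_distance(train_rows[i], query, weights),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized toy data with two attributes per row.
train_rows = [[0.10, 0.90], [0.20, 0.10], [0.15, 0.20], [0.80, 0.85], [0.90, 0.15]]
train_labels = [1, 0, 0, 1, 0]
query = [0.12, 0.80]

plain = knn_predict(train_rows, train_labels, query, k=3)
# Weight 0 switches the second attribute off completely.
first_attribute_only = knn_predict(train_rows, train_labels, query, k=3, weights=[1.0, 0.0])

print(plain, first_attribute_only)
```

On this toy data the unweighted vote returns class 1, while ignoring the second attribute returns class 0, so weights can change the result when attributes differ in relevance. Whether that would help on the EEG data is not tested here.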
VI. CONCLUSION
In my opinion, I found a very good classifier for the subset of the dataset. Some improvements could still be made to make the algorithm work better. I think they would be worthwhile if such a classifier were to be used in practice, but for school work they are not so important.
The hardest part of the work was exploring the operators in RapidMiner and finding the ones I needed. I know that some of them could still be replaced by better operators, but this solution worked and, what is more, it gave good results. Most of the time I spent waiting for RapidMiner to process all the operators with the given dataset. Unfortunately, this program is written in Java, which is not a language for scientific computing, and I had to restart Java quite often because it ran out of memory. The most interesting part for me was generating the graphs and writing this report.
I am very satisfied that I finished the work, and I can say that I learnt a lot about data mining and about classifying datasets. I am afraid anybody can tell from this work that my future specialization will be Software Engineering and that such scientific work is not my cup of tea.
REFERENCES
CENTRAL QUEENSLAND UNIVERSITY. RapidMiner GUI manual [online]. 2007, May 29, 2007 [cit. 2008-02-08]. Available from WWW: <http://os.cqu.edu.au/oswins/datamining/
FARKASOVA, Blanka, KRCAL, Martin. Project Bibliographic citations [online]. c2004-2008 [cit. 2008-05-08]. CZ. Available from WWW:
LAURIKKALA, Jorma. Improving Identification of Difficult Small Classes by Balancing Class Distribution. [s.l.], 2001. 14 p. DEPARTMENT OF COMPUTER AND INFORMATION SCIENCES UNIVERSITY OF TAMPERE. Report. Available from WWW: <http://www.cs.uta.fi/reports/pdf/A-2001-2.pdf>. ISBN 951-
KARDI, Teknomo. K-Nearest Neighbors Tutorial [online]. c2006 [cit. 2008-05-08]. Available from WWW: <http://people.revoledu.
POBLANO, Adrian and GUTIERREZ, Roberto. Correlation between the neonatal EEG and the neurological examination in the first year of life in infants with bacterial meningitis. Arq. Neuro-Psiquiatr. [online]. 2007, vol. 65, no. 3a [cited 2008-05-10], pp. 576-580. Available iso>. ISSN 0004-282X. doi: 10.1590/S0004-282X2007000400005
SOLOMATINE, D.P. Instance-based learning and k-Nearest neighbor algorithm [online]. c1988-2003 [cit. 2008-05-10]. EN. Available from WWW: <http://www.
VAYATIS, Nicolas, CLEMENCON, Stphan. Advanced Machine Learning Course [online]. [cit. 2008-05-08]. EN. Available from WWW:
XIAOJIN, Zhu. K-nearest-neighbor: an introduction to machine learning. CS 540: Introduction to Artificial Intelligence [online]. 2005 [cit. 2008-05-08]. Available from WWW: <http://pages.cs.wisc.edu/
van den BOSCH, Antal. Video: K-nearest neighbor classification [online]. Tilburg University cc2007 [cit. 2008-05-10]. EN. Available from WWW: