MIL (MINIMUM INFORMATION LOSS) DISCRETIZATION
ALGORITHM
(FOR DESIGN OF A DATA DISCRETIZER FOR CLASSIFICATION
PROBLEMS)



Project Guide – Prof. Bikash K. Sarkar
Members – Shashidhar Sundareisan (BE/1343/08) and Gourab Mitra (BE/1232/08)
Introduction

 Discretization is the process of transforming
  continuous models and equations into discrete
  counterparts.
 This process is usually carried out as a first step
  toward making them suitable for numerical
  evaluation and implementation on digital
  computers.




Four scans in MIL

 Scan 1: Calculate dmax and dmin (see the sketch below)
 Scan 2: Calculate the CTS (Calculated Threshold) for each of
  the n intervals between dmin and dmax; each interval has
  width (dmax – dmin)/n
 Scan 3: Calculate the optimal merged sub-intervals
 Scan 4: Discretize each attribute value into one of
  the optimal merged sub-intervals
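
Scan 1 is a single linear pass over the attribute's values. The sketch below is illustrative Java, not the project's code; the class and method names are our own, and the later scans are sketched after the worked example.

```java
// Illustrative sketch of Scan 1: one pass to find dmin and dmax.
public final class Scan1 {

    /** Returns {dmin, dmax} for one continuous attribute. */
    static double[] minMax(double[] values) {
        double dmin = values[0], dmax = values[0];
        for (double v : values) {
            if (v < dmin) dmin = v;
            if (v > dmax) dmax = v;
        }
        return new double[] { dmin, dmax };
    }

    public static void main(String[] args) {
        double[] cgpa = { 6.90, 7.90, 8.00, 5.70, 7.00 };   // rows from the example slide
        double[] r = minMax(cgpa);
        System.out.printf("dmin = %.2f, dmax = %.2f%n", r[0], r[1]);   // dmin = 5.70, dmax = 8.00
    }
}
```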

Example
Name                          CGPA                  Grade
Alice                         6.90                  Average
Bob                           7.90                  Good
Catherine                     8.00                  Excellent
Doug                          5.70                  Poor
Elena                         7.00                  Average
……                            …..                   ……

 •    CGPA is a continuous attribute
 •    s = 4 {'Excellent', 'Good', 'Average', 'Poor'}
 •    c = 3 (constant value chosen by the user)
 •    n = c · s (number of sub-intervals) = 12
 •    m = 88 (instances of training data; see the sketch below)
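
A trivial sketch of the derived quantities, with the values taken from this slide:

```java
// Derived quantities from the example (illustrative).
public final class ExampleParams {
    public static void main(String[] args) {
        int s = 4;                    // grades: Excellent, Good, Average, Poor
        int c = 3;                    // constant chosen by the user
        int m = 88;                   // training instances
        int n = c * s;                // number of sub-intervals = 12
        double ts = (double) m / n;   // TS = 88/12 = 7.33..., shown as ~7 on the next slide
        System.out.printf("n = %d, TS = %.2f%n", n, ts);
    }
}
```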


 Here, dmin = 5.7 and dmax = 8.0
 TS = m/n = 88/12 ≈ 7
 We divide the range into 12 sub-intervals
   Interval              CTS
   5.7 – 5.975           1
   5.975 – 6.25          2
   6.25 – 6.525          6
   6.525 – 6.8           20
   6.8 – 7.075           15
   …                     …
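
Scan 2 can be sketched the same way: split [dmin, dmax] into n equal-width bins and count the training instances falling into each; these counts are the CTS values tabulated above. Illustrative Java with hypothetical names:

```java
// Illustrative sketch of Scan 2: count instances per equal-width sub-interval.
public final class Scan2 {

    /** CTS[i] = number of values falling in sub-interval i of n equal-width bins. */
    static int[] counts(double[] values, double dmin, double dmax, int n) {
        int[] cts = new int[n];
        double width = (dmax - dmin) / n;   // interval width, as on the previous slide
        for (double v : values) {
            int i = (int) ((v - dmin) / width);
            if (i == n) i = n - 1;          // a value equal to dmax belongs to the last bin
            cts[i]++;
        }
        return cts;
    }
}
```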

Example Frequency chart

[Bar chart: frequency of training instances in each initial sub-interval (y-axis 0–40).]
 In the first interval, Tot_CTS < TS/3, so we merge it
  with the next interval. (This chart walkthrough uses
  different counts from the CGPA table: here m/n = 12.)
 Update Tot_CTS = Tot_CTS + CTS[1] (= 3 + 3)
 Update TS = TS + m/n (= 12 + 12)




[Bar chart: frequencies after the first merge (y-axis 0–40).]
 Still, Tot_CTS < TS/3, so we merge again.
 Update Tot_CTS = Tot_CTS + CTS[2] (= 6 + 8)
 Update TS = TS + m/n (= 24 + 12)




[Bar chart: frequencies after the second merge (y-axis 0–40).]
 Now, Tot_CTS > TS/3, so we stop merging and emit an
  optimal merged sub-interval.
 Set Tot_CTS = 0
 Reset TS = m/n
 Then we move on to the next CTS
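
The merge logic just walked through (Scan 3) might look like the sketch below. It is reconstructed from the slides with hypothetical names; k = 3 as in the walkthrough, counts beyond CTS[2] are invented for illustration, and closing the final sub-interval at the end of the loop is our assumption.

```java
// Illustrative sketch of Scan 3: merge adjacent sub-intervals whose cumulative
// count (Tot_CTS) stays below TS/k, emitting optimal merged sub-intervals.
import java.util.ArrayList;
import java.util.List;

public final class Scan3 {

    /** Returns the index of the last base interval in each merged sub-interval. */
    static List<Integer> mergeBoundaries(int[] cts, int m, int n, int k) {
        List<Integer> boundaries = new ArrayList<>();
        double tsStep = (double) m / n;   // m/n, added to TS on every merge
        double ts = tsStep;
        int tot = 0;                      // Tot_CTS
        for (int i = 0; i < cts.length; i++) {
            tot += cts[i];
            if (tot > ts / k || i == cts.length - 1) {
                boundaries.add(i);        // close the current merged sub-interval
                tot = 0;                  // reset Tot_CTS, as on the slide
                ts = tsStep;              // reset TS = m/n
            } else {
                ts += tsStep;             // merge with the next interval: TS = TS + m/n
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // CTS[0..2] = 3, 3, 8 as in the walkthrough; the remaining counts are made up.
        int[] cts = { 3, 3, 8, 20, 16, 18, 16 };
        System.out.println(mergeBoundaries(cts, 84, 7, 3));   // prints [2, 3, 4, 5, 6]
    }
}
```

With these counts the loop reproduces the walkthrough exactly: the first three intervals merge into one sub-interval, and five optimal merged sub-intervals come out overall, matching the final frequency chart.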




[Bar chart: frequencies with the first optimal merged sub-interval highlighted (y-axis 0–16).]
 In this case too, Tot_CTS > TS/3, so we don't merge
  and instead create another optimal merged
  sub-interval
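
Once the boundaries are fixed, Scan 4 maps every raw value to the merged sub-interval containing it. A minimal sketch, assuming the upper bounds from Scan 3 are available as an ascending array (names hypothetical):

```java
// Illustrative sketch of Scan 4: assign each value to its merged sub-interval.
public final class Scan4 {

    /**
     * cuts[j] is the upper bound of merged sub-interval j, in ascending order,
     * with the last entry equal to dmax. Returns the sub-interval index for v.
     */
    static int assign(double v, double[] cuts) {
        for (int j = 0; j < cuts.length; j++) {
            if (v <= cuts[j]) return j;   // first sub-interval whose bound covers v
        }
        return cuts.length - 1;           // guard for v == dmax with rounding slack
    }
}
```

Because there are only a handful of merged sub-intervals, a linear search is enough; a binary search would work just as well.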




[Bar chart: frequencies with the next optimal merged sub-interval highlighted (y-axis 0–16).]
[Bar chart: frequencies and optimal merged sub-intervals over CTS[0]–CTS[6] (y-axis 0–20).]
Final Frequency Chart

[Bar chart: frequencies of the final optimal merged sub-intervals, CTS[0]–CTS[4] (y-axis 0–45).]
MIL Discretization Algorithm

 Characteristics:
   Supervised
   Local
   Split-and-merge
 Features:
   Time complexity Θ(n), where n is the number of training
     instances (other algorithms are Θ(n log n))
   Requires only 4 scans of the training data


Scope for research

 Optimize the value of c
 Optimize the algorithm for the threshold divisor TS/k
  (in the previous slides, k = 3)
 Improve the logic of the discretizer




Uniform Distribution?

[Bar chart: a near-uniform frequency distribution (y-axis 0–7).]
Information loss?

[Bar chart: a highly skewed frequency distribution, suggesting possible information loss (y-axis 0–16).]
Optimize the value of c

 Datasets used: Iris, Haberman, Transfusion and
  Vertebral Column
 Testing for c = 1 to 25 on each of these datasets
 Using Weka to compare classification accuracy
  against undiscretized data (J48 classifier); a sketch
  of such a comparison follows below
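
The comparison described above can be scripted with Weka's public Java API. A minimal sketch: the dataset path is a placeholder, and the exact evaluation setup used in the project may have differed.

```java
// Illustrative sketch: 10-fold cross-validated J48 accuracy on one dataset,
// to be compared against the same run on a MIL-discretized copy of the data.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public final class AccuracyCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");    // placeholder path
        data.setClassIndex(data.numAttributes() - 1);     // class is the last attribute

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("J48 accuracy: %.2f%%%n", eval.pctCorrect());
        // Repeat on the discretized copy and compare the two accuracies.
    }
}
```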



DEMO…




Iris Dataset




Haberman Data




Transfusion Data




Vertebral Column Data




Conclusion

 The classification accuracy stabilizes after a certain
  value of c and then remains constant
 A steep decrease in classification accuracy is noted
  in certain cases when we move from continuous
  data to data discretized by the algorithm; this
  warrants further investigation of the algorithm


References

 UCI Machine Learning Repository – the source for
  all datasets used in the project
 Bikash Kanti Sarkar, Shib Sankar Sana and
  Kripasindhu Chaudhuri, "MIL: a data discretisation
  approach", International Journal of Data Mining,
  Modelling and Management, Vol. 3, No. 3, 2011,
  pp. 303–318
 Oracle.com online Javadoc
THANK YOU



Editor's Notes

  • Introduction – Most real-world problems involve continuous attributes, each of which can take many values. Converting input data sets with continuous attributes into data sets with discrete attributes is necessary to reduce the range of values; this is the goal of data discretization.
  • Four scans in MIL – Our aim is to design an optimal discretizer based on our project guide's MIL (Minimum Information Loss) algorithm. Many machine learning algorithms cannot handle continuous attributes, whereas all of them can operate on discretized attributes. Even when an algorithm can handle continuous attributes, its performance can be significantly improved by replacing them with their discretized values. Discretized attributes also require less memory and processing time than their non-discretized forms; in addition, much larger rules are produced when processing continuous attributes.
  • Datasets – Haberman: a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. Iris: different classes of Iris flowers. Transfusion: data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. Vertebrae: characteristics of the vertebral column.