MIL (MINIMUM INFORMATION LOSS) DISCRETIZATION
ALGORITHM
(FOR DESIGN OF A DATA DISCRETIZER FOR CLASSIFICATION
PROBLEMS)



Project Guide – Prof. Bikash K. Sarkar
Members – Shashidhar Sundareisan (BE/1343/08) and Gourab Mitra (BE/1232/08)
Introduction

 Discretization is the process of transforming
  continuous models and equations into discrete
  counterparts.
 This process is usually carried out as a first step
  toward making them suitable for numerical
  evaluation and implementation on digital
  computers.




Four scans in MIL

 Scan 1: Calculate dmax and dmin (see the sketch below)
 Scan 2: Calculate the CTS (Calculated Threshold) for each of
  the n intervals between dmin and dmax; each interval has
  width (dmax – dmin)/n
 Scan 3: Calculate the optimal merged sub-intervals
 Scan 4: Discretize each attribute value into one of
  the optimal merged sub-intervals
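
Scan 1 is a single linear pass over the attribute's values. The sketch below is illustrative Java, not the project's code; the class and method names are our own, and the later scans are sketched after the worked example.

```java
// Illustrative sketch of Scan 1: one pass to find dmin and dmax.
public final class Scan1 {

    /** Returns {dmin, dmax} for one continuous attribute. */
    static double[] minMax(double[] values) {
        double dmin = values[0], dmax = values[0];
        for (double v : values) {
            if (v < dmin) dmin = v;
            if (v > dmax) dmax = v;
        }
        return new double[] { dmin, dmax };
    }

    public static void main(String[] args) {
        double[] cgpa = { 6.90, 7.90, 8.00, 5.70, 7.00 };   // rows from the example slide
        double[] r = minMax(cgpa);
        System.out.printf("dmin = %.2f, dmax = %.2f%n", r[0], r[1]);   // dmin = 5.70, dmax = 8.00
    }
}
```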

Example
Name                          CGPA                  Grade
Alice                         6.90                  Average
Bob                           7.90                  Good
Catherine                     8.00                  Excellent
Doug                          5.70                  Poor
Elena                         7.00                  Average
……                            …..                   ……

 •    CGPA is a continuous attribute
 •    s = 4 {'Excellent', 'Good', 'Average', 'Poor'}
 •    c = 3 (constant value chosen by the user)
 •    n = c · s (number of sub-intervals) = 12
 •    m = 88 (instances of training data; see the sketch below)
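
A trivial sketch of the derived quantities, with the values taken from this slide:

```java
// Derived quantities from the example (illustrative).
public final class ExampleParams {
    public static void main(String[] args) {
        int s = 4;                    // grades: Excellent, Good, Average, Poor
        int c = 3;                    // constant chosen by the user
        int m = 88;                   // training instances
        int n = c * s;                // number of sub-intervals = 12
        double ts = (double) m / n;   // TS = 88/12 = 7.33..., shown as ~7 on the next slide
        System.out.printf("n = %d, TS = %.2f%n", n, ts);
    }
}
```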


 Here, dmin = 5.7 and dmax = 8.0
 TS = m/n = 88/12 ≈ 7
 We divide the range into 12 sub-intervals
   Interval              CTS
   5.7 – 5.975           1
   5.975 – 6.25          2
   6.25 – 6.525          6
   6.525 – 6.8           20
   6.8 – 7.075           15
   …                     …
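
Scan 2 can be sketched the same way: split [dmin, dmax] into n equal-width bins and count the training instances falling into each; these counts are the CTS values tabulated above. Illustrative Java with hypothetical names:

```java
// Illustrative sketch of Scan 2: count instances per equal-width sub-interval.
public final class Scan2 {

    /** CTS[i] = number of values falling in sub-interval i of n equal-width bins. */
    static int[] counts(double[] values, double dmin, double dmax, int n) {
        int[] cts = new int[n];
        double width = (dmax - dmin) / n;   // interval width, as on the previous slide
        for (double v : values) {
            int i = (int) ((v - dmin) / width);
            if (i == n) i = n - 1;          // a value equal to dmax belongs to the last bin
            cts[i]++;
        }
        return cts;
    }
}
```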

Example Frequency chart

[Bar chart: frequency of training instances in each initial sub-interval (y-axis 0–40).]
 In the first interval, Tot_CTS < TS/3, so we merge it
  with the next interval. (This chart walkthrough uses
  different counts from the CGPA table: here m/n = 12.)
 Update Tot_CTS = Tot_CTS + CTS[1] (= 3 + 3)
 Update TS = TS + m/n (= 12 + 12)




[Bar chart: frequencies after the first merge (y-axis 0–40).]
 Still, Tot_CTS < TS/3, so we merge again.
 Update Tot_CTS = Tot_CTS + CTS[2] (= 6 + 8)
 Update TS = TS + m/n (= 24 + 12)




[Bar chart: frequencies after the second merge (y-axis 0–40).]
 Now, Tot_CTS > TS/3, so we stop merging and emit an
  optimal merged sub-interval.
 Set Tot_CTS = 0
 Reset TS = m/n
 Then we move on to the next CTS
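
The merge logic just walked through (Scan 3) might look like the sketch below. It is reconstructed from the slides with hypothetical names; k = 3 as in the walkthrough, counts beyond CTS[2] are invented for illustration, and closing the final sub-interval at the end of the loop is our assumption.

```java
// Illustrative sketch of Scan 3: merge adjacent sub-intervals whose cumulative
// count (Tot_CTS) stays below TS/k, emitting optimal merged sub-intervals.
import java.util.ArrayList;
import java.util.List;

public final class Scan3 {

    /** Returns the index of the last base interval in each merged sub-interval. */
    static List<Integer> mergeBoundaries(int[] cts, int m, int n, int k) {
        List<Integer> boundaries = new ArrayList<>();
        double tsStep = (double) m / n;   // m/n, added to TS on every merge
        double ts = tsStep;
        int tot = 0;                      // Tot_CTS
        for (int i = 0; i < cts.length; i++) {
            tot += cts[i];
            if (tot > ts / k || i == cts.length - 1) {
                boundaries.add(i);        // close the current merged sub-interval
                tot = 0;                  // reset Tot_CTS, as on the slide
                ts = tsStep;              // reset TS = m/n
            } else {
                ts += tsStep;             // merge with the next interval: TS = TS + m/n
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // CTS[0..2] = 3, 3, 8 as in the walkthrough; the remaining counts are made up.
        int[] cts = { 3, 3, 8, 20, 16, 18, 16 };
        System.out.println(mergeBoundaries(cts, 84, 7, 3));   // prints [2, 3, 4, 5, 6]
    }
}
```

With these counts the loop reproduces the walkthrough exactly: the first three intervals merge into one sub-interval, and five optimal merged sub-intervals come out overall, matching the final frequency chart.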




[Bar chart: frequencies with the first optimal merged sub-interval highlighted (y-axis 0–16).]
 In this case too, Tot_CTS > TS/3, so we don't merge
  and instead create another optimal merged
  sub-interval
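
Once the boundaries are fixed, Scan 4 maps every raw value to the merged sub-interval containing it. A minimal sketch, assuming the upper bounds from Scan 3 are available as an ascending array (names hypothetical):

```java
// Illustrative sketch of Scan 4: assign each value to its merged sub-interval.
public final class Scan4 {

    /**
     * cuts[j] is the upper bound of merged sub-interval j, in ascending order,
     * with the last entry equal to dmax. Returns the sub-interval index for v.
     */
    static int assign(double v, double[] cuts) {
        for (int j = 0; j < cuts.length; j++) {
            if (v <= cuts[j]) return j;   // first sub-interval whose bound covers v
        }
        return cuts.length - 1;           // guard for v == dmax with rounding slack
    }
}
```

Because there are only a handful of merged sub-intervals, a linear search is enough; a binary search would work just as well.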




[Bar chart: frequencies with the next optimal merged sub-interval highlighted (y-axis 0–16).]
[Bar chart: frequencies and optimal merged sub-intervals over CTS[0]–CTS[6] (y-axis 0–20).]
Final Frequency Chart

[Bar chart: frequencies of the final optimal merged sub-intervals, CTS[0]–CTS[4] (y-axis 0–45).]
MIL Discretization Algorithm

 Characteristics:
   Supervised
   Local
   Split-and-merge
 Features:
   Time complexity Θ(n), where n is the number of training
     instances (other algorithms are Θ(n log n))
   Requires only 4 scans of the training data


Scope for research

 Optimize the value of c
 Optimize the algorithm for the threshold divisor TS/k
  (in the previous slides, k = 3)
 Improve the logic of the discretizer




Uniform Distribution?

[Bar chart: a near-uniform frequency distribution (y-axis 0–7).]
Information loss?

[Bar chart: a highly skewed frequency distribution, suggesting possible information loss (y-axis 0–16).]
Optimize the value of c

 Datasets used: Iris, Haberman, Transfusion and
  Vertebral Column
 Testing for c = 1 to 25 on each of these datasets
 Using Weka to compare classification accuracy
  against undiscretized data (J48 classifier); a sketch
  of such a comparison follows below
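
The comparison described above can be scripted with Weka's public Java API. A minimal sketch: the dataset path is a placeholder, and the exact evaluation setup used in the project may have differed.

```java
// Illustrative sketch: 10-fold cross-validated J48 accuracy on one dataset,
// to be compared against the same run on a MIL-discretized copy of the data.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public final class AccuracyCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");    // placeholder path
        data.setClassIndex(data.numAttributes() - 1);     // class is the last attribute

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("J48 accuracy: %.2f%%%n", eval.pctCorrect());
        // Repeat on the discretized copy and compare the two accuracies.
    }
}
```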



DEMO…




Iris Dataset




Haberman Data




Transfusion Data




Vertebral Column Data




Conclusion

 The classification accuracy stabilizes after a certain
  value of c and then remains constant
 A steep decrease in classification accuracy is noted
  in certain cases when we move from continuous
  data to data discretized by the algorithm; this
  warrants further investigation of the algorithm


References

 UCI Machine Learning Repository – the source for
  all datasets used in the project
 Bikash Kanti Sarkar, Shib Sankar Sana and
  Kripasindhu Chaudhuri, "MIL: a data discretisation
  approach", International Journal of Data Mining,
  Modelling and Management, Vol. 3, No. 3, 2011,
  pp. 303–318
 Oracle.com online Javadoc
THANK YOU



Editor's Notes

  • Introduction – Most real-world problems involve continuous attributes, each of which can take many values. Converting input data sets with continuous attributes into data sets with discrete attributes is necessary to reduce the range of values; this is the goal of data discretization.
  • Four scans in MIL – Our aim is to design an optimal discretizer based on our project guide's MIL (Minimum Information Loss) algorithm. Many machine learning algorithms cannot handle continuous attributes, whereas all of them can operate on discretized attributes. Even when an algorithm can handle continuous attributes, its performance can be significantly improved by replacing them with their discretized values. Discretized attributes also require less memory and processing time than their non-discretized forms; in addition, much larger rules are produced when processing continuous attributes.
  • Datasets – Haberman: a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. Iris: different classes of Iris flowers. Transfusion: data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. Vertebrae: characteristics of the vertebral column.