Presentation slides of the Master's Thesis 'Multivalued Subsets Under Information Theory': an application of metaheuristic search algorithms to an ID3-generated decision tree.
Multivalued Subsets Under Information Theory
1. Multivalued Subsets under Information Theory
THESIS
[Title-slide graphic: a family pedigree chart fanning out from 'Me' through parents, grandparents, and great-grandparents]
Indraneel Dabhade
2. Key Data Mining Tools
• Regression
• Decision Tree
• Neural Network
• Clustering
• Association Rules
3. Information Gain
Gain(A_i, S) = Ent(S) − E(A_i, S)
0 ≤ Gain(A_i, S) ≤ log2 K
Entropy: Ent(S) = −∑_{i=1}^{n} p_i log2 p_i
E(A_i, S): the weighted entropy of the partitions induced by the values of A_i

Example:
A    Classes
A1   Class1
A1   Class2
A2   Class2
A3   Class3
A4   Class3
A5   Class1
A5   Class1
A5   Class2
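The quantities on this slide can be sketched in a few lines of Python; the function names are mine, and the data is the slide's example column A paired with its class column:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S) = -sum_i p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attribute, labels):
    """Gain(A, S) = Ent(S) - E(A, S): entropy minus the weighted entropy
    of the partitions induced by the attribute's values."""
    n = len(labels)
    e = 0.0
    for v in set(attribute):
        part = [y for a, y in zip(attribute, labels) if a == v]
        e += (len(part) / n) * entropy(part)
    return entropy(labels) - e

# The slide's example: attribute column A against the class column
A       = ["A1", "A1", "A2", "A3", "A4", "A5", "A5", "A5"]
classes = ["Class1", "Class2", "Class2", "Class3", "Class3",
           "Class1", "Class1", "Class2"]
print(round(entropy(classes), 4), round(gain(A, classes), 4))  # → 1.5613 0.9669
```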
5. Classifiers
Adaptive Boosting (Basic)
Adaptation from the slides of Michael Collins, Discriminative Reranking for Natural Language Parsing, ICML 2000
Given: m examples (x1, y1), …, (xm, ym) where xi ∈ X, yi ∈ Y = {−1, +1}
Initialize D1(i) = 1/m
For t = 1 to T:
  1. Train learner h_t with min error ε_t = Pr_{i~D_t}[h_t(x_i) ≠ y_i]
  2. Compute the hypothesis weight α_t = (1/2) ln((1 − ε_t)/ε_t)
  3. For each example i = 1 to m:
     D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t}  if h_t(x_i) = y_i
     D_{t+1}(i) = (D_t(i)/Z_t) · e^{+α_t}  if h_t(x_i) ≠ y_i
Output: H(x) = sign(∑_{t=1}^{T} α_t h_t(x))
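The AdaBoost procedure on this slide can be exercised end to end on a tiny hand-made dataset; everything below (the data, the threshold-stump learners) is invented for illustration and is not from the thesis:

```python
import math

# Toy 1-D dataset: no single threshold stump classifies it, so boosting
# must combine several weak learners.
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, -1, -1, 1, 1, -1, -1]

def stumps():
    """Candidate weak learners: sign(x - threshold), in both polarities."""
    for thr in range(0, 9):
        for pol in (1, -1):
            yield lambda x, t=thr, p=pol: p if x > t else -p

def adaboost(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                                   # D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        # 1. Train learner h_t with minimum weighted error eps_t
        h_t, eps = min(((h, sum(d for d, x, yi in zip(D, X, y) if h(x) != yi))
                        for h in stumps()), key=lambda pair: pair[1])
        if eps >= 0.5 or eps == 0:                      # simplification: stop early
            break
        # 2. Hypothesis weight alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        alpha = 0.5 * math.log((1 - eps) / eps)
        # 3. Reweight: D_{t+1}(i) = D_t(i)/Z_t * exp(-alpha_t * y_i * h_t(x_i))
        D = [d * math.exp(-alpha * yi * h_t(x)) for d, x, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h_t))
    # Output H(x) = sign(sum_t alpha_t h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

H = adaboost(X, y)
print([H(x) for x in X])  # → [1, 1, -1, -1, 1, 1, -1, -1]
```

After three rounds the weighted vote of the stumps already fits all eight points, even though the best single stump misclassifies two of them.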
6. Classifiers
Adaptive Boosting (Basic)
Adaptation from the slides of Freund and Schapire (1996)
Weak classifiers combined into a strong classifier:
  ε1 = 0.300, α1 = 0.424
  ε2 = 0.196, α2 = 0.704
  ε3 = 0.344, α3 = 0.323
• Need to extend the 2-class to multi-class learning
• Usage of AdaBoost.M1
7. Classifiers
Classification via Regression

Original data:
Instances  Att1  Att2  Att3  Att4  Classes
1          A1    B1    D1    C1    Class2
2          A1    B1    D1    C2    Class2
3          A2    B1    D1    C1    Class1
4          A3    B2    D1    C1    Class1
5          A3    B3    D2    C1    Class1
6          A3    B3    D2    C2    Class2
7          A2    B3    D2    C1    Class1

Binary regression target for Class 1:
Instances  Att1  Att2  Att3  Att4  Class
1          A1    B1    D1    C1    0
2          A1    B1    D1    C2    0
3          A2    B1    D1    C1    1
4          A3    B2    D1    C1    0
5          A3    B3    D2    C1    1
6          A3    B3    D2    C2    0
7          A2    B3    D2    C1    1

Binary regression target for Class 2 (and so on for each class):
Instances  Att1  Att2  Att3  Att4  Class
1          A1    B1    D1    C1    1
2          A1    B1    D1    C2    1
3          A2    B1    D1    C1    0
4          A3    B2    D1    C1    1
5          A3    B3    D2    C1    0
6          A3    B3    D2    C2    1
7          A2    B3    D2    C1    0

Test query: f(Att1 = A2, Att2 = B1, Att3 = D2, Att4 = C1) = ?
With f(Class1) = 0.1 and f(Class2) = 0.9:
Class(Test query) = Class 2
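One-vs-rest regression of the kind shown on this slide can be reproduced with a least-squares fit over one-hot encodings. The encoding and helper names are mine, and the fitted scores will not literally be the slide's illustrative f(Class1) = 0.1 and f(Class2) = 0.9; this tiny toy fit may even score the query differently:

```python
import numpy as np

# The slide's seven instances; each categorical value becomes one 0/1 column.
rows = [("A1", "B1", "D1", "C1", "Class2"), ("A1", "B1", "D1", "C2", "Class2"),
        ("A2", "B1", "D1", "C1", "Class1"), ("A3", "B2", "D1", "C1", "Class1"),
        ("A3", "B3", "D2", "C1", "Class1"), ("A3", "B3", "D2", "C2", "Class2"),
        ("A2", "B3", "D2", "C1", "Class1")]
values = sorted({v for r in rows for v in r[:4]})

def encode(attrs):
    """One-hot encode a 4-tuple of attribute values."""
    return [1.0 if v in attrs else 0.0 for v in values]

X = np.array([encode(r[:4]) for r in rows])
labels = ["Class1", "Class2"]
# One least-squares regression per class against its 0/1 membership indicator
W = {c: np.linalg.lstsq(X,
                        np.array([1.0 if r[4] == c else 0.0 for r in rows]),
                        rcond=None)[0]
     for c in labels}

query = encode(("A2", "B1", "D2", "C1"))          # the slide's test query
scores = {c: float(np.dot(query, W[c])) for c in labels}
print(scores, max(scores, key=scores.get))
```

The predicted class is simply the arg-max over the per-class regression outputs.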
11. Information Gain evaluation for ID3
Considering the 'Iris' Dataset

Data:
Instances  Att1  ….  Att n
1          A1    ….  N1
2          A1    ….  N2
3          A2    ….  N5
4          A3    ….  N4
5          A3    ….  N6
6          A3    ….  N7
7          A2    ….  N8
8          A1    ….  N4
9          A1    ….  N5
10         A3    ….  N3
11         A1    ….  N2
12         A2    ….  N5
13         A2    ….  N6
14         A3    ….  N8

Class Quanta Identity:
Att1  Class1  Class2  ….  Class n
A1    4       3       ….  4
A2    5       7       ….  5
A3    7       6       ….  0
…
Att n  Class1  Class2  ….  Class n
N1     1       0       ….  5
N2     3       2       ….  9
N3     4       6       ….  3

Information Gain (Iris):
Att  Information Gain
1    0.877
2    0.511
3    1.45
4    1.44
12. Information Gain evaluation for GID
Considering the 'Iris' Dataset

Data:
Instances  Att1  ….  Att n
1          A1    ….  N1
2          A1    ….  N2
3          A2    ….  N5
4          A3    ….  N4
5          A3    ….  N6
6          A3    ….  N7
7          A2    ….  N8
8          A1    ….  N4
9          A1    ….  N5
10         A3    ….  N3
11         A1    ….  N2
12         A2    ….  N5
13         A2    ….  N6
14         A3    ….  N8

Class Quanta Identity (one value vs. the rest):
Att1  Class1  Class2  ….  Class n
A1    4       3       ….  4
Rest  12      13      ….  5
…
Att1  Class1  Class2  ….  Class n
A m   4       3       ….  4
Rest  12      13      ….  5
…
Att n  Class1  Class2  ….  Class n
N1     1       0       ….  5
Rest   7       8       ….  12
…
Att n  Class1  Class2  ….  Class n
NM     1       0       ….  5
Rest   7       8       ….  12

Information Gain (Iris):
Att  Information Gain
1    0.06886155
2    0.06583857
3    0.162349
4    0.3645836
13. NP Hard
What makes the problem interesting? Why is it NP-Hard?

Att1 takes the values A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, which must be partitioned into Subset 1 and Subset 2.

Example data:
Att  Class
A1   1
A1   1
A2   1
A3   2
A3   2
A3   3
A4   4

GID binarization (A1 vs. the rest):
Att  Class
1    1
1    1
0    1
0    2
0    2
0    3
0    4
Information Gain = 0.4695652

MVS binarization ({A1, A2} vs. the rest):
Att  Class
1    1
1    1
1    1
0    2
0    2
0    3
0    4
Information Gain = 0.9852281
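The two gain values on this slide can be verified directly with a short script; the helper names are mine, and the data is the slide's seven-row example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, subset):
    """Information gain of the binary split: value in `subset` vs. the rest."""
    in_l  = [y for v, y in zip(values, labels) if v in subset]
    out_l = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    cond = (len(in_l) / n) * entropy(in_l) + (len(out_l) / n) * entropy(out_l)
    return entropy(labels) - cond

att     = ["A1", "A1", "A2", "A3", "A3", "A3", "A4"]
classes = [1, 1, 1, 2, 2, 3, 4]

gid = info_gain(att, classes, {"A1"})         # GID: one value vs. the rest
mvs = info_gain(att, classes, {"A1", "A2"})   # MVS: a multivalued subset vs. the rest
print(round(gid, 7), round(mvs, 7))  # → 0.4695652 0.9852281
```

Allowing multivalued subsets more than doubles the gain here, but the number of candidate subsets grows exponentially in the number of attribute values, which is what makes the search hard.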
14. Information Gain evaluation for MVS
Considering the 'Iris' Dataset

Data:
Instances  Att1  ….  Att n
1          A1    ….  N1
2          A1    ….  N2
3          A2    ….  N5
4          A3    ….  N4
5          A3    ….  N6
6          A3    ….  N7
7          A2    ….  N8
8          A1    ….  N4
9          A1    ….  N5
10         A3    ….  N3
11         A1    ….  N2
12         A2    ….  N5
13         A2    ….  N6
14         A3    ….  N8

Class Quanta Identity (a multivalued subset vs. the rest, per attribute):
Att1      Class1  Class2  ….  Class n
Subset 1  4       3       ….  4
Subset 2  12      13      ….  5
…
Att n     Class1  Class2  ….  Class n
Subset 1  1       0       ….  5
Subset 2  7       8       ….  12

Information Gain (Iris):
Att  Information Gain
1    0.128627
2    0.120512
3    0.345634
4    0.618695
15. Testing Conditions

Datasets:
Dataset     Instances  Attributes  Unique Values  Data Type   Missing Values
Iris        150        4           22-43          Fractional  No
Glass       214        9           32-178         Fractional  No
Images      4435       36          49-104         Integer     Missing Class '6'
PenDigits*  7494*      16          96-101         Integer     No
Vehicles    846        18          13-424         Integer     No
* Reduced to 4350

Palmetto High Performance Computing
Wall-time runs of 50 hours (in parallel)
Use of the 'Mersenne Twister' pseudorandom number generator

Classification Algorithms:
Classifier      Time to compute  Nature                       Rule Generation
AdaBoost        Low              Deterministic/Probabilistic  Function of sample size and weighted predictions
ID3             Low              Deterministic                Robust Rule
Regression      High             Deterministic                Robust Rule
Naïve Bayesian  Low              Probabilistic                Robust Rule
16. Multisubset variant using the Adaptive Simulated Annealing
generate initial solution
initialize Fl, Fh, Ebest, Econfig
begin
  initialize To, Tend
  while To > Tend
  begin
    initialize Lb, I, Lt
    while Lt < (Lb + I)
    begin
      Binary-Rand(n x 1)
      form CQI for the binary subsets
      if solution < Fl then change Fl
      if solution >= Fh then change Fh
      evaluate Δ = Solcurr − L_Solcurr
      if Δ > 0 then L_Solcurr = Solcurr
      if Δ < 0 then if e^(Δ/To) > Rand(1) then L_Solcurr = Solcurr
      then Ebest = Solcurr
    end
    Lt = Lb + Lb·(1 − e^(−(Fh − Fl)/Fh))
    lower To
  end
end
Data:
Instances  Att1  Att2  Att3  Att4  Classes
1          A1    B1    D1    C1    Class2
2          A1    B1    D1    C2    Class2
3          A4    B1    D1    C3    Class1
4          A3    B2    D1    C4    Class1
5          A3    B3    D2    C5    Class1
6          A3    B3    D2    C6    Class2
7          A2    B3    D2    C5    Class1
8          A4    B4    D3    C3    Class2
9          A1    B3    D2    C1    Class1
10         A3    B2    D2    C2    Class1
11         A5    B2    D6    C7    Class1
12         A2    B2    D5    C4    Class1
13         A2    B1    D2    C3    Class1
14         A3    B2    D4    C2    Class2

Search Space (binary value subsets per attribute):
Att1: {A1, A2} | {A3, A4, A5}
Att2: {B1, B2} | {B1, B2}
Att3: {D1, D2} | {D3, D4, D5, D6}
Att4: {C2, C6, C7, C4} | {C3, C1, C5}

Proportion of the classes: 0.5 … 1 (Max)
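A minimal simulated-annealing search over binary value subsets, in the spirit of the pseudocode on this slide but not a line-for-line translation: it reuses the seven-row example from the NP-Hard slide, and the schedule constants (T0, cooling rate, moves per temperature) are arbitrary choices of mine:

```python
import math, random

random.seed(42)  # Python's PRNG is a Mersenne Twister, as on the Testing Conditions slide

# Seven-row example from the NP-Hard slide
att     = ["A1", "A1", "A2", "A3", "A3", "A3", "A4"]
classes = [1, 1, 1, 2, 2, 3, 4]
values  = sorted(set(att))

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(x) for x in set(labels)))

def gain(mask):
    """Information gain of the binary split: values with mask bit 1 vs. the rest."""
    chosen = {v for v, m in zip(values, mask) if m}
    s1 = [y for v, y in zip(att, classes) if v in chosen]
    s2 = [y for v, y in zip(att, classes) if v not in chosen]
    if not s1 or not s2:
        return 0.0
    n = len(classes)
    return entropy(classes) - (len(s1) / n) * entropy(s1) - (len(s2) / n) * entropy(s2)

def anneal(T0=1.0, T_end=1e-3, cooling=0.9, moves_per_temp=50):
    sol = [random.randint(0, 1) for _ in values]
    best, best_gain = sol[:], gain(sol)
    T = T0
    while T > T_end:
        for _ in range(moves_per_temp):
            cand = sol[:]
            cand[random.randrange(len(values))] ^= 1   # flip one membership bit
            delta = gain(cand) - gain(sol)
            # always accept improvements; accept worse moves with prob e^(delta/T)
            if delta > 0 or random.random() < math.exp(delta / T):
                sol = cand
            if gain(sol) > best_gain:
                best, best_gain = sol[:], gain(sol)
        T *= cooling                                   # lower the temperature
    return best, best_gain

best, g = anneal()
print(round(g, 7))  # → 0.9852281, the MVS gain from the NP-Hard slide
```

On this toy instance the search space is tiny, so the annealer reliably finds an optimal subset split; the point of the thesis is that the same machinery scales to value sets far too large to enumerate.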
18. What Next?
Application of Multivalue Subsets:
• Feature Selection: Supervised, Un-supervised, Semi-supervised
• Discretization: Varying Interval Sizes
• Un-supervising the Supervised
19. Feature Selection
What is Feature Selection?
Selecting a set of attributes that 'increase the predictive performance and builds more compact feature subsets'.

Two of the commonly used Feature Selection techniques:
• Filter: All features → Filter → Feature subset → Classifier
• Wrapper: All features → Wrapper (multiple feature subsets) → Classifier

Adaptation from the slides of 'Introduction to Feature Selection', Isabelle Guyon.
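The two pipelines can be contrasted in a few lines; every feature name, relevance score, and accuracy number below is hypothetical:

```python
# Hypothetical relevance scores; in a real filter these would be, e.g.,
# information-gain values computed once, independently of any classifier.
relevance = {"A": 0.9, "B": 0.7, "C": 0.2, "D": 0.1}

def filter_select(k):
    """Filter: rank features once by a classifier-independent score, keep top k."""
    return sorted(relevance, key=relevance.get, reverse=True)[:k]

def wrapper_select(features, evaluate):
    """Wrapper: greedily grow the subset that the classifier itself scores best."""
    chosen = []
    while True:
        candidates = [chosen + [f] for f in features if f not in chosen]
        if not candidates:
            return chosen
        best = max(candidates, key=evaluate)
        if evaluate(best) <= evaluate(chosen):
            return chosen
        chosen = best

def evaluate(subset):
    """Stand-in for training and scoring a classifier (hypothetical numbers)."""
    table = {(): 0.5, ("A",): 0.7, ("B",): 0.6, ("C",): 0.55, ("D",): 0.5,
             ("A", "C"): 0.75}
    return table.get(tuple(sorted(subset)), 0.6)

print(filter_select(2))                           # → ['A', 'B']
print(wrapper_select(list(relevance), evaluate))  # → ['A', 'C']
```

Note that the two disagree: the filter ranks B above C in isolation, while the wrapper discovers that A and C work well together, which is the usual motivation for paying the wrapper's extra classifier-training cost.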
20. Feature Selection
What is Feature Selection?
• Selecting the set of attributes that contribute most to the user's objective
• This research focuses on identifying a lower bound on the equivalent subset size while ranking (ID3-based gain criterion vs. MVS-based gain criterion)

Traditional Search Method
Objective: Maximizing Classification Accuracy
Search Space:
A     B     C     D     E     F     G     Class
A2    B3    C2    D4    E32   F45   G56   1
A34   B56   C34   C76   E45   F78   G143  2
A45   B67   C45   C89   E67   F89   G210  2
A56   B109  C76   C76   E78   F121  G301  1
…     …     …     …     …     …     …     …
A134  B231  C453  D456  E99   F201  G567  21

Proposed Search Method
Objective: Maximizing the Information Gain
Search Space: the same data table as above
21. Feature Selection
What is Feature Selection?

Sequential Elimination based Selection Process:
1. Rank the features as per the objective values
2. Eliminate the lowest ranked feature
3. Check for classifier performance

Feature Set                     Classifier  Performance
{J, H, E, D, A, C, B, I, G, F}  ID3         98%
{J, H, E, D, A, C, B, I, G}     ID3         97%
{J, H, E, D, A, C, B, I}        ID3         85%
{J, H, E, D, A, C, B}           ID3         80%
{J, H, E, D, A, C}              ID3         87%
{J, H, E, D, A}                 ID3         90%
{J, H, E, D}                    ID3         92%
{J, H, E}                       ID3         89%
{J, H}                          ID3         91%
{J}                             ID3         88%
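The elimination loop can be sketched as follows; the feature ranking and the accuracy-per-size numbers are taken from the slide's table, while the helper names and the stand-in evaluate() are mine:

```python
# The slide's ranking (J highest … F lowest) and its accuracy column,
# keyed here by subset size; evaluate() stands in for training an ID3 tree.
rank = {f: i for i, f in enumerate("FGIBCADEHJ")}      # F lowest, J highest
accuracy_by_size = {10: 98, 9: 97, 8: 85, 7: 80, 6: 87,
                    5: 90, 4: 92, 3: 89, 2: 91, 1: 88}

def evaluate(subset):
    return accuracy_by_size[len(subset)]

def sequential_elimination(features):
    """Rank features, then repeatedly drop the lowest-ranked one,
    recording classifier performance for every intermediate subset."""
    history = []
    current = sorted(features, key=rank.get, reverse=True)
    while current:
        history.append((list(current), evaluate(current)))
        current = current[:-1]                         # eliminate lowest-ranked
    return history

hist = sequential_elimination("JHEDACBIGF")
best_subset, best_acc = max(hist, key=lambda h: h[1])
print(len(best_subset), best_acc)  # → 10 98
```

Recording the whole history matters because, as the slide's table shows, the accuracy curve is not monotone: dropping features can hurt at one size and recover at a smaller one.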
24. Implications of the work
1. The research identified subsets that effectively provided better information gain values.
2. The Feature Selection process identified subsets that provided a lower bound on the classification error.

Contribution to the field of Industrial Engineering: Feature Selection.
When identifying a subset of factors that would provide better classifier performance, the proposed method can be used effectively, subject to additional testing.