This is our fourth-year class presentation for the Data Mining course. It covers a paper of the same title as this presentation, published in Algorithms in November 2017. The paper improves the traditional ID3 decision tree algorithm and shows that the improved version performs better on large datasets.
Improvement of ID3 Algorithm Based on Simplified Information Entropy and Coordination Degree
1. Improvement of ID3 Algorithm Based on
Simplified Information Entropy and
Coordination Degree
Md. Ahasanul Alam (10)
Mustafizur Rahman (22)
2. About The Paper
Authors:
Yingying Wang, Yibin Li, Yong Song, Xuewen Rong, and Shuaishuai Zhang
Published at:
Algorithms, a monthly peer-reviewed journal published by MDPI.
Date: November 2017
3. Iterative Dichotomiser 3 (ID3)
● A traditional decision tree classification algorithm
● Uses information gain as its attribute selection measure
● Entropy:
○ The expected information needed to classify a tuple in D
● Information Gain:
○ The expected reduction in entropy after partitioning D on an attribute A: Gain(A) = Info(D) - Info_A(D) (see the sketch below)
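A minimal Python sketch of these two measures on a toy dataset (the data and names below are illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - Info_A(D): the entropy reduction from splitting on A."""
    total = len(labels)
    # Partition the class labels by the value of the chosen attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / total * entropy(part)
                 for part in partitions.values())
    return entropy(labels) - info_a

# Toy data: each row is (outlook, windy); the last list holds the class labels.
rows = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, 0, labels))  # splitting on outlook -> 1.0 bit
```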
4. Limitations of ID3
● The logarithmic expressions require extra calculation time
● ID3 tends to choose multi-valued attributes first
● No control over the size of the decision tree
5. Improvement of ID3
● Simplifying Information Entropy
○ Replace the logarithm with the four basic arithmetic operations (+, -, *, /)
○ Achieved by a Taylor series expansion
● Removing Multi-value Bias problem
○ A weight is introduced for each attribute
○ Each weight equals the reciprocal of the number of distinct values of the attribute
● Minimizing Uncontrollable Tree Size
○ A pruning step applied at runtime, while the tree is built
○ Utilizes the dependency of the label attribute on the condition attributes
6. Simplifying Information Entropy (Removing Log term)
● Assume a database D has
○ p positive examples and n negative examples
● An attribute A takes V different values; the subset for the i-th value contains p_i positive and n_i negative examples
Applying the Taylor expansion ln(1 + x) ≈ x to the logarithms, the entropy of each subset simplifies to

I(p_i, n_i) = -\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i} - \frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i} \approx \frac{2 p_i n_i}{(p_i+n_i)^2 \ln 2} (3)

and the expected information for attribute A becomes

E(A) = \sum_{i=1}^{V} \frac{p_i+n_i}{p+n} I(p_i, n_i) \approx \frac{2}{(p+n)\ln 2} \sum_{i=1}^{V} \frac{p_i n_i}{p_i+n_i} (4)

Since the factor 2/((p+n) ln 2) is the same constant for every attribute, attributes can be ranked by the log-free quantity \sum_{i=1}^{V} p_i n_i/(p_i+n_i), which uses only the four basic arithmetic operations.
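A sketch comparing the exact expected information with the log-free surrogate, assuming the derivation above; the split counts are illustrative. Both quantities rank the two hypothetical attributes in the same order:

```python
import math

def info(p, n):
    """Exact binary entropy I(p, n) of a subset, in bits."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * math.log2(p / t) - (n / t) * math.log2(n / t)

def expected_info(splits, total):
    """E(A) = sum_i (p_i + n_i) / (p + n) * I(p_i, n_i)."""
    return sum((p + n) / total * info(p, n) for p, n in splits)

def simplified_info(splits):
    """Log-free surrogate sum_i p_i * n_i / (p_i + n_i); the dropped factor
    2 / ((p + n) * ln 2) is identical for every attribute, so the ranking
    of attributes is preserved."""
    return sum(p * n / (p + n) for p, n in splits if p + n > 0)

# (positives, negatives) in each subset induced by an attribute's values
attr1 = [(3, 1), (1, 3)]  # hypothetical two-valued attribute
attr2 = [(2, 2), (2, 2)]  # another hypothetical two-valued attribute
print(expected_info(attr1, 8), expected_info(attr2, 8))  # ~0.811 < 1.0
print(simplified_info(attr1), simplified_info(attr2))    # 1.5 < 2.0, same order
```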
13. Removing Multi-value Bias problem
Gain(D,number) = 0.5
Gain(D,color) = 2.65
Gain(D,Body Shape) = 0.5
Gain(D,Hair Type) = 0.15
Fig: Decision tree after removing the multi-value bias problem
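A small sketch of the weighting step, assuming the weight (the reciprocal of the attribute's number of distinct values) is simply multiplied into the raw gain; the numbers below are hypothetical, not the paper's:

```python
def weighted_gain(raw_gain, distinct_value_count):
    """Scale information gain by 1 / (number of distinct attribute values),
    penalizing many-valued attributes such as IDs or serial numbers."""
    return raw_gain * (1.0 / distinct_value_count)

# Hypothetical numbers: an ID-like attribute wins on raw gain but loses
# to a two-valued attribute once the weight is applied.
print(weighted_gain(0.94, 14))  # many-valued attribute -> ~0.067
print(weighted_gain(0.50, 2))   # two-valued attribute  -> 0.25
```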
14. Minimizing Uncontrollable Tree Size
● The dependency of the label attribute D on an attribute att, known as the coordination degree CON(att -> D), is defined as the percentage of tuples that fall in groups where every tuple sharing the same att value also shares the same label value
● An example, computed from the table below (and verified in the sketch after it): CON(A -> D) = 60%, CON(B -> D) = 40%. The three a1 tuples all agree on the label yes (3/5 = 60%), while only the two b1 tuples agree on theirs (2/5 = 40%)
A B D
a1 b1 yes
a1 b2 yes
a1 b2 yes
a2 b1 yes
a2 b2 no
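A Python sketch that computes the coordination degree from the table above and reproduces the 60% and 40% values (the function name is ours, not the paper's):

```python
from collections import defaultdict

def coordination_degree(rows, attr_col, label_col):
    """CON(att -> D): fraction of tuples lying in groups where every tuple
    with the same att value also carries the same label value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_col]].append(row[label_col])
    consistent = sum(len(lbls) for lbls in groups.values()
                     if len(set(lbls)) == 1)
    return consistent / len(rows)

# The example table above, columns (A, B, D)
rows = [("a1", "b1", "yes"), ("a1", "b2", "yes"), ("a1", "b2", "yes"),
        ("a2", "b1", "yes"), ("a2", "b2", "no")]
print(coordination_degree(rows, 0, 2))  # CON(A -> D) = 0.6
print(coordination_degree(rows, 1, 2))  # CON(B -> D) = 0.4
```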
15. Minimizing Uncontrollable Tree Size
● Pruning step (see the sketch below):
If CON(C_parent -> D) >= CON(C_child -> D), then replace the child node with a majority class label
● Example data table
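A sketch of the pruning rule, continuing from the previous sketch (it reuses coordination_degree and rows); the function and its arguments are illustrative names, not the paper's API:

```python
from collections import Counter

def expand_or_prune(parent_rows, child_rows, parent_attr, child_attr, label_col):
    """If the child split adds no coordination over the parent split,
    replace the child node with the majority class label of its tuples."""
    con_parent = coordination_degree(parent_rows, parent_attr, label_col)
    con_child = coordination_degree(child_rows, child_attr, label_col)
    if con_parent >= con_child:
        # Prune: the child becomes a leaf carrying the majority label.
        return Counter(r[label_col] for r in child_rows).most_common(1)[0][0]
    return None  # keep the child as an internal node and split it further

# With the table above, CON(A -> D) = 0.6 >= CON(B -> D) = 0.4, so a child
# that would split on B under a parent split on A is pruned to the leaf "yes".
print(expand_or_prune(rows, rows, 0, 1, 2))  # -> "yes"
```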
16. Minimizing Uncontrollable Tree Size
Fig: Decision tree reduced by the ID3 algorithm
Fig: Decision tree reduced by the improved algorithm