The document provides study notes on decision tree algorithms, specifically ID3. It explains that ID3 is suitable for categorical data and provides an example play ball dataset. It then describes the ID3 algorithm which calculates entropy and gain to choose the attribute with highest gain to split the data recursively until a decision can be made. Finally, it mentions the implementation of these concepts in C# including classes to calculate entropy and gain, a decision tree class, and outputting the rules.
2. Use and Sample Data
A decision tree based on the ID3 algorithm is suitable for processing data whose
inputs (attributes) and output (decision) are both categorical.
An example of such data is the Play Ball decision table (apparently everyone
learning ML uses this one):
Outlook   Temperature  Humidity  Wind    Play ball
Sunny     Hot          High      Weak    No
Sunny     Hot          High      Strong  No
Overcast  Hot          High      Weak    Yes
Rain      Mild         High      Weak    Yes
Rain      Cool         Normal    Weak    Yes
Rain      Cool         Normal    Strong  No
Overcast  Cool         Normal    Strong  Yes
Sunny     Mild         High      Weak    No
Sunny     Cool         Normal    Weak    Yes
Rain      Mild         Normal    Weak    Yes
Sunny     Mild         Normal    Strong  Yes
Overcast  Mild         High      Strong  Yes
Overcast  Hot          Normal    Weak    Yes
Rain      Mild         High      Strong  No
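As a minimal sketch, the table above could be held in memory as a list of typed rows. The names PlayBallRow and PlayBallData are hypothetical (they are not from the original implementation), and the record syntax assumes C# 9 or later.

```csharp
using System.Collections.Generic;

// Hypothetical row type: one property per column of the Play Ball table.
public record PlayBallRow(string Outlook, string Temperature, string Humidity, string Wind, string PlayBall);

public static class PlayBallData
{
    // The 14 rows of the table, in the same order as listed above.
    public static readonly IReadOnlyList<PlayBallRow> Rows = new List<PlayBallRow>
    {
        new PlayBallRow("Sunny",    "Hot",  "High",   "Weak",   "No"),
        new PlayBallRow("Sunny",    "Hot",  "High",   "Strong", "No"),
        new PlayBallRow("Overcast", "Hot",  "High",   "Weak",   "Yes"),
        new PlayBallRow("Rain",     "Mild", "High",   "Weak",   "Yes"),
        new PlayBallRow("Rain",     "Cool", "Normal", "Weak",   "Yes"),
        new PlayBallRow("Rain",     "Cool", "Normal", "Strong", "No"),
        new PlayBallRow("Overcast", "Cool", "Normal", "Strong", "Yes"),
        new PlayBallRow("Sunny",    "Mild", "High",   "Weak",   "No"),
        new PlayBallRow("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
        new PlayBallRow("Rain",     "Mild", "Normal", "Weak",   "Yes"),
        new PlayBallRow("Sunny",    "Mild", "Normal", "Strong", "Yes"),
        new PlayBallRow("Overcast", "Mild", "High",   "Strong", "Yes"),
        new PlayBallRow("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
        new PlayBallRow("Rain",     "Mild", "High",   "Strong", "No"),
    };
}
```

The table contains 9 "Yes" and 5 "No" decisions, which is the class balance the entropy calculation starts from.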
3. Algorithm
Calculate the entropy of the data set using the formula
$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the probability of each possible decision outcome $i$.
The initial entropy is calculated from the entire data set;
subsequently, entropy is calculated from each subset of the data (after the split).
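The entropy formula can be sketched in a few lines of C#. This is only an illustration, not the ID3EntropyFactor() function from these notes; the class name EntropySketch and the choice to pass the decision labels as strings are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EntropySketch
{
    // Entropy(S) = -sum(p * log2(p)) over the possible decision outcomes.
    // 'decisions' holds the decision label of every row in the (sub)set.
    public static double Entropy(IEnumerable<string> decisions)
    {
        var labels = decisions.ToList();
        if (labels.Count == 0) return 0.0; // an empty set carries no uncertainty

        double entropy = 0.0;
        foreach (var outcome in labels.GroupBy(label => label))
        {
            // p = share of rows that have this decision outcome
            double p = (double)outcome.Count() / labels.Count;
            entropy -= p * Math.Log(p, 2.0);
        }
        return entropy;
    }
}
```

For the full Play Ball table (9 "Yes", 5 "No") this gives approximately 0.940. A subset with only one outcome gives 0, and an even two-way split gives 1.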
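Then calculate the gain of each attribute using the formula
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in A} p(S_v)\,\mathrm{Entropy}(S_v)$$
where $S_v$ is the subset of $S$ in which attribute $A$ has category value $v$,
and $p(S_v) = |S_v| / |S|$ is the probability of that category within the attribute.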
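A minimal sketch of the gain calculation for a single attribute follows. It is not the ID3MaxGain() function from these notes; the class name GainSketch and the input shape (a list of attribute value and decision pairs) are assumptions made to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GainSketch
{
    // Entropy of a list of decision labels: -sum(p * log2(p)).
    public static double Entropy(IEnumerable<string> decisions)
    {
        var labels = decisions.ToList();
        return labels.GroupBy(label => label)
                     .Select(g => (double)g.Count() / labels.Count)
                     .Sum(p => -p * Math.Log(p, 2.0));
    }

    // Gain = Entropy(S) minus the weighted entropy of each category subset.
    // Each element pairs a row's value for the attribute with that row's decision.
    public static double Gain(IReadOnlyList<(string Value, string Decision)> rows)
    {
        double gain = Entropy(rows.Select(r => r.Decision));
        foreach (var subset in rows.GroupBy(r => r.Value))
        {
            // p = share of rows that fall into this category
            double p = (double)subset.Count() / rows.Count;
            gain -= p * Entropy(subset.Select(r => r.Decision));
        }
        return gain;
    }
}
```

Worked by hand on the Play Ball table: Outlook splits the rows into Sunny (2 Yes, 3 No), Overcast (4 Yes) and Rain (3 Yes, 2 No), so the gain is about 0.940 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) ≈ 0.247. The same calculation for Wind gives about 0.048.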
4. Algorithm (continued)
After the gain for every attribute has been calculated, choose the attribute that has the
highest gain. This attribute is used as the next decision branch.
For each category value of the chosen attribute, filter the current training
data set to the rows with that value and repeat the calculation on the subset.
This is done recursively until either:
the number of samples in the processed data set is too small
OR
a decision can be made from the processed data set (only 1 possible decision
outcome remains)
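The recursion described above can be sketched as follows. This is an illustration under stated assumptions, not the DecisionTree class from these notes: Node, Id3Sketch, the dictionary-based rows and the minSamples threshold are all hypothetical, and the extra stop condition "no attributes left" plus the majority-vote leaf are additions that the notes do not mention.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type: a leaf carries a Decision; an inner node carries the
// attribute it splits on and one child per category value.
public class Node
{
    public string Decision;
    public string Attribute;
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();
}

public static class Id3Sketch
{
    static double Entropy(IEnumerable<string> decisions)
    {
        var labels = decisions.ToList();
        return labels.GroupBy(l => l)
                     .Select(g => (double)g.Count() / labels.Count)
                     .Sum(p => -p * Math.Log(p, 2.0));
    }

    static double Gain(List<Dictionary<string, string>> rows, string attribute, string target)
    {
        // Weighted entropy of the subsets produced by splitting on 'attribute'.
        double remainder = rows.GroupBy(r => r[attribute])
            .Sum(g => (double)g.Count() / rows.Count * Entropy(g.Select(r => r[target])));
        return Entropy(rows.Select(r => r[target])) - remainder;
    }

    // Each row maps a column name to its category value; 'target' is the decision column.
    public static Node Build(List<Dictionary<string, string>> rows, List<string> attributes,
                             string target, int minSamples = 2)
    {
        if (rows.Count == 0) throw new ArgumentException("rows must not be empty");

        var decisions = rows.Select(r => r[target]).ToList();
        // Most common decision, used when we stop on a subset that is not pure.
        string majority = decisions.GroupBy(d => d).OrderByDescending(g => g.Count()).First().Key;

        // Stop: one possible outcome, too few samples, or (assumption) no attributes left.
        if (decisions.Distinct().Count() == 1 || rows.Count < minSamples || attributes.Count == 0)
            return new Node { Decision = majority };

        // Choose the attribute with the highest gain as the next decision branch.
        string best = attributes.OrderByDescending(a => Gain(rows, a, target)).First();
        var node = new Node { Attribute = best };
        var remaining = attributes.Where(a => a != best).ToList();

        // Filter the rows by each category value and repeat on the subset.
        foreach (var subset in rows.GroupBy(r => r[best]))
            node.Children[subset.Key] = Build(subset.ToList(), remaining, target, minSamples);
        return node;
    }
}
```

Called on the 14 Play Ball rows with the attributes Outlook, Temperature, Humidity and Wind, this sketch splits first on Outlook. The Overcast branch is a "Yes" leaf, the Sunny branch splits on Humidity (High: No, Normal: Yes), and the Rain branch splits on Wind (Weak: Yes, Strong: No).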
5. Implementation (C#)
Most of the code and examples I found are in Python. To simplify my learning
without having to learn Python, I used C# to build the following components:
A file loader that stores the data in a DataSet (I can then apply filters using a
simple SQL-like query or a lambda query)
An entropy calculator function ID3EntropyFactor() that takes in an attribute,
where each attribute has a flag to indicate whether it has been processed.
A max gain calculator function ID3MaxGain() that calls ID3EntropyFactor(),
passing the attributes that are not disabled.
A generic class DecisionTree() which I can use with different algorithms in
the future (e.g. C4.5, CART, etc.)
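The two filtering styles mentioned for the DataSet component can be sketched on a System.Data.DataTable (the table type a DataSet holds). This is a guess at the general shape, not the actual loader: the class name DataSetFilterSketch, the column name PlayBall (written without a space to keep filter expressions simple) and the four hard-coded sample rows are assumptions.

```csharp
using System.Data;
using System.Linq;

public static class DataSetFilterSketch
{
    // Builds a small DataTable with the Play Ball columns (first four rows only).
    public static DataTable BuildSample()
    {
        var table = new DataTable("PlayBall");
        foreach (var name in new[] { "Outlook", "Temperature", "Humidity", "Wind", "PlayBall" })
            table.Columns.Add(name, typeof(string));

        table.Rows.Add("Sunny", "Hot", "High", "Weak", "No");
        table.Rows.Add("Sunny", "Hot", "High", "Strong", "No");
        table.Rows.Add("Overcast", "Hot", "High", "Weak", "Yes");
        table.Rows.Add("Rain", "Mild", "High", "Weak", "Yes");
        return table;
    }

    // SQL-like filter expression. Values containing a single quote would need
    // escaping before being placed in the expression.
    public static DataRow[] FilterWithExpression(DataTable table, string outlook) =>
        table.Select($"Outlook = '{outlook}'");

    // Equivalent lambda (LINQ) filter over the same table.
    public static DataRow[] FilterWithLambda(DataTable table, string outlook) =>
        table.AsEnumerable()
             .Where(row => row.Field<string>("Outlook") == outlook)
             .ToArray();
}
```

Both methods return the same rows for a given Outlook value, so either style can supply the filtered subset for the next entropy and gain calculation.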