Entropy and Information Gain in Decision Trees
OMega TechEd
Entropy
Entropy is a machine learning metric that measures the unpredictability or impurity in a dataset.
It quantifies the disorder in the information being processed and determines how a decision tree chooses to split the data.
(Figure: an example of High Entropy vs. Low Entropy)
Entropy
A random variable with only one value, such as a coin that always comes up heads, has no uncertainty, so its entropy is defined as zero; we gain no information by observing its value.
For a two-class problem, entropy lies between 0 and 1; with more than two classes in the dataset, it can be greater than 1.
In general, the entropy of a random variable V with values vk, each with probability P(vk), is defined as:
H(V) = − ∑k P(vk) log2 P(vk)
Entropy of a fair coin flip:
H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1
How to calculate Entropy?
H(V) = − ∑k P(vk) log2 P(vk)
Example:
If we had a total of 10 data points in our dataset, with 3 belonging to the positive class and 7 to the negative class:
-3/10 * log2(3/10) - 7/10 * log2(7/10) ≈ 0.881
The entropy is approximately 0.88.
High entropy means low level of purity.
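To make the arithmetic concrete, here is a minimal Python sketch of the entropy formula applied to the 3-positive / 7-negative example above (the entropy function and its list-of-class-counts argument are illustrative, not from the slides):

import math

def entropy(counts):
    # H = -sum(p * log2(p)) over the class proportions; empty classes contribute nothing.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([3, 7]), 3))  # 3 positive, 7 negative -> 0.881
print(entropy([5, 5]))            # perfectly mixed classes -> 1.0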
Entropy (Cont.)
Different cases
(Figure: three cases: Entropy = 0, Entropy = 1, Entropy = 0.88)
If the dataset contains an equal number of positive and negative data points, the entropy is 1.
If the dataset contains only positive or only negative data points, the entropy is 0.
Information Gain
Information gain is the reduction in entropy achieved by splitting the dataset on an attribute; it reflects the pattern that attribute reveals in the data.
Mathematically, information gain can be expressed with the formula below:
Information Gain = (entropy of parent node) - (weighted average entropy of child nodes)
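A minimal sketch of this formula in Python, assuming each node is summarized by a list of its class counts (the helper names are illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Parent entropy minus the size-weighted average entropy of the child nodes.
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

print(information_gain([5, 5], [[5, 0], [0, 5]]))  # a perfect split removes all entropy -> 1.0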
Decision tree using information gain
1. Select the attribute with the highest information gain as the root (parent) node; a minimal sketch of this step follows the list.
2. Build a child node for every value of the chosen attribute.
3. Repeat recursively on each child until the whole tree is constructed.
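A minimal sketch of step 1, assuming each example is a dict of attribute values with a parallel list of class labels (names such as best_attribute are illustrative, not a standard API):

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Group the labels by the attribute's value, then apply the gain formula.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def best_attribute(rows, labels, attributes):
    # Step 1: the attribute with the highest information gain becomes the split.
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

Steps 2 and 3 would then partition the examples by the chosen attribute's values and recurse on each partition.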
Choosing the best attribute
We need a measure of “good” and “bad” attributes. One way to do this is to compute the information gain for each attribute.
Example:
At the root node of the restaurant problem, there are 6 positive (True) and 6 negative (False) examples, so
Entropy(Parent) = 1
(Figure: the Patrons split. Root: 6 positive, 6 negative. None (2 examples): 0 positive, 2 negative. Some (4): 4 positive, 0 negative. Full (6): 2 positive, 4 negative.)
Choosing the best attribute
E(Patrons=None) = 0
E(Patrons=Some) = 0
E(Patrons=Full) = -2/6 * log2(2/6) - 4/6 * log2(4/6)
= -1/3 * (-1.59) - 2/3 * (-0.59)
= 0.53 + 0.39 ≈ 0.92
Weighted average of the children's entropies:
E(Patrons) = 2/12 * 0 + 4/12 * 0 + 6/12 * 0.92 ≈ 0.46
Choosing the best attribute
Information Gain = (entropy of parent node) - (weighted average entropy of child nodes)
Gain(Patrons) = 1 - 0.46 ≈ 0.54
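A quick arithmetic check of this result, assuming the class counts shown in the Patrons figure (None: 0 positive / 2 negative, Some: 4 / 0, Full: 2 / 4):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [6, 6]                        # 6 positive, 6 negative at the root
children = [[0, 2], [4, 0], [2, 4]]    # Patrons = None, Some, Full
weighted = sum(sum(c) / sum(parent) * entropy(c) for c in children)
print(round(entropy(parent) - weighted, 2))  # Gain(Patrons) -> 0.54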
Choosing the best attribute
E(Type=French) = 1
E(Type=Italian) = 1
E(Type=Thai) = 1
E(Type=Burger) = 1
Weighted average of the children's entropies:
E(Type) = 2/12 * 1 + 2/12 * 1 + 4/12 * 1 + 4/12 * 1 = 1
(Figure: the Type split. Root: 6 positive, 6 negative. French (2 examples): 1 positive, 1 negative. Italian (2): 1 positive, 1 negative. Thai (4): 2 positive, 2 negative. Burger (4): 2 positive, 2 negative.)
Choosing the best attribute
Information Gain = (entropy of parent node) - (weighted average entropy of child nodes)
Gain(Type) = 1 - 1 = 0
This confirms that Patrons is a better attribute than Type; in fact, at the root Patrons gives the highest information gain of any attribute.
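Running the same check for both attributes, assuming the class counts from the two figures, confirms the greedy choice at the root:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    # Parent entropy minus the size-weighted average entropy of the children.
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [6, 6]
print(round(gain(parent, [[0, 2], [4, 0], [2, 4]]), 2))          # Patrons -> 0.54
print(round(gain(parent, [[1, 1], [1, 1], [2, 2], [2, 2]]), 2))  # Type    -> 0.0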
Thank you
Reference:
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd ed.
