Assignment 4
Exercise 1: Pruning
1. Which problem do we try to address when using pruning?
“Overfitting and lack of generalization beyond training data, i.e. models that describe the training data
(too) well, but do not model the principles and characteristics underlying the data.”
On the diagram level, pruning merges a part of the tree into a single node. The difference can be depicted in two diagrams: the original tree, and the same tree with the subtree replaced by one node.
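As a minimal sketch of what such a merge can look like in code, here is one concrete pruning scheme, reduced-error pruning; the Node class, the (attribute-dict, label) instance format, and the use of a held-out validation set are our own illustrative assumptions, not part of the lecture material:

```python
class Node:
    def __init__(self, attribute=None, children=None, label=None, majority=None):
        self.attribute = attribute      # attribute tested at this inner node
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # class label if this node is a leaf
        self.majority = majority        # majority class of the training
                                        # instances that reached this node

    def is_leaf(self):
        return not self.children

def classify(node, attrs):
    while not node.is_leaf():
        node = node.children[attrs[node.attribute]]
    return node.label

def accuracy(root, instances):
    return sum(classify(root, a) == c for a, c in instances) / len(instances)

def prune(root, node, validation):
    """Reduced-error pruning: bottom-up, merge a subtree into a leaf
    predicting its majority class, unless that hurts validation accuracy."""
    for child in list(node.children.values()):
        if not child.is_leaf():
            prune(root, child, validation)
    before = accuracy(root, validation)
    saved = (node.attribute, node.children, node.label)
    node.attribute, node.children, node.label = None, {}, node.majority
    if accuracy(root, validation) < before:   # merge made things worse: undo
        node.attribute, node.children, node.label = saved
```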
2. Describe the purpose of separating the data into training, development, and test data.
“Training data is used to build the model, and test data to test it. The training data by itself cannot measure how well the model will perform on (i.e., generalize to) unseen data. Test data measures this, but we should not use the test data to directly inform our model construction. For this purpose a third set is used: the development data set, which behaves like the test set but whose feedback can be used to change the model.”
The training set is used to fit the classifier: in general, the more data we train on, the more accurate the resulting model will be.
The other two sets are used to evaluate the classifier's performance. The development set is used to compare the accuracy of different configurations of our classifier; it is called the development set because we evaluate on it repeatedly while developing the model.
In the end we have a model that performs well on the development data. To estimate how well this model will deal with genuinely new data, we evaluate it once on the test data.
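As a minimal sketch of such a three-way split, assuming scikit-learn is available (the 60/20/20 ratio and the synthetic dataset are illustrative choices, not part of the assignment):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off the test set, then carve the development set out of
# the remainder: 60% train / 20% dev / 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%

# Workflow: fit candidate models on X_train, compare configurations on
# X_dev, and evaluate only the final chosen model once on X_test.
```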
Exercise 2: Information Gain and Attributes with Many Values
Information gain is defined as:

Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

According to this definition, information gain favors attributes with many values. Why? Give an example.
We use a training set with (as shown in the table):
• n instances
• attributes A1 … Ak, plus an attribute A* that takes a distinct value Vi for every instance

      A1   …   Ak      A*   class
1     T    …   Black   V1   C1
2     T    …   White   V2   C2
…     …    …   …       …    …
n     F    …   Black   Vn   Cn
Assume a binary classification into '+' and '−'. Because A* takes a distinct value Vi for every instance, each subset S_Vi contains exactly one instance, which is either positive or negative. We note this as follows:

S_Vi(A*) = [1+, 0−]  or  [0+, 1−]
We can calculate the entropy (uncertainty) of both kinds of one-instance subsets, using the convention 0 · log2 0 = 0:

H([1+, 0−]) = −(1/1 · log2(1/1) + 0/1 · log2(0/1)) = 0
H([0+, 1−]) = −(0/1 · log2(0/1) + 1/1 · log2(1/1)) = 0
To calculate the information gain of A* we apply the formula:

Gain(S, A*) = H(S) − Σ_i (|S_Vi| / |S|) · H(S_Vi)
Gain(S, A*) = H(S) − Σ_i (1/n) · 0
Gain(S, A*) = H(S)

Every subset S_Vi has entropy 0, so nothing is subtracted from H(S): the information gain of A* is maximal. An attribute with a distinct value for every instance, such as an ID number or a date, therefore always wins the gain comparison, even though it is useless for classifying new instances.
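This bias is easy to reproduce numerically. Below is a minimal sketch in plain Python (the four-instance dataset and the attribute names a1 and a_star are our own illustrative assumptions): the unique-valued attribute receives the full H(S) as gain, while a reasonable binary attribute receives less.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    rest = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        rest += len(sub) / n * entropy(sub)
    return entropy(labels) - rest

labels = ['+', '+', '+', '-']
a1     = ['T', 'T', 'F', 'F']   # a reasonable binary attribute
a_star = [1, 2, 3, 4]           # unique value per instance (e.g. an ID)

print(entropy(labels))          # ~0.811 = H(S)
print(gain(a1, labels))         # ~0.311
print(gain(a_star, labels))     # ~0.811 -> the full H(S), maximal
```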
Exercise 3: Missing Attribute Values
Consider the following set of training instances. Instance 2 has a missing value for attribute a1. Apply at least two different strategies for dealing with missing attribute values and show how they work in this concrete example.
Example 1:
We can predict the missing true/false value of attribute a1 by looking at attribute a2. Within a2 there is an equal chance of a 'true' value and a 'false' value (50% each). We could assume the same distribution for a1. Following this line of reasoning, the missing value could be filled in as 'false'.
Example 2:
We can also focus on the class attribute. For a2 we can state the following:
• There is a 100% chance of a '+' class when the attribute is 'true'.
• There is a 50% chance of a '+' class when the attribute is 'false'.
Following this line of reasoning, we should write 'true' at the question mark.
Example 3:
Now we look only at attribute a1 itself. Among the instances where a1 is known we get:
P(true) = 2/3
P(false) = 1/3
We can either fill in the most common value ('true'), or split the incomplete instance into fractional instances weighted by these probabilities, as C4.5 does.
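A small sketch of these strategies in code (plain Python; the value counts are the ones derived above, since the original table is not reproduced here):

```python
from collections import Counter

# Known values of a1 (2 x true, 1 x false, as derived above).
a1_known = [True, True, False]

# Strategy A: fill the gap with the attribute's most common value.
fill = Counter(a1_known).most_common(1)[0][0]        # -> True

# Strategy B: fill it with the most common value among the instances
# that share the class of the incomplete instance (needs the per-class
# counts from the table, as in Example 2 above).

# Strategy C (C4.5-style): do not commit to one value; split the
# incomplete instance into fractional instances weighted by P(v).
n = len(a1_known)
weights = {v: c / n for v, c in Counter(a1_known).items()}
print(fill)      # True
print(weights)   # {True: 0.666..., False: 0.333...}
```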
Exercise 4: Regression Trees
1. What are the stopping conditions for decision trees predicting discrete classes?
1. All instances under a node have the same label.
2. All attributes have been used along a branch.
3. There are no instances under a node.
Because every instance carries one label out of a predefined set of outcomes, each leaf can commit to exactly one of them. We have seen this with the weather example from the lecture: the predefined outcomes 'yes' and 'no' determine when a branch can stop.
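These three conditions are exactly the base cases of the recursive tree-building procedure. A minimal sketch, assuming instances are (attribute-dict, label) pairs (an illustrative format, not from the lecture):

```python
def stop(instances, remaining_attributes):
    """Base cases of ID3-style tree construction for discrete classes."""
    if not instances:                       # 3. there are no instances left
        return True
    if len({label for _, label in instances}) == 1:
        return True                         # 1. all instances share one label
    if not remaining_attributes:            # 2. all attributes have been used
        return True
    return False
```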
2. Why and how do the stopping conditions have to be changed for decision trees that predict numerical values (e.g., regression trees)?
1. Measure the standard deviation of the target values of all instances under a node. If this value is below a pre-defined threshold, we stop.
2. and 3. as before.
Instead of predicting a discrete value like 'yes' or 'no', the tree now predicts a number that can lie anywhere within a range; e.g., for temperature we predict a specific degree instead of 'hot' or 'warm'. Since the numerical target values under a node will rarely be exactly identical, condition 1 can almost never be met literally and must be relaxed: we stop once the values are sufficiently homogeneous, i.e. their standard deviation falls below a threshold, and the leaf predicts their mean. With this adaptation we still have a full set of stopping conditions for the tree.
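A sketch of the adapted base case, continuing the instance format assumed above (the threshold value is an arbitrary illustration):

```python
from statistics import pstdev

SD_THRESHOLD = 0.5   # illustrative pre-defined value

def stop_numeric(instances, remaining_attributes):
    """Adapted base cases for numerical targets (regression trees)."""
    if not instances or not remaining_attributes:   # conditions 2 and 3 as before
        return True
    targets = [y for _, y in instances]
    return pstdev(targets) <= SD_THRESHOLD          # relaxed condition 1

# The leaf created when we stop predicts the mean of its target values.
```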