Machine Learning with
DECISION TREES
by
19BQ1A0125
Agenda
● What are decision trees?
● Appropriate problems for DTrees
● Information gain, Entropy
● Issues with DTrees
● Overfitting
What are Decision trees?
● A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
● A type of supervised learning algorithm.
Example
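Such a tree can be read as nested attribute tests. A minimal sketch in Python, assuming a hypothetical "play tennis" example (the attributes and values below are illustrative, not taken from the slides):

```python
# Minimal sketch of a decision tree as nested conditionals.
# The "play tennis" attributes and values are illustrative assumptions.

def play_tennis(outlook, humidity, wind):
    """Each branch node tests one attribute; each leaf returns a decision."""
    if outlook == "sunny":
        # Branch node: decide based on humidity
        return "no" if humidity == "high" else "yes"
    elif outlook == "overcast":
        return "yes"  # leaf node: a final decision
    else:  # outlook == "rain"
        return "no" if wind == "strong" else "yes"

print(play_tennis("sunny", "high", "weak"))   # -> no
print(play_tennis("rain", "normal", "weak"))  # -> yes
```

Classifying a new instance means following one path from the root to a leaf, answering one attribute test per branch node.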
Appropriate problems for DTrees
● Instances are represented by attribute-value pairs
● The target function has discrete output values
● The training data may contain errors
● The training data may have missing attribute values
Which node can be described easily?
➔ A less impure node requires less information to describe it.
➔ A more impure node requires more information.
So we can conclude: information theory quantifies this degree of disorganization in a system with a measure known as entropy.
Entropy - measuring
homogeneity of a learning set
● Entropy is a measure of the uncertainty about a source of messages.
● Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this classification is:

    Entropy(S) = −Σᵢ pᵢ log₂ pᵢ

  where pᵢ is the proportion of S belonging to class i.
● Entropy is 0 if all the members of S belong to the same class.
● Entropy is 1 when the collection contains an equal number of +ve and -ve examples.
● Entropy is between 0 and 1 if the collection contains unequal numbers of +ve and -ve examples.
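The definition above fits in a few lines of Python. A minimal sketch where `entropy` takes per-class counts (the example counts are illustrative):

```python
import math

def entropy(counts):
    """Entropy(S) = -sum(p_i * log2(p_i)), summed over classes with p_i > 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))   # equal +ve and -ve examples -> 1.0
print(entropy([10, 0]))  # all members in one class   -> 0.0
print(entropy([9, 5]))   # unequal mix -> between 0 and 1
```

Classes with zero examples are skipped, matching the convention 0 · log₂ 0 = 0.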
Information gain
● Decides which attribute goes into a decision node.
● To minimize the decision tree depth, the attribute with the largest entropy reduction is the best choice!
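The entropy reduction for a candidate split is Gain(S, A) = Entropy(S) − Σ (|Sᵥ|/|S|) · Entropy(Sᵥ). A minimal sketch, assuming a hypothetical binary attribute that splits 14 examples (9 +ve, 5 −ve) into groups of [6+, 2−] and [3+, 3−] (the counts are illustrative):

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv))."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

# Hypothetical split of 14 examples by one binary attribute.
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 4))  # -> 0.0481
```

At each decision node, the attribute with the highest gain over the remaining examples is selected.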
ISSUES
● How deeply to grow the decision tree?
● Handling continuous attributes
● Choosing an appropriate attribute selection measure
● Handling training data with missing attribute values
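For continuous attributes, one common approach (used, for example, in C4.5) is to sort the values and evaluate candidate thresholds at midpoints between adjacent distinct values, keeping the one with the lowest weighted child entropy. A minimal sketch with illustrative data:

```python
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels):
    """Pick the midpoint threshold minimizing weighted child entropy."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no candidate between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        counts = lambda labs: [labs.count(0), labs.count(1)]
        weighted = (len(left) * entropy(counts(left))
                    + len(right) * entropy(counts(right))) / len(pairs)
        if weighted < best_e:
            best_t, best_e = t, weighted
    return best_t

# Hypothetical data: a continuous feature vs. a binary label.
print(best_threshold([10, 15, 20, 25, 30, 35], [0, 0, 0, 1, 1, 1]))  # -> 22.5
```

The chosen threshold turns the continuous attribute into a binary test (value ≤ t or value > t) that the tree can branch on.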
Overfitting in Decision Trees
● If a decision tree is fully grown, it may lose some generalization capability.
● This phenomenon is known as overfitting.
Overfitting
A hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training examples).
Overfitting is an error that occurs in data modeling when a function aligns too closely to a minimal set of data points.
Financial professionals risk overfitting a model to limited data and ending up with flawed results.
When a model has been compromised by overfitting, it may lose its value as a predictive tool for investing.
A data model can also be underfitted, meaning it is too simple, with too few data points to be effective.
Causes of Overfitting
● Overfitting due to presence of noise: mislabeled instances may contradict the class labels of other similar records.
● Overfitting due to lack of representative instances: a lack of representative instances in the training data can prevent refinement of the learning algorithm.
Overfitting Due to Noise: An Example
Overfitting Due to Lack of Samples
Overfitting Due to Lack of Samples
● Although the model’s training error
is zero, its error rate on the test set
is 30%.
● Humans, elephants, and dolphins are misclassified because the decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals.
A good model must not only fit the training data well but also accurately classify records it has never seen.
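This gap between training and test error can be sketched with a toy contrast (the data and the threshold rule below are illustrative assumptions): a model that simply memorizes its training set has zero training error but fails on unseen records, while a simpler rule generalizes.

```python
# Toy illustration of overfitting via memorization (illustrative data).
train = [(1, "A"), (2, "A"), (3, "B"), (4, "B")]  # (feature, label)
test = [(0, "A"), (5, "B")]

# "Overfit" model: a lookup table over the training examples.
memorized = dict(train)
def overfit_predict(x):
    return memorized.get(x, "A")  # arbitrary default on unseen inputs

# Simpler hypothesis learned from the same data: a single threshold.
def simple_predict(x):
    return "A" if x <= 2.5 else "B"

def error(predict, data):
    return sum(predict(x) != y for x, y in data) / len(data)

print(error(overfit_predict, train), error(overfit_predict, test))  # 0.0 0.5
print(error(simple_predict, train), error(simple_predict, test))    # 0.0 0.0
```

The memorizer fits the training data perfectly yet errs on half the test records; the threshold rule fits the training data equally well and also classifies the unseen records correctly.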
Thank You