Made by: Deopura Karan (130410107014)
Submitted to: Mitali Sonar
Decision tree Induction
 The training dataset should be class-labelled for learning the decision tree.
 A decision tree represents rules and is a very popular tool for classification
and prediction.
 The rules are easy to understand and can be used directly in SQL to retrieve
records.
 There are many algorithms to build a decision tree:
o ID3 (Iterative Dichotomiser 3)
o C4.5
o CART (Classification and Regression Tree)
o CHAID (Chi-squared Automatic Interaction Detector)
Decision tree Representation
 A decision tree has a tree-like structure consisting of decision nodes and
leaf nodes.
 A leaf node is the last node of each branch.
 A decision node is a node of the tree whose children are leaf nodes or
sub-trees.
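As a rough illustration (not part of the slides, all class and field names are assumed), the two kinds of nodes can be represented like this in Python: a decision node tests an attribute and routes a record to a child, while a leaf node simply stores the class label at the end of a branch.

from __future__ import annotations

class LeafNode:
    def __init__(self, label):
        self.label = label          # class assigned at the end of a branch

    def classify(self, record):
        return self.label

class DecisionNode:
    def __init__(self, test, children):
        self.test = test            # function mapping a record to a branch key
        self.children = children    # branch key -> LeafNode or DecisionNode

    def classify(self, record):
        # follow the branch selected by the test until a leaf is reached
        return self.children[self.test(record)].classify(record)

# The tree from the last slide: "Age < 25", then "Car Type in {Sports}"
tree = DecisionNode(
    test=lambda r: r["Age"] < 25,
    children={
        True: LeafNode("High"),
        False: DecisionNode(
            test=lambda r: r["Car Type"] == "Sports",
            children={True: LeafNode("High"), False: LeafNode("Low")},
        ),
    },
)

print(tree.classify({"Age": 30, "Car Type": "Sports"}))  # -> High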
Attribute Selection
 Attributes for a decision tree are selected by one of the following methods:
1. Gini index (IBM Intelligent Miner)
2. Information gain (ID3/C4.5)
3. Gain ratio
 Attributes are categorized into two types:
1. Attributes whose domain is numerical are called numerical attributes.
2. Attributes whose domain is non-numerical are called categorical attributes.
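As a small aside (not from the slides; the helper name is hypothetical), the two kinds of attributes can be told apart by inspecting their domains, for example with pandas dtypes:

import pandas as pd
from pandas.api.types import is_numeric_dtype

def split_attributes(df: pd.DataFrame):
    """Split a table's columns into numerical and categorical attributes."""
    numerical = [c for c in df.columns if is_numeric_dtype(df[c])]
    categorical = [c for c in df.columns if not is_numeric_dtype(df[c])]
    return numerical, categorical

data = pd.DataFrame({
    "Age": [23, 17, 43, 68, 32, 20],
    "Car Type": ["Family", "Sports", "Sports", "Family", "Truck", "Family"],
    "Risk": ["High", "High", "High", "Low", "Low", "High"],
})
print(split_attributes(data))  # (['Age'], ['Car Type', 'Risk'])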
Gini index
 It can be adapted for categorical attributes.
 Used in CART, SPRINT and IBM's Intelligent Miner system.
 For a data set T containing examples from n classes, where pj is the relative
frequency of class j in T, the Gini index is
gini(T) = 1 − Σj pj²
 The attribute providing the smallest Gini index is chosen to split the node.
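To make the formula concrete, here is a minimal sketch (function names are assumed) that computes the Gini index of a set of class labels and the weighted Gini of a candidate split; the split with the smallest weighted value is preferred.

from collections import Counter

def gini(labels):
    """Gini index 1 - sum_j p_j^2 for a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(partitions):
    """Weighted Gini of a split, given the label lists of its partitions."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# Risk labels from the training-set slide, split on "Age < 25"
left = ["High", "High", "High"]   # ages 23, 17, 20
right = ["High", "Low", "Low"]    # ages 43, 68, 32
print(gini(left + right))         # Gini before the split (about 0.44)
print(gini_split([left, right]))  # smaller value -> better split (about 0.22)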
Information gain
 It can be adapted for continuous-valued attributes as well as categorical data.
 The attribute with the highest information gain is selected for the split.
 If attribute A partitions S into subsets Si, and Si contains pi examples of P
and ni examples of N, the expected information needed to classify objects in
all the subtrees Si is
E(A) = Σi ((pi + ni)/(p + n)) · I(pi, ni)
Entropy
 The entropy I(p, n) is the expected amount of information needed to assign a
class to a randomly drawn object in S, where S contains p examples of class P
and n examples of class N:
I(p, n) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
 The information gain gain(A) measures the reduction in entropy achieved
because of the split on attribute A:
Gain(A) = I(p, n) − E(A)
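A minimal sketch of I(p, n), E(A) and Gain(A) for the two-class case used in the formulas above (function names are assumed, and the example counts come from the training-set slide later on):

import math

def info(p, n):
    """I(p, n): entropy of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                   # 0 * log2(0) is taken as 0
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def expected_info(subsets, p, n):
    """E(A): weighted entropy of the subsets produced by splitting on A."""
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in subsets)

def gain(subsets, p, n):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - expected_info(subsets, p, n)

# Training set: 4 "High" (positive) and 2 "Low" (negative) examples.
# Splitting on "Age < 25" gives subsets with (3, 0) and (1, 2) examples.
print(gain([(3, 0), (1, 2)], p=4, n=2))   # information gain of the Age split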
Strength of decision tree
 Decision trees are able to generate understandable rules.
 They perform classification without requiring much computation.
 They handle categorical as well as continuous variables.
 They provide a clear indication of which fields are most important.
Weakness of decision tree
 Not suitable for prediction of continuous attributes.
 Computationally expensive to train.
Tree Pruning
 There are two types:
1. Prepruning
 Start pruning at the beginning, while building the tree itself.
 Stop the tree construction at an early stage.
 Avoid splitting a node by checking a threshold.
2. Postpruning
 Build the full tree, then start pruning.
 Use a data set different from the training data to obtain the best pruned tree.
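As an aside (not from the slides), scikit-learn's DecisionTreeClassifier exposes both ideas: prepruning through stopping thresholds such as max_depth and min_samples_split, and postpruning through cost-complexity pruning (ccp_alpha) tuned on data held out from training. The data set and threshold values below are assumptions made only for the sketch.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
# Hold out data that is NOT used for training, as the slide suggests for postpruning.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early by thresholds on depth / node size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the full tree, then prune with cost-complexity pruning,
# keeping the alpha that does best on the held-out set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_valid, y_valid),
)
print(pre.score(X_valid, y_valid), best.score(X_valid, y_valid))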
A Training set

Age   Car Type   Risk
23    Family     High
17    Sports     High
43    Sports     High
68    Family     Low
32    Truck      Low
20    Family     High
Decision Tree

The tree built from the training set above:
 Age < 25 → Risk = High
 Age ≥ 25 and Car Type in {Sports} → Risk = High
 Age ≥ 25 and Car Type not in {Sports} → Risk = Low
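For completeness, a sketch (not from the slides) of fitting a CART-style tree to this training set with scikit-learn; the categorical Car Type column is one-hot encoded because DecisionTreeClassifier expects numeric inputs.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Age":      [23, 17, 43, 68, 32, 20],
    "Car Type": ["Family", "Sports", "Sports", "Family", "Truck", "Family"],
    "Risk":     ["High", "High", "High", "Low", "Low", "High"],
})

# One-hot encode the categorical attribute; keep Age as a numeric attribute.
X = pd.get_dummies(data[["Age", "Car Type"]])
y = data["Risk"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))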