Decision Tree
Splitting Indices, Splitting Criteria, Decision Tree Construction Algorithms
Data Mining — Dr. Iram Naim, Dept. of CSIT, MJPRU
Constructing decision trees
 Strategy: top down
Recursive divide-and-conquer fashion
 First: select attribute for root node
Create branch for each possible attribute value
 Then: split instances into subsets
One for each branch extending from the node
 Finally: repeat recursively for each branch, using
only instances that reach the branch
 Stop if all instances have the same class
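As a rough Python sketch of this top-down, divide-and-conquer procedure (an ID3-style outline with illustrative names; the splitting criterion here is information gain, one of the criteria discussed later):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    """Information gain obtained by splitting `rows` on attribute `attr`."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in set(r[attr] for r in rows):
        branch = [r[target] for r in rows if r[attr] == value]
        after += len(branch) / len(rows) * entropy(branch)
    return before - after

def build_tree(rows, attributes, target):
    """Top-down, recursive divide-and-conquer tree construction."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                  # all instances have the same class -> leaf
        return classes[0]
    if not attributes:                          # no attribute left -> majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    node = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):    # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build_tree(subset, rest, target)
    return node

# e.g. build_tree(weather_rows, ["Outlook", "Temp", "Humidity", "Windy"], "Play")
# returns a nested dict such as {"Outlook": {"sunny": {...}, "overcast": "yes", ...}}
```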
Play or not?
• The weather
dataset
Which attribute to select?
Best Split
1. Evaluation of the splits for each attribute and selection of the best split (determination of the splitting attribute)
2. Determination of the splitting condition on the selected splitting attribute
3. Partitioning of the data using the best split
Splitting Indices
 Determining the goodness of a split
1. Information gain (from information theory; based on entropy)
2. Gini index (from economics; a measure of diversity)
Computing purity: the information measure
• Information is a measure of the reduction of uncertainty.
• It represents the expected amount of information that would be needed to “place” a new instance in a given branch.
Which attribute to select?
Final decision tree
 Splitting stops when data can’t be split any further
Criterion for attribute selection
 Which is the best attribute?
 Want to get the smallest tree
 Heuristic: choose the attribute that produces the
“purest” nodes
 Information gain: increases with the average purity of the subsets
 Strategy: choose the attribute that gives the greatest information gain
How to compute Information Gain: Entropy
1. When the number of either yes or no is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no are equal, the information reaches its maximum, because we are very uncertain about the outcome.
3. Complex scenarios: the measure should be
applicable to a multiclass situation, where a multi-
staged decision must be made.
Entropy: Formulas
 Formulas for computing entropy:
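The formulas themselves appear only as an image on this slide; in standard notation (for class proportions p1, …, pn and class counts c1, …, cn) they are:

```latex
\mathrm{entropy}(p_1,\ldots,p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i,
\qquad
\mathrm{info}([c_1,\ldots,c_n]) = \mathrm{entropy}\!\left(\frac{c_1}{\sum_j c_j},\ldots,\frac{c_n}{\sum_j c_j}\right)
```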
Entropy: Outlook, sunny
 Formula for computing the entropy of the sunny branch (2 yes, 3 no):
info([2,3]) = –(2/5)·log2(2/5) – (3/5)·log2(3/5) ≈ 0.971 bits
Measures: Information & Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of the reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
Example: Outlook
Computing Information Gain
 Information gain: information before splitting –
information after splitting
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
 Information gain for attributes from weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
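A quick numerical check of gain(Outlook) in a few lines of Python (the helper name `info` is illustrative, not from the slides):

```python
import math

def info(counts):
    """Entropy (in bits) of a class distribution given as counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

before = info([9, 5])                                 # 0.940 bits for the full weather data
branches = [[2, 3], [4, 0], [3, 2]]                   # sunny, overcast, rainy
after = sum(sum(b) / 14 * info(b) for b in branches)  # 0.693 bits
print(round(before - after, 3))                       # 0.247 -> gain(Outlook)
```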
Information Gain Drawbacks
 Problematic: attributes with a large number
of values (extreme case: ID code)
Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
Tree stump for ID code attribute
 Entropy of the split (see Weka book 2011: 105–108): each ID-code branch contains a single instance, so its entropy is zero and the information after the split is zero
 Information gain is therefore maximal for ID code (namely 0.940 bits)
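The formula referred to above appears as an image on the slide; worked out, with each of the 14 single-instance branches contributing zero entropy:

```latex
\mathrm{info}_{\text{ID}}(D)=\sum_{i=1}^{14}\tfrac{1}{14}\,\mathrm{info}([1,0]\ \text{or}\ [0,1])=0,
\qquad
\mathrm{gain}(\text{ID code})=\mathrm{info}([9,5])-0=0.940\ \text{bits}.
```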
Information Gain Limitations
 Problematic: attributes with a large number
of values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
 (Another problem: fragmentation)
Gain ratio
 Gain ratio: a modification of the information gain
that reduces its bias
 Gain ratio takes number and size of branches into
account when choosing an attribute
 It corrects the information gain by taking the intrinsic
information of a split into account
 Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information is needed to tell which branch an instance goes to); information about the class is disregarded
Gain ratios for the weather data
Outlook:     info = 0.693, gain = 0.940 – 0.693 = 0.247, split info = info([5,4,5]) = 1.577, gain ratio = 0.247/1.577 = 0.157
Temperature: info = 0.911, gain = 0.940 – 0.911 = 0.029, split info = info([4,6,4]) = 1.557, gain ratio = 0.029/1.557 = 0.019
Humidity:    info = 0.788, gain = 0.940 – 0.788 = 0.152, split info = info([7,7]) = 1.000, gain ratio = 0.152/1.000 = 0.152
Windy:       info = 0.892, gain = 0.940 – 0.892 = 0.048, split info = info([8,6]) = 0.985, gain ratio = 0.048/0.985 = 0.049
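These numbers can be reproduced in a couple of lines; for Outlook, the split info is the entropy of the branch sizes [5, 4, 5] (the `info` helper is the same illustrative one used in the information-gain snippet above):

```python
import math

def info(counts):
    """Entropy (in bits) of a distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

gain = 0.940 - 0.693                 # gain(Outlook) from the previous slide
split_info = info([5, 4, 5])         # branch sizes: sunny=5, overcast=4, rainy=5
print(round(split_info, 3))          # 1.577
print(round(gain / split_info, 3))   # 0.157 -> gain ratio(Outlook)
```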
More on the gain ratio
 “Outlook” still comes out top
 However: “ID code” has greater gain ratio
 Standard fix: ad hoc test to prevent splitting on that
type of attribute
 Problem with gain ratio: it may overcompensate
 May choose an attribute just because its intrinsic
information is very low
 Standard fix: only consider attributes with greater
than average information gain
Gini index
 All attributes are assumed continuous-
valued
 Assume there exist several possible split
values for each attribute
 May need other tools, such as
clustering, to get the possible split
values
 Can be modified for categorical attributes
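The two formula slides here are images; for reference, the standard definitions they correspond to are (with p_i the relative frequency of class i in data set D, and a binary split partitioning D into D1 and D2):

```latex
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^{2},
\qquad
\mathrm{Gini}_{A}(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2),
\qquad
\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_{A}(D)
```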
Splitting Criteria
 Let A be a numeric-valued attribute: we must determine the best split point for A (binary split)
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1
 The point with the minimum expected information requirement for A is selected as the split point
Split
 D1 is the set of tuples in D satisfying A ≤ split-point
 D2 is the set of tuples in D satisfying A > split-point
Binary Split
 Numeric-valued attributes
 Examine each possible split point: the midpoint between each pair of (sorted) adjacent values is taken as a candidate split point
 For each split point, compute the weighted sum of the impurity of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point)
 The point that gives the minimum Gini index for attribute A is selected as its split point
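A minimal sketch of this search over candidate split points for a numeric attribute, using the Gini index as impurity (function names are illustrative, not library code):

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class frequencies."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Find the split point of a numeric attribute that minimises the weighted Gini index."""
    pairs = sorted(zip(values, labels))
    best_point, best_impurity = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                     # no midpoint between equal values
        point = (pairs[i][0] + pairs[i + 1][0]) / 2      # midpoint of adjacent sorted values
        left = [lab for v, lab in pairs if v <= point]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > point]   # D2: A >  split-point
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best_impurity:
            best_point, best_impurity = point, weighted
    return best_point, best_impurity

# e.g. best_numeric_split([85, 80, 83, 70, 68], ["no", "no", "yes", "yes", "yes"])
```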
Class Histogram
Two class histograms are used to store the class
distribution for numerical attributes.
Binary Split
 Categorical attributes
 Examine the partitions resulting from all possible subsets of {a1, …, av}
 Each subset SA defines a binary test on attribute A of the form “A ∈ SA?”
 There are 2^v possible subsets; excluding the full set and the empty set leaves 2^v − 2 candidate subsets
 The subset that gives the minimum Gini index for attribute A is selected as its splitting subset
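A sketch of the categorical case: enumerate the non-empty proper subsets of the attribute's values and keep the one with the lowest weighted Gini index (illustrative names, assuming the same `gini` helper as above):

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_categorical_split(values, labels):
    """Best binary test "A in S?" over the non-empty proper subsets S of the attribute's values."""
    domain = sorted(set(values))
    best_subset, best_impurity = None, float("inf")
    for size in range(1, len(domain)):                   # subset sizes 1 .. v-1
        for subset in combinations(domain, size):
            chosen = set(subset)
            left = [lab for v, lab in zip(values, labels) if v in chosen]       # A in S
            right = [lab for v, lab in zip(values, labels) if v not in chosen]  # A not in S
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if weighted < best_impurity:
                best_subset, best_impurity = chosen, weighted
    return best_subset, best_impurity
```

Note that this simple loop visits each split twice (once as S and once as its complement); the 2^v − 2 subsets correspond to half as many distinct binary splits, so a real implementation can skip the duplicates.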
Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
Decision tree construction algorithms
1. Based on information gain
 • ID3
 • C4.5
 • C5.0
 • J48
2. Based on the Gini index
 • SPRINT
 • SLIQ
Iterative Dichotomizer (ID3)
 Quinlan (1986)
 Each node corresponds to a splitting attribute
 Each arc is a possible value of that attribute.
 At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
 Entropy is used to measure how informative a node is.
 The algorithm uses the criterion of information gain to determine the
goodness of a split.
 The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all distinct
values of the attribute.
C4.5 — Quinlan's successor to ID3: selects splits using the gain ratio and extends ID3 to handle numeric attributes and missing values
CART
 A Classification and Regression Tree (CART) is a predictive algorithm used in machine learning.
 It explains how a target variable's values can be
predicted based on other values.
 It is a decision tree where each fork is a split in a
predictor variable and each node at the end has a
prediction for the target variable.
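For context, scikit-learn's DecisionTreeClassifier is a CART-style implementation; a minimal usage sketch (assumes scikit-learn is installed, and the Iris data set is just a stand-in example, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# CART-style tree: binary splits chosen by the Gini index
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))   # predictions for the first five instances
```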
Decision Tree Induction Methods
 SLIQ (1996 — Mehta et al.)
Builds an index for each attribute; only the class list and the current attribute list reside in memory
 SPRINT (1996 — J. Shafer et al.)
Constructs an attribute list data structure.
Both algorithms:
Pre-sort and use attribute lists
Recursively construct the decision tree
Use the Gini index
Re-write the dataset – expensive!
 CLOUDS: Approximate version of SPRINT.
 PUBLIC (1998 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop growing the
tree earlier
 RainForest (1998 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
 BOAT (1999 — Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
Random Forest
 Random Forest is an example of ensemble learning, in which
we combine multiple machine learning algorithms to obtain
better predictive performance.
Two key concepts that give it the name random:
 A random sampling of training data set when building trees.
 Random subsets of features considered when splitting nodes.
A technique known as bagging is used to create an ensemble of trees: multiple training sets are generated by sampling with replacement.
In bagging, the data set is divided into N samples using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined by voting or averaging; the individual models can be trained in parallel (see the sketch below).
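A minimal sketch of the same idea with scikit-learn's RandomForestClassifier (again assuming scikit-learn is available; the parameters shown are the bagging and random-feature-subset knobs described above, and the Iris data is only a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# 100 trees, each grown on a bootstrap sample (bagging) and considering a
# random subset of features at every split (max_features="sqrt").
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # combined (majority-vote) predictions
```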
The End