Decision Trees in the Big Picture
• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)
Example
age          income  veteran  college_educated  support_hillary
youth        low     no       no                no
youth        low     yes      no                no
middle_aged  low     no       no                yes
senior       low     no       no                yes
senior       medium  no       yes               no
senior       medium  yes      no                yes
middle_aged  medium  no       yes               no
youth        low     no       yes               no
youth        low     no       yes               no
senior       high    no       yes               yes
youth        low     no       no                no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
senior       high    no       yes               no
The last column, support_hillary, holds the class labels.
Example
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                ?????
[Figure: decision tree with age at the root, income and college_educated as inner nodes, and yes/no leaves]
Inner nodes are ATTRIBUTES
Branches are attribute VALUES
Leaves are class-label VALUES
Example
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                yes (predicted)
[Figure: the same decision tree, marked with the ANSWER]
Example
[Figure: the same decision tree, shown next to the rules induced from it]
A rule is generated for each leaf.
Example
Induced Rules:
The youth do not support Hillary.
All who are middle-aged and low-income support Hillary.
Seniors support Hillary.
Nested IF-THEN:
IF age == youth
THEN support_hillary = no
ELSE IF age == middle_aged AND income == low
THEN support_hillary = yes
ELSE IF age == senior
THEN support_hillary = yes
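The nested IF-THEN above maps directly onto ordinary code. The sketch below is one possible rendering in Python; the function name predict_support is hypothetical, and because the slide lists only three rules, any other combination is returned as None rather than guessed.

```python
from typing import Optional


def predict_support(age: str, income: str) -> Optional[str]:
    """Apply the three induced rules from the slide, in order."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    # Assumption: cases not covered by the three listed rules stay undecided.
    return None


print(predict_support("youth", "low"))
print(predict_support("middle_aged", "low"))
print(predict_support("middle_aged", "high"))
```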
How do you construct one?
1. Select an attribute to place at the root node
and make one branch for each possible value.
[Figure: the root node age splits the entire training set (14 tuples) into youth (5 tuples), middle_aged (4 tuples) and senior (5 tuples)]
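Step 1 amounts to grouping the training tuples by the value of the chosen attribute. A minimal sketch, assuming the tuples are transcribed from the example table in column order; the names ROWS and partition are illustrative only.

```python
from collections import defaultdict

# (age, income, veteran, college_educated, support_hillary)
ROWS = [
    ("youth", "low", "no", "no", "no"),
    ("youth", "low", "yes", "no", "no"),
    ("middle_aged", "low", "no", "no", "yes"),
    ("senior", "low", "no", "no", "yes"),
    ("senior", "medium", "no", "yes", "no"),
    ("senior", "medium", "yes", "no", "yes"),
    ("middle_aged", "medium", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("senior", "high", "no", "yes", "yes"),
    ("youth", "low", "no", "no", "no"),
    ("middle_aged", "high", "no", "yes", "no"),
    ("middle_aged", "medium", "yes", "yes", "yes"),
    ("senior", "high", "no", "yes", "no"),
]


def partition(rows, index):
    """Group rows by the value found in column `index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return dict(groups)


by_age = partition(ROWS, 0)  # column 0 is age
sizes = {value: len(subset) for value, subset in by_age.items()}
print(sizes)
```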
2. For each branch, recursively process the
remaining training examples by choosing an
attribute to split them on. The chosen
attribute cannot be one used in the ancestor
nodes. If at any time all the training examples
in a branch have the same class, stop processing
that part of the tree.
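The two steps can be sketched as a short recursive function. This is an illustration under stated assumptions, not a definitive implementation: the attribute chooser is a placeholder that simply takes the first unused attribute (a real learner would plug in a heuristic), branches are created only for values that actually occur in the subset, and a node with no attributes left falls back to a majority vote. All names (build, first_unused, majority) are hypothetical.

```python
from collections import Counter

ATTRS = ["age", "income", "veteran", "college_educated"]
LABEL = "support_hillary"

# Training set transcribed from the example table, one tuple per line.
DATA = """youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no"""
rows = [dict(zip(ATTRS + [LABEL], line.split())) for line in DATA.splitlines()]


def majority(subset):
    """Most frequent class label in the subset."""
    return Counter(r[LABEL] for r in subset).most_common(1)[0][0]


def first_unused(subset, attrs):
    """Placeholder chooser: take the first attribute not yet used."""
    return attrs[0]


def build(subset, attrs, choose):
    labels = {r[LABEL] for r in subset}
    if len(labels) == 1:      # all examples share one class: stop here
        return labels.pop()
    if not attrs:             # nothing left to split on: majority vote
        return majority(subset)
    attr = choose(subset, attrs)
    remaining = [a for a in attrs if a != attr]  # never reuse an ancestor's attribute
    node = {"attribute": attr, "branches": {}}
    for value in sorted({r[attr] for r in subset}):
        child = [r for r in subset if r[attr] == value]
        node["branches"][value] = build(child, remaining, choose)
    return node


tree = build(rows, ATTRS, first_unused)
print(tree["attribute"], tree["branches"]["youth"])
```

Because the chooser is passed in as a parameter, the same skeleton can be reused with any attribute-selection heuristic.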
Subset: age = youth
age    income  veteran  college_educated  support_hillary
youth  low     no       no                no
youth  low     yes      no                no
youth  low     no       yes               no
youth  low     no       yes               no
youth  low     no       no                no
[Figure: tree so far: age = youth ends in the leaf no; the middle_aged and senior branches are still unprocessed]
Subset: age = middle_aged
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
[Figure: tree so far: age = middle_aged leads to a veteran node with branches yes and no]
[Figure: veteran = yes becomes the leaf yes; the remaining veteran = no tuples are split next on college_educated (branches yes and no)]
Subset: age = middle_aged, veteran = no
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
[Figure: under veteran = no, college_educated = yes becomes the leaf no and college_educated = no becomes the leaf yes]
Subset: age = senior
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  no       yes               no
senior  medium  yes      no                yes
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: the age = senior branch is split on college_educated (branches yes and no)]
Subset: age = senior, college_educated = yes
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: college_educated = yes is split on income (branches low, medium, high)]
No low-income college-educated seniors appear in the training set, so the low branch is labeled by “Majority Vote”: no.
Subset: age = senior, college_educated = yes, income = medium
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
[Figure: income = medium becomes the leaf no]
Subset: age = senior, college_educated = yes, income = high
age     income  veteran  college_educated  support_hillary
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: income = high is split on veteran (branches yes and no)]
“Majority Vote” split: there are no veterans in this subset and the two non-veterans disagree, so both leaves are shown as ???.
Subset: age = senior, college_educated = no
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  yes      no                yes
[Figure: college_educated = no becomes the leaf yes; the two veteran leaves under income = high remain ???]
Cost to grow?
n = number of attributes
D = training set of tuples
O( n * |D| * log|D| )
n * |D| = amount of work at each tree level
log|D| = max height of the tree
How do we minimize the cost?
• Constructing an optimal decision tree is NP-complete
(shown by Hyafil and Rivest)
• Most common approach is “greedy”
• Need a heuristic to pick the “best” attribute to split
on.
• The “best” attribute results in the “purest” split
Pure = all tuples belong to the same class
A good split increases the purity of all child nodes.
Three Heuristics
1. Information gain
2. Gain Ratio
3. Gini Index
Information Gain
• Ross Quinlan’s ID3 (Iterative Dichotomiser 3)
uses info gain as its heuristic.
• Heuristic based on Claude Shannon’s
information theory.
[Figure: HIGH ENTROPY vs. LOW ENTROPY]
Calculate Entropy for D
D = training set (|D| = 14)
m = number of classes (m = 2)
i = 1,…,m
Ci = distinct class (C1 = yes, C2 = no)
Ci,D = tuples in D of class Ci (C1,D = the yes tuples, C2,D = the no tuples)
pi = probability that a random tuple in D belongs to class Ci
= |Ci,D|/|D| (p1 = 5/14, p2 = 9/14)
Entropy(D) = -Σ (i = 1,…,m) pi * log2(pi)
= -[ 5/14 * log2(5/14) + 9/14 * log2(9/14)]
= -[ .3571 * -1.4854 + .6428 * -.6374]
= -[ -.5304 + -.4097] = .9400 bits
Extremes (taking 0 * log2(0) as 0):
-[ 7/14 * log2(7/14) + 7/14 * log2(7/14)] = 1 bit
-[ 1/14 * log2(1/14) + 13/14 * log2(13/14)] = .3712 bits
-[ 0/14 * log2(0/14) + 14/14 * log2(14/14)] = 0 bits
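The entropy calculation is easy to check numerically. The sketch below assumes base-2 logarithms and treats 0 * log(0) as 0; the helper name entropy is illustrative. Computed at full precision, the result for the 5/9 split can differ from the hand-rounded figure in the fourth decimal place.

```python
import math


def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # skip empty classes: 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result


print(entropy([5, 9]))   # 5 yes, 9 no
print(entropy([7, 7]))   # evenly mixed
print(entropy([1, 13]))
print(entropy([0, 14]))  # pure
```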
Entropy for D split by A
A = attribute to split D on (e.g. age)
v = number of distinct values of A (e.g. 3: youth, middle_aged, senior)
j = 1,…,v
Dj = subset of D where A takes its j-th value (e.g. all tuples where age=youth)
Entropy_A(D) = Σ (j = 1,…,v) |Dj|/|D| * Entropy(Dj)
Entropy_age(D) = 5/14 * -[0/5*log2(0/5) + 5/5*log2(5/5)]
+ 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)]
+ 5/14 * -[3/5*log2(3/5) + 2/5*log2(2/5)]
= .6324 bits
Entropy_income(D) = 7/14 * -[2/7*log2(2/7) + 5/7*log2(5/7)]
+ 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)]
+ 3/14 * -[1/3*log2(1/3) + 2/3*log2(2/3)]
= .9140 bits
Entropy_veteran(D) = 3/14 * -[2/3*log2(2/3) + 1/3*log2(1/3)]
+ 11/14 * -[3/11*log2(3/11) + 8/11*log2(8/11)]
= .8609 bits
Entropy_college_educated(D) = 8/14 * -[6/8*log2(6/8) + 2/8*log2(2/8)]
+ 6/14 * -[3/6*log2(3/6) + 3/6*log2(3/6)]
= .8921 bits
Information Gain
Gain(A) = Entropy(D) - Entropy_A(D)
Entropy(D) = entropy of the set of tuples D
Entropy_A(D) = entropy of D split on attribute A
Choose the A with the highest Gain, i.e. the split that most decreases entropy.
Gain(age) = Entropy(D) - Entropy_age(D)
= .9400 - .6324 = .3076 bits
Gain(income) = .9400 - .9140 = .0260 bits
Gain(veteran) = .9400 - .8609 = .0791 bits
Gain(college_educated) = .9400 - .8921 = .0479 bits
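The gains can be reproduced from the per-value class counts alone. In this sketch the (yes, no) counts are tallied by hand from the example table, and the names SPLITS, split_entropy and gain are illustrative. Values computed at full precision may differ slightly in the last decimal from figures obtained with rounded intermediates, but the ranking of the attributes is the same.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute 0."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# (yes, no) class counts for each value of each attribute
SPLITS = {
    "age": {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)},
    "income": {"low": (2, 5), "medium": (2, 2), "high": (1, 2)},
    "veteran": {"yes": (2, 1), "no": (3, 8)},
    "college_educated": {"yes": (2, 6), "no": (3, 3)},
}


def split_entropy(partitions):
    """Entropy of D after the split: size-weighted entropy of each subset."""
    total = sum(sum(counts) for counts in partitions.values())
    return sum(sum(counts) / total * entropy(counts) for counts in partitions.values())


def gain(partitions):
    """Entropy of D before the split minus the entropy after it."""
    class_totals = [sum(column) for column in zip(*partitions.values())]
    return entropy(class_totals) - split_entropy(partitions)


gains = {attr: gain(parts) for attr, parts in SPLITS.items()}
best = max(gains, key=gains.get)
for attr, value in gains.items():
    print(f"Gain({attr}) = {value:.4f} bits")
print("split on:", best)
```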
Entropy with values >2
Entropy = -[7/13*log(7/13) + 2/13*log(2/13) + 2/13*log(2/13) + 2/13*log(2/13)]
= 1.7272 bits
Entropy = -[5/13*log(5/13) + 1/13*log(1/13) + 6/13*log(6/13) + 1/13*log(1/13)]
= 1.6143 bits
ss           age          income  veteran  college_educated  support_hillary
215-98-9343  youth        low     no       no                no
238-34-3493  youth        low     yes      no                no
234-28-2434  middle_aged  low     no       no                yes
243-24-2343  senior       low     no       no                yes
634-35-2345  senior       medium  no       yes               no
553-32-2323  senior       medium  yes      no                yes
554-23-4324  middle_aged  medium  no       yes               no
523-43-2343  youth        low     no       yes               no
553-23-1223  youth        low     no       yes               no
344-23-2321  senior       high    no       yes               yes
212-23-1232  youth        low     no       no                no
112-12-4521  middle_aged  high    no       yes               no
423-13-3425  middle_aged  medium  yes      yes               yes
423-53-4817  senior       high    no       yes               no
Added social security number attribute
[Figure: a split on ss produces 14 branches (215-98-9343 … 423-53-4817), each ending in a single-tuple leaf]
Will Information Gain split on ss?
Yes, because Entropy_ss(D) = 0, so Gain(ss) is as large as it can be.
Each of the 14 subsets holds exactly one tuple and is therefore pure:
Entropy_ss(D) = 14 * 1/14 * -[1/1*log2(1/1) + 0/1*log2(0/1)] = 0
Gain ratio
• C4.5, a successor of ID3, uses this heuristic.
• Attempts to overcome Information Gain’s bias
in favor of attributes with a large number of
values.
GainRatio(A) = Gain(A) / SplitInfo_A(D)
SplitInfo_A(D) = -Σ (j = 1,…,v) |Dj|/|D| * log2(|Dj|/|D|)
Gain(ss) = .9400
SplitInfo_ss(D) = log2(14) = 3.8074
GainRatio(ss) = .2469
Gain(age) = .3076
SplitInfo_age(D) = 1.5774
GainRatio(age) = .1950
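A small numerical sketch of the gain ratio, assuming the usual C4.5 definition in which SplitInfo is the entropy of the partition sizes and GainRatio is Gain divided by SplitInfo. The (yes, no) counts are tallied from the example table; the function and variable names are illustrative, and results are computed at full precision from those counts.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute 0."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(parts):
    """Return (gain, split_info, gain_ratio) for a list of (yes, no) counts."""
    sizes = [sum(p) for p in parts]
    total = sum(sizes)
    after = sum(size / total * entropy(p) for size, p in zip(sizes, parts))
    before = entropy([sum(column) for column in zip(*parts)])
    gain = before - after
    split_info = entropy(sizes)  # same formula, applied to partition sizes
    return gain, split_info, gain / split_info


# age: three partitions of 5, 4 and 5 tuples
age_parts = [(0, 5), (2, 2), (3, 2)]
# ss: fourteen single-tuple partitions (5 yes tuples, 9 no tuples)
ss_parts = [(1, 0)] * 5 + [(0, 1)] * 9

age_result = gain_ratio(age_parts)
ss_result = gain_ratio(ss_parts)
print("age:", [round(x, 4) for x in age_result])
print("ss: ", [round(x, 4) for x in ss_result])
```

Dividing by SplitInfo penalizes ss heavily for its 14-way split, which narrows its lead over age compared with plain information gain.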
Gini Index
• CART uses this heuristic.
• Binary splits.
• Not biased toward multi-value attributes like
Info Gain.
[Figure: a multiway split on age (youth, middle_aged, senior) versus a binary split ({youth, middle_aged} and {senior})]
Gini Index
For the attribute age the possible subsets are:
{youth, middle_aged, senior},
{youth, middle_aged}, {youth, senior},
{middle_aged, senior}, {youth},
{middle_aged}, {senior} and {}.
We exclude the full set and the empty set, because neither produces a split.
So we have to examine 2^v – 2 subsets, where v is the number of distinct values.
Calculate the Gini index on each subset.
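The subset enumeration can be sketched as follows. The slide's formula is not recoverable, so this assumes the usual CART definition, Gini(D) = 1 - Σ pi^2, with a binary split scored by the size-weighted Gini of its two sides. The (yes, no) counts per age value are tallied from the example table, and all names are illustrative.

```python
from itertools import combinations

# (yes, no) class counts for each value of age
AGE_COUNTS = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}


def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_binary_split(counts_by_value, left_values):
    """Weighted Gini of the split: left_values versus all other values."""
    left, right = [0, 0], [0, 0]
    for value, (yes, no) in counts_by_value.items():
        side = left if value in left_values else right
        side[0] += yes
        side[1] += no
    n_left, n_right = sum(left), sum(right)
    total = n_left + n_right
    return n_left / total * gini(left) + n_right / total * gini(right)


values = list(AGE_COUNTS)
# every subset except the empty set and the full set: 2**v - 2 of them
subsets = [frozenset(c) for r in range(1, len(values)) for c in combinations(values, r)]
scores = {s: gini_binary_split(AGE_COUNTS, s) for s in subsets}
best = min(scores, key=scores.get)
for subset, score in scores.items():
    print(sorted(subset), round(score, 4))
print("best:", sorted(best))
```

Each subset and its complement describe the same binary split, so only half of the 2^v – 2 candidates are actually distinct.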
Miscellaneous thoughts
• Widely applicable to data exploration,
classification and scoring tasks
• Generate understandable rules
• Better for predicting discrete outcomes than
continuous (lumpy)
• Error-prone when the number of training examples for a
class is small
• Most business cases try to predict a few
broad categories
Big picture of data mining

More Related Content

Similar to Big picture of data mining

Unify engauge practise
Unify engauge practiseUnify engauge practise
Unify engauge practiseplus plus
 
Discipline By Ms Chandla
Discipline By Ms ChandlaDiscipline By Ms Chandla
Discipline By Ms Chandlakulachihansraj
 
Sandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolSandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolElizabethNugent8
 
PT3 English Model 1 (Q)
PT3 English Model 1 (Q)PT3 English Model 1 (Q)
PT3 English Model 1 (Q)Miz Malinz
 
super exercise module english pt3 question with answer
super exercise module english pt3 question with answersuper exercise module english pt3 question with answer
super exercise module english pt3 question with answerSurryaraj Poobalan
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...Aida Cunha
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxhopeaustin33688
 
Valley Youth Center Presentation
Valley Youth Center PresentationValley Youth Center Presentation
Valley Youth Center PresentationChuck Horton
 
Interview question and answer
Interview question and answerInterview question and answer
Interview question and answerRakibul Islam
 

Similar to Big picture of data mining (14)

Unify engauge practise
Unify engauge practiseUnify engauge practise
Unify engauge practise
 
Discipline By Ms Chandla
Discipline By Ms ChandlaDiscipline By Ms Chandla
Discipline By Ms Chandla
 
Sample kesselus
Sample kesselusSample kesselus
Sample kesselus
 
Manager vs. leader
Manager vs. leaderManager vs. leader
Manager vs. leader
 
Sandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolSandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High School
 
PT3 English Model 1 (Q)
PT3 English Model 1 (Q)PT3 English Model 1 (Q)
PT3 English Model 1 (Q)
 
super exercise module english pt3 question with answer
super exercise module english pt3 question with answersuper exercise module english pt3 question with answer
super exercise module english pt3 question with answer
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
 
Talent Essay
Talent EssayTalent Essay
Talent Essay
 
Udgam Matters March - April 2017
Udgam Matters March - April 2017Udgam Matters March - April 2017
Udgam Matters March - April 2017
 
S3 profile sheet
S3 profile sheetS3 profile sheet
S3 profile sheet
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
 
Valley Youth Center Presentation
Valley Youth Center PresentationValley Youth Center Presentation
Valley Youth Center Presentation
 
Interview question and answer
Interview question and answerInterview question and answer
Interview question and answer
 

More from Fraboni Ec

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreadingFraboni Ec
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreadingFraboni Ec
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceFraboni Ec
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksFraboni Ec
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheFraboni Ec
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsFraboni Ec
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonFraboni Ec
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesFraboni Ec
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsFraboni Ec
 
Abstraction file
Abstraction fileAbstraction file
Abstraction fileFraboni Ec
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisFraboni Ec
 
Abstract class
Abstract classAbstract class
Abstract classFraboni Ec
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaFraboni Ec
 

More from Fraboni Ec (20)

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreading
 
Lisp
LispLisp
Lisp
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreading
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object model
Object modelObject model
Object model
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Abstract class
Abstract classAbstract class
Abstract class
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Inheritance
InheritanceInheritance
Inheritance
 
Api crash
Api crashApi crash
Api crash
 

Recently uploaded

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 

Recently uploaded (20)

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 

Big picture of data mining

  • 1. Decision Trees in the Big Picture • Classification (vs. Rule Pattern Discovery) • Supervised Learning (vs. Unsupervised) • Inductive • Generation (vs. Discrimination)
  • 2. Example age income veteran college_ educated support_ hillary youth low no no no youth low yes no no middle_aged low no no yes senior low no no yes senior medium no yes no senior medium yes no yes middle_aged medium no yes no youth low no yes no youth low no yes no senior high no yes yes youth low no no no middle_aged high no yes no middle_aged medium yes yes yes senior high no yes no
  • 3. Example age income veteran college_ educated support_ hillary youth low no no no youth low yes no no middle_aged low no no yes senior low no no yes senior medium no yes no senior medium yes no yes middle_aged medium no yes no youth low no yes no youth low no yes no senior high no yes yes youth low no no no middle_aged high no yes no middle_aged medium yes yes yes senior high no yes no Class- labels
  • 4. Example: classify the unlabeled tuple age=middle_aged, income=medium, veteran=no, college_educated=no, support_hillary=?????
      [Figure: decision tree. Root: age. youth → no. middle_aged → income (low → yes, medium → yes, high → no). senior → college_educated (branches no and yes, with one "yes" leaf and one "no" leaf).]
      Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.
  • 5. Example: the same tuple (age=middle_aged, income=medium, veteran=no, college_educated=no) followed down the same tree. ANSWER: support_hillary = yes (predicted).
      Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.
  • 6. Example (same tree as slide 4). Induced Rules:
      - The youth do not support Hillary.
      - All who are middle-aged and low-income support Hillary.
      - Seniors support Hillary.
      - Etc… A rule is generated for each leaf.
  • 7. Example
      Induced Rules:
      - The youth do not support Hillary.
      - All who are middle-aged and low-income support Hillary.
      - Seniors support Hillary.
      Nested IF-THEN:
      IF age == youth THEN support_hillary = no
      ELSE IF age == middle_aged & income == low THEN support_hillary = yes
      ELSE IF age == senior THEN support_hillary = yes
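The nested IF-THEN above maps directly onto code. A minimal Python sketch, assuming only the three rules listed on the slide; the function name `predict_support` is hypothetical, and cases the slide leaves under "Etc…" return `None` rather than a guessed label:

```python
def predict_support(age, income):
    """Apply the three induced rules from the slide, in order."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    # The slide does not spell out the remaining rules, so nothing is guessed.
    return None


print(predict_support("youth", "low"))
print(predict_support("middle_aged", "low"))
print(predict_support("senior", "high"))
```

Each rule corresponds to one root-to-leaf path, which is why a full rule set has one rule per leaf.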
  • 8. How do you construct one? 1. Select an attribute to place at the root node and make one branch for each possible value.
      [Figure: the entire training set (14 tuples) split on age into youth (5 tuples), middle_aged (4 tuples) and senior (5 tuples).]
  • 9. How do you construct one? 2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one used in the ancestor nodes. If at any time all the training examples have the same class, stop processing that part of the tree.
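The two construction steps can be sketched as a short recursive function. This is an illustrative sketch, not the deck's own code: the names `build_tree`, `majority` and `first_attribute` are hypothetical, the attribute chooser is a placeholder that simply takes the first unused attribute, and branches are only created for values that actually occur in the subset.

```python
from collections import Counter


def majority(labels):
    """Most common class label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, attributes, target, choose):
    """Grow a tree as nested dicts; a leaf is just a class label."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # all examples share one class: stop here
        return labels[0]
    if not attributes:             # nothing left to split on: majority vote
        return majority(labels)
    attr = choose(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]  # ancestors are off-limits
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        branches[value] = build_tree(subset, remaining, target, choose)
    return {"attribute": attr, "branches": branches}


def first_attribute(rows, attributes, target):
    """Placeholder chooser (assumption): always take the first attribute."""
    return attributes[0]


# The four age=middle_aged tuples from the training table.
MIDDLE_AGED = [
    {"income": "low", "veteran": "no", "college_educated": "no", "support_hillary": "yes"},
    {"income": "medium", "veteran": "no", "college_educated": "yes", "support_hillary": "no"},
    {"income": "high", "veteran": "no", "college_educated": "yes", "support_hillary": "no"},
    {"income": "medium", "veteran": "yes", "college_educated": "yes", "support_hillary": "yes"},
]

tree = build_tree(MIDDLE_AGED, ["veteran", "college_educated"],
                  "support_hillary", first_attribute)
print(tree)
```

Swapping `first_attribute` for a purity-based chooser changes which tree is grown without changing the recursion itself.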
  • 10. How do you construct one? Subset where age=youth:
      age    income  veteran  college_educated  support_hillary
      youth  low     no       no                no
      youth  low     yes      no                no
      youth  low     no       yes               no
      youth  low     no       yes               no
      youth  low     no       no                no
      Every tuple is "no", so the youth branch becomes a leaf. Tree so far: age → (youth: no; middle_aged: ?; senior: ?)
  • 11. How do you construct one? Subset where age=middle_aged:
      age          income  veteran  college_educated  support_hillary
      middle_aged  low     no       no                yes
      middle_aged  medium  no       yes               no
      middle_aged  high    no       yes               no
      middle_aged  medium  yes      yes               yes
      Split on veteran. Tree so far: age → (youth: no; middle_aged: veteran → (yes: ?; no: ?); senior: ?)
  • 12. (Same age=middle_aged subset as slide 11.) The only veteran is a "yes". Tree so far: age → (youth: no; middle_aged: veteran → (yes: yes; no: ?); senior: ?)
  • 13. (Same age=middle_aged subset as slide 11.) The veteran=no branch is split on college_educated. Tree so far: age → (youth: no; middle_aged: veteran → (yes: yes; no: college_educated → (yes: ?; no: ?)); senior: ?)
  • 14. Subset where age=middle_aged and veteran=no:
      age          income  veteran  college_educated  support_hillary
      middle_aged  low     no       no                yes
      middle_aged  medium  no       yes               no
      middle_aged  high    no       yes               no
      Tree so far: age → (youth: no; middle_aged: veteran → (yes: yes; no: college_educated → (yes: no; no: ?)); senior: ?)
  • 15. (Same subset as slide 14.) Tree so far: age → (youth: no; middle_aged: veteran → (yes: yes; no: college_educated → (yes: no; no: yes)); senior: ?)
  • 16. Subset where age=senior (the youth and middle_aged branches are finished as on slide 15):
      age     income  veteran  college_educated  support_hillary
      senior  low     no       no                yes
      senior  medium  no       yes               no
      senior  medium  yes      no                yes
      senior  high    no       yes               yes
      senior  high    no       yes               no
  • 17. (Same age=senior subset as slide 16.) Split on college_educated. Senior branch so far: college_educated → (yes: ?; no: ?)
  • 18. Subset where age=senior and college_educated=yes:
      age     income  veteran  college_educated  support_hillary
      senior  medium  no       yes               no
      senior  high    no       yes               yes
      senior  high    no       yes               no
      Split on income. Senior branch so far: college_educated → (yes: income → (low: ?; medium: ?; high: ?); no: ?)
  • 19. (Same subset as slide 18.) No low-income college-educated seniors…
  • 20. (Same subset as slide 18.) No low-income college-educated seniors… so the low branch is labeled by "Majority Vote" over the subset: no. Senior branch so far: college_educated → (yes: income → (low: no; medium: ?; high: ?); no: ?)
  • 21. Subset where age=senior, college_educated=yes and income=medium:
      age     income  veteran  college_educated  support_hillary
      senior  medium  no       yes               no
  • 22. (Same subset as slide 21.) Senior branch so far: college_educated → (yes: income → (low: no; medium: no; high: ?); no: ?)
  • 23. Subset where age=senior, college_educated=yes and income=high:
      age     income  veteran  college_educated  support_hillary
      senior  high    no       yes               yes
      senior  high    no       yes               no
  • 24. (Same subset as slide 23.) The only attribute left is veteran. Senior branch so far: college_educated → (yes: income → (low: no; medium: no; high: veteran → (yes: ?; no: ?)); no: ?)
  • 25. (Same subset as slide 23.) "Majority Vote" split… No Veterans: the veteran=yes branch is empty and the veteran=no branch is a 1-1 tie, so both leaves stay ???.
  • 26. Subset where age=senior and college_educated=no:
      age     income  veteran  college_educated  support_hillary
      senior  low     no       no                yes
      senior  medium  yes      no                yes
  • 27. (Same subset as slide 26.) Both tuples are "yes", so the college_educated=no branch becomes the leaf yes.
  • 28. Finished tree: age → (youth: no; middle_aged: veteran → (yes: yes; no: college_educated → (yes: no; no: yes)); senior: college_educated → (yes: income → (low: no; medium: no; high: veteran → (yes: ???; no: ???)); no: yes))
  • 29. Cost to grow? With n = number of attributes and D = training set of tuples: O(n * |D| * log|D|)
  • 30. Cost to grow? O(n * |D| * log|D|), where n * |D| is the amount of work at each tree level and log|D| is the max height of the tree.
  • 31. How do we minimize the cost?
      - Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
  • 32. How do we minimize the cost?
      - Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
      - Need a heuristic to pick the "best" attribute to split on.
  • 33. [Figure: the finished tree again.] age → (youth: no; middle_aged: veteran → (yes: yes; no: college_educated → (yes: no; no: yes)); senior: college_educated → (yes: income → (low: no; medium: no; high: veteran → (yes: ???; no: ???)); no: yes))
  • 34. How do we minimize the cost?
      - Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
      - Most common approach is "greedy".
      - Need a heuristic to pick the "best" attribute to split on.
      - The "best" attribute results in the "purest" split. Pure = all tuples belong to the same class.
  • 35. …A good split increases the purity of all child nodes.
  • 36. Three Heuristics 1. Information gain 2. Gain Ratio 3. Gini Index
  • 37. Information Gain
      - Ross Quinlan's ID3 (Iterative Dichotomiser 3) uses info gain as its heuristic.
      - Heuristic based on Claude Shannon's information theory.
  • 39. Calculate Entropy for D
      D = training set; |D| = 14
      m = number of classes; m = 2
      i = 1,…,m
      C_i = distinct class; C_1 = yes, C_2 = no
      C_i,D = tuples in D of class C_i; C_1,D = the yes tuples, C_2,D = the no tuples
      p_i = probability that a random tuple in D belongs to class C_i = |C_i,D|/|D|; p_1 = 5/14, p_2 = 9/14
  • 40. Entropy(D) = -[5/14 * log2(5/14) + 9/14 * log2(9/14)]
                 = -[0.3571 * (-1.4854) + 0.6428 * (-0.6374)]
                 = -[(-0.5304) + (-0.4097)]
                 ≈ 0.940 bits
      Extremes:
      -[7/14 * log2(7/14) + 7/14 * log2(7/14)] = 1 bit
      -[1/14 * log2(1/14) + 13/14 * log2(13/14)] = 0.3712 bits
      -[0/14 * log2(0/14) + 14/14 * log2(14/14)] = 0 bits (taking 0 * log2(0) as 0)
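The entropy arithmetic is easy to check in code. A minimal sketch, assuming the usual Shannon definition Entropy(D) = -Σ p_i * log2(p_i) and treating 0 * log2(0) as 0; the function name `entropy` and its count-list argument are choices made for this example. At full precision the training set comes out near 0.9403 bits; the slide's .9400 reflects rounded intermediate terms.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    # Classes with zero tuples are skipped, i.e. 0 * log2(0) is treated as 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


print(round(entropy([5, 9]), 4))   # training set: 5 yes, 9 no
print(round(entropy([7, 7]), 4))   # evenly mixed: the maximum for two classes
print(round(entropy([1, 13]), 4))  # nearly pure
print(round(entropy([0, 14]), 4))  # pure
```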
  • 41. Entropy for D split by A
      A = attribute to split D on; e.g. age
      v = distinct values of A; e.g. youth, middle_aged, senior
      j = 1,…,v
      D_j = subset of D where A = j; e.g. all tuples where age=youth
  • 42. Entropy_age(D) = 5/14 * -[0/5*log2(0/5) + 5/5*log2(5/5)] + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)] + 5/14 * -[3/5*log2(3/5) + 2/5*log2(2/5)] = .6324 bits
      Entropy_income(D) = 7/14 * -[2/7*log2(2/7) + 5/7*log2(5/7)] + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)] + 3/14 * -[1/3*log2(1/3) + 2/3*log2(2/3)] = .9140 bits
      Entropy_veteran(D) = 3/14 * -[2/3*log2(2/3) + 1/3*log2(1/3)] + 11/14 * -[3/11*log2(3/11) + 8/11*log2(8/11)] = .8609 bits
      Entropy_college_educated(D) = 8/14 * -[6/8*log2(6/8) + 2/8*log2(2/8)] + 6/14 * -[3/6*log2(3/6) + 3/6*log2(3/6)] = .8921 bits
  • 43. Information Gain: Gain(A) = Entropy(D) - Entropy_A(D)
      Entropy(D) is computed on the set of tuples D; Entropy_A(D) on the subsets of D split on attribute A.
      Gain(A) is how much the split decreases entropy. Choose the A with the highest Gain.
  • 44. Gain(A) = Entropy(D) - Entropy_A(D)
      Gain(age) = Entropy(D) - Entropy_age(D) = .9400 - .6324 = .3076 bits
      Gain(income) = .0259 bits
      Gain(veteran) = .0790 bits
      Gain(college_educated) = .0479 bits
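The split entropies and gains can be recomputed from the 14 training tuples. This is a sketch under the definitions on the slides; the helper names `entropy`, `split_entropy` and `gain` are made up for illustration. Because it carries full precision, its results can differ from the hand-rounded slide figures in the fourth decimal place.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["age", "income", "veteran", "college_educated", "support_hillary"]
DATA = """
youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split_entropy(rows, attribute, target="support_hillary"):
    """Weighted entropy of the partitions produced by splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def gain(rows, attribute, target="support_hillary"):
    """Information gain: entropy before the split minus entropy after it."""
    return entropy([r[target] for r in rows]) - split_entropy(rows, attribute, target)


for attribute in COLUMNS[:-1]:
    print(attribute, round(split_entropy(ROWS, attribute), 4), round(gain(ROWS, attribute), 4))

best = max(COLUMNS[:-1], key=lambda a: gain(ROWS, a))
print("split on:", best)
```

Age has by far the largest gain, which is why it ends up at the root.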
  • 45. Entropy with more than two class values
      Entropy = -[7/13*log2(7/13) + 2/13*log2(2/13) + 2/13*log2(2/13) + 2/13*log2(2/13)] = 1.7272 bits
      Entropy = -[5/13*log2(5/13) + 1/13*log2(1/13) + 6/13*log2(6/13) + 1/13*log2(1/13)] = 1.6143 bits
  • 46. Added social security number attribute (ss):
      ss           age          income  veteran  college_educated  support_hillary
      215-98-9343  youth        low     no       no                no
      238-34-3493  youth        low     yes      no                no
      234-28-2434  middle_aged  low     no       no                yes
      243-24-2343  senior       low     no       no                yes
      634-35-2345  senior       medium  no       yes               no
      553-32-2323  senior       medium  yes      no                yes
      554-23-4324  middle_aged  medium  no       yes               no
      523-43-2343  youth        low     no       yes               no
      553-23-1223  youth        low     no       yes               no
      344-23-2321  senior       high    no       yes               yes
      212-23-1232  youth        low     no       no                no
      112-12-4521  middle_aged  high    no       yes               no
      423-13-3425  middle_aged  medium  yes      yes               yes
      423-53-4817  senior       high    no       yes               no
  • 48. [Figure: ss at the root with 14 branches, 215-98-9343 … 423-53-4817, each ending in a single-tuple leaf.]
      Will Information Gain split on ss? Yes, because Entropy_ss(D) = 0.
      *Entropy_ss(D) = 14 * (1/14) * -[1/1*log2(1/1) + 0/1*log2(0/1)] = 0
  • 49. Gain ratio
      - C4.5, a successor of ID3, uses this heuristic.
      - Attempts to overcome Information Gain's bias in favor of attributes with a large number of values.
  • 51. Gain ratio
      Gain(ss) = .9400, SplitInfo_ss(D) = 3.8074, GainRatio(ss) = .2469
      Gain(age) = .3076, SplitInfo_age(D) = 1.5774, GainRatio(age) = .1950
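The slide that defines SplitInfo is not part of this transcript, so the sketch below assumes the usual C4.5 definitions: SplitInfo_A(D) = -Σ |D_j|/|D| * log2(|D_j|/|D|) and GainRatio(A) = Gain(A) / SplitInfo_A(D). The function names are hypothetical, and the gain values are the ones quoted on the slides. Under that definition, 14 single-tuple partitions give SplitInfo = log2(14), about 3.807, and the 5/4/5 age partitions give about 1.577.

```python
from math import log2


def split_info(partition_sizes):
    """Entropy of the partition sizes themselves (assumed C4.5 definition)."""
    total = sum(partition_sizes)
    return sum(-(n / total) * log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """Information gain normalised by how finely the attribute splits D."""
    return gain / split_info(partition_sizes)


SS_SIZES = [1] * 14      # every social security number is its own partition
AGE_SIZES = [5, 4, 5]    # youth, middle_aged, senior

print(round(split_info(SS_SIZES), 4), round(gain_ratio(0.9400, SS_SIZES), 4))
print(round(split_info(AGE_SIZES), 4), round(gain_ratio(0.3076, AGE_SIZES), 4))
```

The penalty shrinks the advantage of ss considerably, although on this small data set ss still scores higher than age.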
  • 52. Gini Index
      - CART uses this heuristic.
      - Binary splits.
      - Not biased toward multi-valued attributes the way Info Gain is.
      [Figure: a multiway split on age (youth / middle_aged / senior) next to a binary split on age ({youth, middle_aged} / {senior}).]
  • 53. Gini Index: For the attribute age the possible subsets are {youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior}, {middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}. We exclude the full set and the empty set, since neither of them splits the data. So we have to examine 2^v - 2 subsets, where v is the number of distinct values (here 2^3 - 2 = 6).
  • 54. Gini Index: (same 2^v - 2 candidate subsets of age as slide 53.) CALCULATE THE GINI INDEX ON EACH SUBSET.
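To make "calculate the Gini index on each subset" concrete, here is a sketch that assumes the standard CART definitions, since the slide with the formula is not in this transcript: Gini(D) = 1 - Σ p_i^2, and for a binary split Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2). The helper names are hypothetical; the per-value class counts are read off the training table.

```python
from itertools import combinations

# (yes, no) class counts for each age value, taken from the training table.
AGE_COUNTS = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}


def gini(counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def candidate_subsets(values):
    """All subsets except the empty set and the full set: 2**v - 2 of them."""
    values = sorted(values)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yield set(subset)


def gini_split(subset, counts_by_value):
    """Weighted Gini impurity of the binary split 'in subset' vs. 'not in subset'."""
    inside = [sum(c[i] for v, c in counts_by_value.items() if v in subset) for i in range(2)]
    outside = [sum(c[i] for v, c in counts_by_value.items() if v not in subset) for i in range(2)]
    n = sum(inside) + sum(outside)
    return sum(inside) / n * gini(inside) + sum(outside) / n * gini(outside)


subsets = list(candidate_subsets(AGE_COUNTS))
for s in subsets:
    print(sorted(s), round(gini_split(s, AGE_COUNTS), 4))

best = min(subsets, key=lambda s: gini_split(s, AGE_COUNTS))
print("lowest Gini index:", sorted(best))
```

A subset and its complement describe the same binary split, so the six subsets of age yield only three distinct splits; the one with the lowest Gini index is the one to use.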
  • 56. Miscellaneous thoughts
      - Widely applicable to data exploration, classification and scoring tasks
      - Generate understandable rules
      - Better for predicting discrete outcomes than continuous ones (predictions are lumpy)
      - Error-prone when the number of training examples for a class is small
      - Most business cases try to predict a few broad categories