Decision Trees in the Big Picture
• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)
Example
age          income  veteran  college_educated  support_hillary
youth        low     no       no                no
youth        low     yes      no                no
middle_aged  low     no       no                yes
senior       low     no       no                yes
senior       medium  no       yes               no
senior       medium  yes      no                yes
middle_aged  medium  no       yes               no
youth        low     no       yes               no
youth        low     no       yes               no
senior       high    no       yes               yes
youth        low     no       no                no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
senior       high    no       yes               no
Class labels: the values in the last column, support_hillary.
Example
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                ?????
[Figure: decision tree rooted at age (branches youth, middle_aged, senior), with inner nodes income (branches low, medium, high) and college_educated (branches no, yes), and leaves labeled yes or no.]
Inner nodes are ATTRIBUTES
Branches are attribute VALUES
Leaves are class-label VALUES
Example
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                yes (predicted)
ANSWER: following the branches that match the tuple's attribute values leads to a leaf labeled yes.
Example
Induced Rules:
The youth do not support Hillary.
All who are middle-aged and low-income support Hillary.
Seniors support Hillary.
Etc… A rule is generated for each leaf.
Nested IF-THEN:
IF age == youth
THEN support_hillary = no
ELSE IF age == middle_aged & income == low
THEN support_hillary = yes
ELSE IF age == senior
THEN support_hillary = yes
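The nested IF-THEN above maps almost line for line onto code. The following is a minimal Python sketch of it; the function name predict_support and the None fallback are illustrative assumptions, since the slide spells out only three rules and notes that one rule exists per leaf.

```python
def predict_support(age, income):
    """Apply the three induced rules listed on the slide, in order."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    # Assumption: the slide lists no rule for the remaining cases,
    # so this sketch reports "unknown" instead of guessing.
    return None


print(predict_support("youth", "low"))        # no
print(predict_support("middle_aged", "low"))  # yes
print(predict_support("middle_aged", "high")) # None
```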
How do you construct one?
1. Select an attribute to place at the root node
and make one branch for each possible value.
[Figure: the root node age splits the entire training set (14 tuples) into three branches: youth (5 tuples), middle_aged (4 tuples), senior (5 tuples).]
2. For each branch, recursively process the
remaining training examples by choosing an
attribute to split them on. The chosen
attribute cannot be one used in the ancestor
nodes. If at any time all the training examples
in a branch have the same class, stop processing
that part of the tree.
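The two construction steps can be sketched as a short recursive function. This is an illustration under stated assumptions, not the deck's own code: the nested-dict tree representation, the helper names, and the placeholder chooser first_attribute (which just takes the first unused attribute) are all hypothetical. Empty branches take the parent's majority class, and ties are broken arbitrarily here.

```python
from collections import Counter

ATTRS = ["age", "income", "veteran", "college_educated"]
LABEL = "support_hillary"

# The 14 training tuples from the Example slide.
RAW = """
youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no
"""
ROWS = [dict(zip(ATTRS + [LABEL], line.split()))
        for line in RAW.strip().splitlines()]
DOMAINS = {a: sorted({r[a] for r in ROWS}) for a in ATTRS}


def majority(rows):
    """Most common class label (ties go to the label seen first)."""
    return Counter(r[LABEL] for r in rows).most_common(1)[0][0]


def build_tree(rows, attrs, choose):
    labels = {r[LABEL] for r in rows}
    if len(labels) == 1:      # all examples share one class: stop
        return labels.pop()
    if not attrs:             # no unused attribute left: vote
        return majority(rows)
    best = choose(rows, attrs)
    rest = [a for a in attrs if a != best]   # never reuse an ancestor
    branches = {}
    for value in DOMAINS[best]:              # one branch per value
        subset = [r for r in rows if r[best] == value]
        branches[value] = (build_tree(subset, rest, choose)
                           if subset else majority(rows))
    return {best: branches}


def first_attribute(rows, attrs):
    """Placeholder chooser (assumption): take the first unused attribute."""
    return attrs[0]


tree = build_tree(ROWS, ATTRS, first_attribute)
print(tree["age"]["youth"])
```

Swapping first_attribute for a purity-based chooser changes which tree is grown without touching the recursion.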
Tuples with age=youth:
age    income  veteran  college_educated  support_hillary
youth  low     no       no                no
youth  low     yes      no                no
youth  low     no       yes               no
youth  low     no       yes               no
youth  low     no       no                no
[Figure: tree so far. Root age; the youth branch ends in the leaf no; the middle_aged and senior branches are not yet processed.]
Tuples with age=middle_aged:
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
[Figure: tree so far. The middle_aged branch is split on veteran, with branches yes and no.]
[Figure: tree so far. Under middle_aged, the veteran=yes branch ends in the leaf yes. The veteran=no branch is split next on college_educated, with branches yes and no.]
Tuples with age=middle_aged and veteran=no:
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
[Figure: tree so far. Under middle_aged, veteran=no, the college_educated=yes branch ends in the leaf no.]
[Figure: tree so far. Under middle_aged, veteran=no, the college_educated=no branch ends in the leaf yes.]
Tuples with age=senior:
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  no       yes               no
senior  medium  yes      no                yes
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: tree so far. The senior branch is split on college_educated, with branches yes and no.]
Tuples with age=senior and college_educated=yes:
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: tree so far. Under senior, the college_educated=yes branch is split on income, with branches low, medium, and high.]
No low-income college-educated seniors…
“Majority Vote”: the empty income=low branch takes the majority class of the college-educated seniors (no, yes, no), which is no.
Tuples with age=senior, college_educated=yes and income=medium:
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
[Figure: tree so far. The income=medium branch ends in the leaf no.]
Tuples with age=senior, college_educated=yes and income=high:
age     income  veteran  college_educated  support_hillary
senior  high    no       yes               yes
senior  high    no       yes               no
[Figure: tree so far. The income=high branch is split on veteran, the only attribute left, with branches yes and no.]
“Majority Vote” split… No Veterans: the veteran=yes branch is empty, and the two veteran=no tuples disagree (one yes, one no), so the vote is tied and both leaves stay ??? ???.
Tuples with age=senior and college_educated=no:
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  yes      no                yes
[Figure: tree so far. Under senior, the college_educated=no branch ends in the leaf yes.]
[Figure: finished tree]
age
  youth: no
  middle_aged: veteran
    yes: yes
    no: college_educated
      yes: no
      no: yes
  senior: college_educated
    no: yes
    yes: income
      low: no
      medium: no
      high: veteran
        yes: ???
        no: ???
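Classifying a tuple means walking the tree from the root and following the branch that matches each attribute value until a leaf is reached. The sketch below encodes the tree grown in the walkthrough above as nested dictionaries. The encoding, the name classify, and the use of None for the undecided ??? leaves are assumptions made for illustration.

```python
# Tree from the walkthrough: {attribute: {value: subtree_or_leaf}}.
# None stands for the "???" leaves that the tied vote left undecided.
TREE = {"age": {
    "youth": "no",
    "middle_aged": {"veteran": {
        "yes": "yes",
        "no": {"college_educated": {"yes": "no", "no": "yes"}},
    }},
    "senior": {"college_educated": {
        "no": "yes",
        "yes": {"income": {
            "low": "no",
            "medium": "no",
            "high": {"veteran": {"yes": None, "no": None}},
        }},
    }},
}}


def classify(tree, record):
    """Follow the branches matching record until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[record[attribute]]
    return node


query = {"age": "middle_aged", "income": "medium",
         "veteran": "no", "college_educated": "no"}
print(classify(TREE, query))  # yes
```

For the unlabeled tuple from the earlier Example slide (middle_aged, medium, no, no) this tree also predicts yes.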
Cost to grow?
n = number of Attributes
D = Training Set of tuples
O( n * |D| * log|D| )
n * |D| = amount of work at each tree level
log|D| = max height of the tree
How do we minimize the cost?
• Constructing an optimal decision tree is NP-complete
(shown by Hyafil and Rivest)
• Most common approach is “greedy”
• Need a heuristic to pick the “best” attribute to split
on.
• “Best” attribute results in the “purest” split
Pure = all tuples belong to the same class
A good split increases the purity of all children nodes
Three Heuristics
1. Information gain
2. Gain Ratio
3. Gini Index
Information Gain
• Ross Quinlan’s ID3 (Iterative Dichotomiser 3)
uses info gain as its heuristic.
• Heuristic based on Claude Shannon’s
information theory.
[Figure: HIGH ENTROPY vs. LOW ENTROPY]
Calculate Entropy for D
D = Training Set                              |D| = 14
m = num. of classes                           m = 2
i = 1,…,m
Ci = distinct class                           C1 = yes, C2 = no
Ci,D = tuples in D of class Ci                C1,D = yes tuples, C2,D = no tuples
pi = prob. a random tuple in D
     belongs to class Ci = |Ci,D|/|D|         p1 = 5/14, p2 = 9/14
Entropy(D) = -[ p1 * log(p1) + … + pm * log(pm) ]   (log is base 2, so the unit is bits)
= -[ 5/14 * log(5/14) + 9/14 * log(9/14)]
= -[ .3571 * -1.4854 + .6428 * -.6374]
= -[ -.5304 + -.4097] = .9400 bits
Extremes (0 * log(0) is taken as 0):
= -[ 7/14 * log(7/14) + 7/14 * log(7/14)] = 1 bit
= -[ 1/14 * log(1/14) + 13/14 * log(13/14)] = .3712 bits
= -[ 0/14 * log(0/14) + 14/14 * log(14/14)] = 0 bits
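The entropy arithmetic above is easy to check in a few lines of Python. This is a minimal sketch, assuming base-2 logarithms (the results are in bits) and the usual convention that a class with zero tuples contributes nothing; the function name entropy is illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:              # convention: 0 * log2(0) counts as 0
            p = count / total
            result -= p * math.log2(p)
    return result


# Class counts (yes, no) for the training set and the three extremes.
for counts in ([5, 9], [7, 7], [1, 13], [0, 14]):
    print(counts, round(entropy(counts), 4))
```

At full precision the training-set entropy is about .9403 bits; the slide's .9400 comes from its rounded intermediate values.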
Entropy for D split by A
A = attribute to split D on                   E.g. age
v = number of distinct values of A            E.g. 3 (youth, middle_aged, senior)
j = 1,…,v
Dj = subset of D where A has its j-th value   E.g. all tuples where age=youth
Entropy_A(D) = |D1|/|D| * Entropy(D1) + … + |Dv|/|D| * Entropy(Dv)
Entropy_age(D) = 5/14 * -[0/5*log(0/5) + 5/5*log(5/5)]
+ 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
+ 5/14 * -[3/5*log(3/5) + 2/5*log(2/5)]
= .6324 bits
Entropy_income(D) = 7/14 * -[2/7*log(2/7) + 5/7*log(5/7)]
+ 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
+ 3/14 * -[1/3*log(1/3) + 2/3*log(2/3)]
= .9140 bits
Entropy_veteran(D) = 3/14 * -[2/3*log(2/3) + 1/3*log(1/3)]
+ 11/14 * -[3/11*log(3/11) + 8/11*log(8/11)]
= .8609 bits
Entropy_college_educated(D) = 8/14 * -[6/8*log(6/8) + 2/8*log(2/8)]
+ 6/14 * -[3/6*log(3/6) + 3/6*log(3/6)]
= .8921 bits
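The four weighted sums above can be reproduced from the per-value class counts that appear in the fractions. The sketch below is illustrative: the names SPLITS and split_entropy are assumptions, and each pair is (yes count, no count) for one attribute value.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def split_entropy(partitions):
    """Size-weighted average entropy of the subsets D1..Dv."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)


# (yes, no) counts per attribute value, read off the slide's fractions.
SPLITS = {
    "age": {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)},
    "income": {"low": (2, 5), "medium": (2, 2), "high": (1, 2)},
    "veteran": {"yes": (2, 1), "no": (3, 8)},
    "college_educated": {"yes": (2, 6), "no": (3, 3)},
}

for attribute, by_value in SPLITS.items():
    print(attribute, round(split_entropy(list(by_value.values())), 4))
```

The results agree with the slide's figures to within about .0001; the small differences come from rounding on the slide.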
Information Gain
Gain(A) = Entropy(D) - Entropy_A(D)
Entropy(D): entropy of the set of tuples D
Entropy_A(D): entropy of D after it is split on attribute A
Choose the A with the highest Gain, i.e. the split that
decreases Entropy the most.
Gain(age) = Entropy(D) - Entropy_age(D)
= .9400 - .6324 = .3076 bits
Gain(income) = .9400 - .9140 = .0260 bits
Gain(veteran) = .9400 - .8609 = .0791 bits
Gain(college_educated) = .9400 - .8921 = .0479 bits
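Picking the split attribute then comes down to taking the maximum gain. The following sketch recomputes the gains from class counts and selects the best attribute; the names SPLITS, gain, and best are illustrative assumptions, and each pair is (yes count, no count).

```python
import math


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain(partitions):
    """Entropy of the whole set minus the weighted entropy after the split."""
    whole = [sum(column) for column in zip(*partitions)]  # overall (yes, no)
    total = sum(whole)
    after = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(whole) - after


# (yes, no) counts per attribute value for the 14 training tuples.
SPLITS = {
    "age": [(0, 5), (2, 2), (3, 2)],
    "income": [(2, 5), (2, 2), (1, 2)],
    "veteran": [(2, 1), (3, 8)],
    "college_educated": [(2, 6), (3, 3)],
}

gains = {attribute: gain(parts) for attribute, parts in SPLITS.items()}
best = max(gains, key=gains.get)
print(best, round(gains[best], 4))
```

With these counts age has by far the largest gain, which is why it sits at the root of the tree.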
Entropy with more than 2 classes
Entropy = -[7/13*log(7/13) + 2/13*log(2/13) + 2/13*log(2/13) + 2/13*log(2/13)]
= 1.7272 bits
Entropy = -[5/13*log(5/13) + 1/13*log(1/13) + 6/13*log(6/13) + 1/13*log(1/13)]
= 1.6143 bits
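The same formula handles any number of classes. As a hedged check of the two four-class results above, the sketch below also prints log2(m), the upper bound that entropy reaches when all m classes are equally likely; the function name entropy is illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy in bits for any number of classes."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


first = entropy([7, 2, 2, 2])    # class counts out of 13 tuples
second = entropy([5, 1, 6, 1])
upper_bound = math.log2(4)       # 4 equally likely classes give 2 bits

print(round(first, 4), round(second, 4), upper_bound)
```

With more than two classes the entropy can exceed 1 bit, as both results do.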
Added social security number (ss) attribute
ss           age          income  veteran  college_educated  support_hillary
215-98-9343  youth        low     no       no                no
238-34-3493  youth        low     yes      no                no
234-28-2434  middle_aged  low     no       no                yes
243-24-2343  senior       low     no       no                yes
634-35-2345  senior       medium  no       yes               no
553-32-2323  senior       medium  yes      no                yes
554-23-4324  middle_aged  medium  no       yes               no
523-43-2343  youth        low     no       yes               no
553-23-1223  youth        low     no       yes               no
344-23-2321  senior       high    no       yes               yes
212-23-1232  youth        low     no       no                no
112-12-4521  middle_aged  high    no       yes               no
423-13-3425  middle_aged  medium  yes      yes               yes
423-53-4817  senior       high    no       yes               no
Will Information Gain split on ss?
[Figure: a root node ss with 14 branches, 215-98-9343 … 423-53-4817, each ending in a single-tuple leaf labeled yes or no.]
Yes, because Entropy_ss(D) = 0.
*Entropy_ss(D) = 14 * 1/14 * -[1/1*log(1/1) + 0/1*log(0/1)] = 0
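A quick check makes the problem concrete: a unique identifier puts every tuple in its own pure subset, so its gain equals the full entropy of D and no real attribute can beat it. The sketch below assumes the class labels in table order; the variable names are illustrative.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# support_hillary labels of the 14 tuples, in table order.
labels = "no no yes yes no yes no no no yes no no yes no".split()

entropy_d = entropy(list(Counter(labels).values()))

# Splitting on ss gives 14 subsets holding exactly one tuple each.
subsets = [[label] for label in labels]
entropy_ss = sum(len(s) / len(labels) * entropy(list(Counter(s).values()))
                 for s in subsets)

gain_ss = entropy_d - entropy_ss
print(round(entropy_ss, 4), round(gain_ss, 4))
```

The ss split is useless for prediction, since a new person's number matches no branch.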
Gain ratio
• C4.5, a successor of ID3, uses this heuristic.
• Attempts to overcome Information Gain’s bias
in favor of attributes with a large number of
values.
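Gain ratio
SplitInfo_A(D) = -[ |D1|/|D| * log(|D1|/|D|) + … + |Dv|/|D| * log(|Dv|/|D|) ]
GainRatio(A) = Gain(A) / SplitInfo_A(D)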
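Gain(ss) = .9400
SplitInfo_ss(D) = -14 * [1/14 * log(1/14)] = 3.8074
GainRatio(ss) = .9400 / 3.8074 = .2469
Gain(age) = .3076
SplitInfo_age(D) = -[5/14*log(5/14) + 4/14*log(4/14) + 5/14*log(5/14)] = 1.5774
GainRatio(age) = .3076 / 1.5774 = .1950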
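The gain ratio figures can be recomputed from the subset sizes alone. This is a sketch assuming the usual C4.5 definition, SplitInfo_A(D) = -Σ |Dj|/|D| * log2(|Dj|/|D|), with GainRatio(A) = Gain(A) / SplitInfo_A(D); the function names are illustrative. Under this definition the 14-way ss split gives log2(14) ≈ 3.8074, and the 5/4/5 age split gives ≈ 1.5774.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def split_info(sizes):
    """Entropy of the partition sizes themselves (not of the classes)."""
    return entropy(sizes)


def gain_ratio(gain, sizes):
    return gain / split_info(sizes)


entropy_d = entropy([5, 9])                      # 5 yes, 9 no
gain_ss = entropy_d                              # every ss subset is pure
gain_age = entropy_d - (4 / 14 * entropy([2, 2]) + 5 / 14 * entropy([3, 2]))

ratio_ss = gain_ratio(gain_ss, [1] * 14)         # 14 subsets of size 1
ratio_age = gain_ratio(gain_age, [5, 4, 5])      # youth, middle_aged, senior

print(round(split_info([1] * 14), 4), round(ratio_ss, 4))
print(round(split_info([5, 4, 5]), 4), round(ratio_age, 4))
```

Dividing by SplitInfo narrows the gap between ss and age considerably, although on this small data set ss still scores slightly higher.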
Gini Index
• CART uses this heuristic.
• Binary splits.
• Not biased toward multi-valued attributes the way
Info Gain is.
[Figure: the three-way split of age (youth, middle_aged, senior) is replaced by a binary split, e.g. {youth, middle_aged} vs. {senior}.]
Gini Index
For the attribute age the possible subsets are:
{youth, middle_aged, senior},
{youth, middle_aged}, {youth, senior},
{middle_aged, senior}, {youth},
{middle_aged}, {senior} and {}.
We exclude the full set and the empty set, because neither one splits the data.
So we have to examine 2^v - 2 subsets (here 2^3 - 2 = 6).
CALCULATE GINI INDEX
ON EACH SUBSET
Gini Index
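The formula on this slide did not survive extraction, so the sketch below assumes the standard CART definition: Gini(D) = 1 - Σ pi^2, and for a binary split the size-weighted average of the two halves. It enumerates the candidate subsets of age and scores each one; the names COUNTS, gini, and gini_split are illustrative, and each pair is (yes count, no count).

```python
from itertools import combinations

# (yes, no) class counts for each value of age in the training set.
COUNTS = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(subset):
    """Weighted Gini of the binary split: values in subset vs. the rest."""
    inside = [COUNTS[v] for v in COUNTS if v in subset]
    outside = [COUNTS[v] for v in COUNTS if v not in subset]
    d1 = [sum(column) for column in zip(*inside)]
    d2 = [sum(column) for column in zip(*outside)]
    n = sum(d1) + sum(d2)
    return sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)


values = list(COUNTS)
# Every non-empty proper subset: sizes 1 .. v-1, i.e. 2^v - 2 of them.
subsets = [s for size in range(1, len(values))
           for s in combinations(values, size)]
scores = {s: gini_split(s) for s in subsets}
best = min(scores, key=scores.get)

for subset, score in scores.items():
    print(subset, round(score, 4))
print("best:", best)
```

A subset and its complement describe the same binary split, so the six subsets yield only three distinct scores; the lowest belongs to {youth} vs. {middle_aged, senior}.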
Miscellaneous thoughts
• Widely applicable to data exploration,
classification and scoring tasks
• Generate understandable rules
• Better for predicting discrete outcomes than
continuous (lumpy)
• Error-prone when # of training examples for a
class is small
• Most business cases try to predict a few
broad categories
Big picture of data mining

More Related Content

Similar to Big picture of data mining

Growth mindset dhs conf
Growth mindset dhs confGrowth mindset dhs conf
Growth mindset dhs conf
shaunallison
 
Quantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October IiQuantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October Ii
Dr. Trilok Kumar Jain
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
Aida Cunha
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
hopeaustin33688
 

Similar to Big picture of data mining (19)

Growth mindset dhs conf
Growth mindset dhs confGrowth mindset dhs conf
Growth mindset dhs conf
 
Quantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October IiQuantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October Ii
 
Mindset
MindsetMindset
Mindset
 
Connect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capableConnect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capable
 
Addie Builds a House
Addie Builds a HouseAddie Builds a House
Addie Builds a House
 
Unify engauge practise
Unify engauge practiseUnify engauge practise
Unify engauge practise
 
Discipline By Ms Chandla
Discipline By Ms ChandlaDiscipline By Ms Chandla
Discipline By Ms Chandla
 
Sample kesselus
Sample kesselusSample kesselus
Sample kesselus
 
Manager vs. leader
Manager vs. leaderManager vs. leader
Manager vs. leader
 
Sandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolSandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High School
 
PT3 English Model 1 (Q)
PT3 English Model 1 (Q)PT3 English Model 1 (Q)
PT3 English Model 1 (Q)
 
super exercise module english pt3 question with answer
super exercise module english pt3 question with answersuper exercise module english pt3 question with answer
super exercise module english pt3 question with answer
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
 
Talent Essay
Talent EssayTalent Essay
Talent Essay
 
Udgam Matters March - April 2017
Udgam Matters March - April 2017Udgam Matters March - April 2017
Udgam Matters March - April 2017
 
S3 profile sheet
S3 profile sheetS3 profile sheet
S3 profile sheet
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
 
Valley Youth Center Presentation
Valley Youth Center PresentationValley Youth Center Presentation
Valley Youth Center Presentation
 
Interview question and answer
Interview question and answerInterview question and answer
Interview question and answer
 

More from Hoang Nguyen

More from Hoang Nguyen (20)

Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your site
 
How to build a rest api
How to build a rest apiHow to build a rest api
How to build a rest api
 
Api crash
Api crashApi crash
Api crash
 
Smm and caching
Smm and cachingSmm and caching
Smm and caching
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Cache recap
Cache recapCache recap
Cache recap
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friend
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python basics
Python basicsPython basics
Python basics
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Learning python
Learning pythonLearning python
Learning python
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Object oriented programming using c++
Object oriented programming using c++Object oriented programming using c++
Object oriented programming using c++
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Object model
Object modelObject model
Object model
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Big picture of data mining

  • 1. Decision Trees in the Big Picture • Classification (vs. Rule Pattern Discovery) • Supervised Learning (vs. Unsupervised) • Inductive • Generation (vs. Discrimination)
  • 2. Example age income veteran college_ educated support_ hillary youth low no no no youth low yes no no middle_aged low no no yes senior low no no yes senior medium no yes no senior medium yes no yes middle_aged medium no yes no youth low no yes no youth low no yes no senior high no yes yes youth low no no no middle_aged high no yes no middle_aged medium yes yes yes senior high no yes no
  • 3. Example age income veteran college_ educated support_ hillary youth low no no no youth low yes no no middle_aged low no no yes senior low no no yes senior medium no yes no senior medium yes no yes middle_aged medium no yes no youth low no yes no youth low no yes no senior high no yes yes youth low no no no middle_aged high no yes no middle_aged medium yes yes yes senior high no yes no Class- labels
  • 4. Example age income veteran college_ educated support_ hillary middle_aged medium no no ????? no age youth middle_aged college_ educated income yes yes low medium high no senior no yes noyes Inner nodes are ATTRIBUTES Branches are attribute VALUES Leaves are class-label VALUES
  • 5. Example age income veteran college_ educated support_ hillary middle_aged medium no no yes (predicted) no age youth middle_aged college_ educated income yes yes low medium high no senior no yes noyes Inner nodes are ATTRIBUTES Branches are attribute VALUES Leaves are class-label VALUES ANSWER
  • 6. Example no age youth middle_aged college_ educated income yes yes low medium high no senior no yes noyes Induced Rules: The youth do not support Hillary. All who are middle- aged and low-income support Hillary. Seniors support Hillary. Etc…A rule is generated for each leaf.
  • 7. Example Induced Rules: The youth do not support Hillary. All who are middle-aged and low-income support Hillary. Seniors support Hillary. Nested IF-THEN: IF age == youth THEN support_hillary = no ELSE IF age == middle_aged & income == low THEN support_hillary = yes ELSE IF age = senior THEN support_hillary = yes
  • 8. How do you construct one? 1. Select an attribute to place at the root node and make one branch for each possible value. 14 tuples; Entire Training Set 5 tuples 4 tuples 5 tuples age youth middle_aged senior
  • 9. How do you construct one? 2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one used in the ancestor nodes. If at anytime all the training examples have the same class, stop processing that part of the tree.
  • 10. How do you construct one? age=youth Income veteran college_ educated support_ hillary youth low no no no youth low yes no no youth low no yes no youth low no yes no youth low no no no no age youth middle_aged senior
  • 11. How do you construct one? age= middle_aged income veteran college_ educated supports_ hillary middle_aged low no no yes middle_aged medium no yes no middle_aged high no yes no middle_aged medium yes yes yes no veteran age youth middle_aged senior yes no
  • 12. no veteran age youth middle_aged senior yes yes no age= middle_aged income veteran college_ educated supports_ hillary middle_aged low no no yes middle_aged medium no yes no middle_aged high no yes no middle_aged medium yes yes yes
  • 13. no veteran age youth middle_aged senior yes yes no age= middle_aged income veteran college_ educated supports_ hillary middle_aged low no no yes middle_aged medium no yes no middle_aged high no yes no middle_aged medium yes yes yes college_ educated yes no
  • 14. age= middle_aged income veteran=no college_ educated supports_ hillary middle_aged low no no yes middle_aged medium no yes no middle_aged high no yes no no veteran age youth middle_aged yes yes no college_ educated yes no senior no
  • 15. age= middle_aged income veteran=no college_ educated supports_ hillary middle_aged low no no yes middle_aged medium no yes no middle_aged high no yes no no veteran age youth middle_aged yes yes no college_ educated yes no senior no yes
  • 16. no veteran ageyouth middle_aged yes yes no college_ educated yes no senior no yes age=senior income veteran college_ educated supports_ hillary senior low no no yes senior medium no yes no senior medium yes no yes senior high no yes yes senior high no yes no
  • 17. no veteran ageyouth middle_aged yes yes no college_ educated yes no senior no yes age=senior income veteran college_ educated supports_ hillary senior low no no yes senior medium no yes no senior medium yes no yes senior high no yes yes senior high no yes no college_ educated yes no
  • 18. no veteran age youth middle_aged yes yes no college_ educated yes no senior no yes college_ educated yes no age=senior income veteran college_ educated=yes supports_ hillary senior medium no yes no senior high no yes yes senior high no yes no income low medium high
  • 19. [Tree diagram, in progress: root age; youth → no; middle_aged → subtree testing veteran and college_educated; senior → college_educated (yes / no).] Partition for age=senior, college_educated=yes, listed as (income, veteran, supports_hillary): (medium, no, no), (high, no, yes), (high, no, no). Split on income (low / medium / high). There are no low-income college-educated seniors…
  • 20. [Same diagram.] The empty income=low branch is labeled by “Majority Vote” over the partition: no.
  • 21. [Same diagram.] Partition for age=senior, college_educated=yes, income=medium, listed as (veteran, supports_hillary): (no, no).
  • 22. [Same diagram.] The income=medium branch becomes the leaf no.
  • 23. [Same diagram.] Partition for age=senior, college_educated=yes, income=high, listed as (veteran, supports_hillary): (no, yes), (no, no).
  • 24. [Same diagram.] The income=high branch is split on veteran (yes / no).
  • 25. [Same diagram.] There are no veterans in this partition and the “Majority Vote” is split, so both veteran leaves are ???.
  • 26. [Same diagram.] Partition for age=senior, college_educated=no, listed as (income, veteran, supports_hillary): (low, no, yes), (medium, yes, yes).
  • 27. [Same diagram.] Both tuples are yes, so the college_educated=no branch becomes the leaf yes.
  • 28. [Tree diagram, complete except for the two ??? leaves under veteran.]
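The growth steps on slides 19 to 28 can be sketched as a small recursive function. This is only an illustrative sketch, not the slides' own code: the names `grow`, `majority`, `DOMAINS` and `SENIOR_COLLEGE` are made up here, attributes are simply taken in the order given (no heuristic yet), and a tied vote is reported as `???` as on slide 25.

```python
from collections import Counter

def majority(labels):
    """Most common label, or '???' when the top two labels are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "???"
    return ranked[0][0]

def grow(rows, attrs, domains, target):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}.

    Assumes rows is non-empty.
    """
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # pure partition: make a leaf
        return labels[0]
    if not attrs:                        # nothing left to split on: vote
        return majority(labels)
    attr, rest = attrs[0], attrs[1:]     # placeholder choice, no heuristic
    children = {}
    for value in domains[attr]:
        subset = [r for r in rows if r[attr] == value]
        # An empty branch is labeled by the majority vote of this partition.
        children[value] = (grow(subset, rest, domains, target)
                           if subset else majority(labels))
    return {attr: children}

DOMAINS = {"income": ["low", "medium", "high"], "veteran": ["yes", "no"]}

# The age=senior, college_educated=yes partition from the slides.
SENIOR_COLLEGE = [
    {"income": "medium", "veteran": "no", "support_hillary": "no"},
    {"income": "high", "veteran": "no", "support_hillary": "yes"},
    {"income": "high", "veteran": "no", "support_hillary": "no"},
]

tree = grow(SENIOR_COLLEGE, ["income", "veteran"], DOMAINS, "support_hillary")
print(tree)
```

On this partition the sketch labels the empty income=low branch by majority vote, turns income=medium into a leaf, and leaves both veteran branches under income=high undecided.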
  • 29. Cost to grow? n = number of attributes, D = training set of tuples. Cost: O( n * |D| * log|D| )
  • 30. Cost to grow? n = number of attributes, D = training set of tuples. Cost: O( n * |D| * log|D| ), where n * |D| is the amount of work at each tree level and log|D| is the max height of the tree.
  • 31. How do we minimize the cost? • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest).
  • 32. How do we minimize the cost? • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest). • We need a heuristic to pick the “best” attribute to split on.
  • 33. [Tree diagram repeated: complete except for the two ??? leaves under veteran.]
  • 34. How do we minimize the cost? • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest). • The most common approach is “greedy”. • We need a heuristic to pick the “best” attribute to split on. • The “best” attribute results in the “purest” split. Pure = all tuples belong to the same class.
  • 35. A good split increases the purity of all child nodes.
  • 36. Three Heuristics 1. Information gain 2. Gain Ratio 3. Gini Index
  • 37. Information Gain • Ross Quinlan’s ID3 (Iterative Dichotomiser 3) uses information gain as its heuristic. • The heuristic is based on Claude Shannon’s information theory.
  • 39. Calculate Entropy for D. D = training set (|D| = 14). m = number of classes (m = 2). i = 1,…,m. Ci = distinct class (C1 = yes, C2 = no). Ci,D = set of tuples in D of class Ci (C1,D = the yes tuples, C2,D = the no tuples). pi = probability that a random tuple in D belongs to class Ci = |Ci,D|/|D| (p1 = 5/14, p2 = 9/14).
  • 40. Entropy(D) = -[ 5/14 * log2(5/14) + 9/14 * log2(9/14) ] = -[ .3571 * -1.4854 + .6428 * -.6374 ] = -[ -.5304 + -.4097 ] ≈ .9400 bits. Extremes: -[ 7/14 * log2(7/14) + 7/14 * log2(7/14) ] = 1 bit; -[ 1/14 * log2(1/14) + 13/14 * log2(13/14) ] = .3712 bits; -[ 0/14 * log2(0/14) + 14/14 * log2(14/14) ] = 0 bits (taking 0 * log2(0) = 0).
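The entropy arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and the choice to pass class counts (rather than tuples) are assumptions made for illustration. Zero counts are skipped, which applies the usual convention that 0 * log2(0) = 0.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([5, 9]), 4))    # the training set: 5 yes, 9 no
print(round(entropy([7, 7]), 4))    # evenly mixed
print(round(entropy([1, 13]), 4))   # nearly pure
print(round(entropy([0, 14]), 4))   # pure
```

Without the intermediate rounding used on the slide, the first value comes out a few ten-thousandths higher than .9400.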
  • 41. Entropy for D split by A. A = attribute to split D on (e.g. age). v = number of distinct values of A (e.g. youth, middle_aged, senior, so v = 3). j = 1,…,v. Dj = subset of D where A takes its j-th value (e.g. all tuples where age=youth).
  • 42. Entropy of D after splitting on each attribute:
      Entropy_age(D) = 5/14 * -[0/5*log2(0/5) + 5/5*log2(5/5)] + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)] + 5/14 * -[3/5*log2(3/5) + 2/5*log2(2/5)] = .6324 bits
      Entropy_income(D) = 7/14 * -[2/7*log2(2/7) + 5/7*log2(5/7)] + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)] + 3/14 * -[1/3*log2(1/3) + 2/3*log2(2/3)] = .9140 bits
      Entropy_veteran(D) = 3/14 * -[2/3*log2(2/3) + 1/3*log2(1/3)] + 11/14 * -[3/11*log2(3/11) + 8/11*log2(8/11)] = .8609 bits
      Entropy_college_educated(D) = 8/14 * -[6/8*log2(6/8) + 2/8*log2(2/8)] + 6/14 * -[3/6*log2(3/6) + 3/6*log2(3/6)] = .8921 bits
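The weighted sums above follow one pattern: weight each partition's entropy by its share of the tuples. A minimal sketch, assuming each attribute is summarized as a list of (yes, no) counts per value; the names `split_entropy` and `PARTITIONS` are illustrative only.

```python
import math

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def split_entropy(partitions):
    """Entropy of D after a split: sum over j of |Dj|/|D| * Entropy(Dj)."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

# (yes, no) counts per attribute value, read off the 14 training tuples.
PARTITIONS = {
    "age": [(0, 5), (2, 2), (3, 2)],            # youth, middle_aged, senior
    "income": [(2, 5), (2, 2), (1, 2)],         # low, medium, high
    "veteran": [(2, 1), (3, 8)],                # yes, no
    "college_educated": [(2, 6), (3, 3)],       # yes, no
}

for name, parts in PARTITIONS.items():
    print(name, round(split_entropy(parts), 4))
```

The printed values agree with the slide to about three decimal places; small differences in the fourth decimal come from rounding of intermediate terms.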
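  • 43. Information Gain: Gain(A) = Entropy(D) - Entropy_A(D), where Entropy(D) is the entropy of the set of tuples D and Entropy_A(D) is the entropy of the subsets of D split on attribute A. Choose the A with the highest Gain, i.e. the split that decreases entropy the most.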
  • 44. Gain(A) = Entropy(D) - Entropy_A(D). Gain(age) = Entropy(D) - Entropy_age(D) = .9400 - .6324 = .3076 bits. Gain(income) = .0259 bits. Gain(veteran) = .0790 bits. Gain(college_educated) = .0479 bits.
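The whole selection step can also be run directly on the 14 training tuples. This is a rough sketch rather than ID3 itself: `ROWS`, `COLUMNS`, `entropy` and `gain` are names invented for the example, and the class label is assumed to be the last field of each tuple.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "veteran", "college_educated", "support_hillary")
ROWS = [
    ("youth", "low", "no", "no", "no"),
    ("youth", "low", "yes", "no", "no"),
    ("middle_aged", "low", "no", "no", "yes"),
    ("senior", "low", "no", "no", "yes"),
    ("senior", "medium", "no", "yes", "no"),
    ("senior", "medium", "yes", "no", "yes"),
    ("middle_aged", "medium", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("senior", "high", "no", "yes", "yes"),
    ("youth", "low", "no", "no", "no"),
    ("middle_aged", "high", "no", "yes", "no"),
    ("middle_aged", "medium", "yes", "yes", "yes"),
    ("senior", "high", "no", "yes", "no"),
]

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Gain(A) = Entropy(D) - Entropy_A(D) for the attribute in column col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - after

gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
print(best, {k: round(v, 4) for k, v in gains.items()})
```

The attribute with the highest gain is the one chosen for the split at the root.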
  • 45. Entropy with more than 2 class values. Entropy = -[7/13*log2(7/13) + 2/13*log2(2/13) + 2/13*log2(2/13) + 2/13*log2(2/13)] = 1.7272 bits. Entropy = -[5/13*log2(5/13) + 1/13*log2(1/13) + 6/13*log2(6/13) + 1/13*log2(1/13)] = 1.6143 bits.
  • 46. Added social security number (ss) attribute:
      ss           age          income  veteran  college_educated  support_hillary
      215-98-9343  youth        low     no       no                no
      238-34-3493  youth        low     yes      no                no
      234-28-2434  middle_aged  low     no       no                yes
      243-24-2343  senior       low     no       no                yes
      634-35-2345  senior       medium  no       yes               no
      553-32-2323  senior       medium  yes      no                yes
      554-23-4324  middle_aged  medium  no       yes               no
      523-43-2343  youth        low     no       yes               no
      553-23-1223  youth        low     no       yes               no
      344-23-2321  senior       high    no       yes               yes
      212-23-1232  youth        low     no       no                no
      112-12-4521  middle_aged  high    no       yes               no
      423-13-3425  middle_aged  medium  yes      yes               yes
      423-53-4817  senior       high    no       yes               no
  • 48. [Diagram: ss at the root with 14 branches, 215-98-9343 … 423-53-4817, each leading to a single-tuple yes/no leaf.] Will Information Gain split on ss? Yes, because Entropy_ss(D) = 0, so Gain(ss) is as large as it can be. Entropy_ss(D) = 14 * 1/14 * -[1/1*log2(1/1) + 0/1*log2(0/1)] = 0.
  • 49. Gain ratio • C4.5, a successor of ID3, uses this heuristic. • It attempts to overcome Information Gain’s bias in favor of attributes with a large number of values.
  • 51. Gain ratio. Gain(ss) = .9400, SplitInfo_ss(D) = 3.8074, GainRatio(ss) = .2469. Gain(age) = .3076, SplitInfo_age(D) = 1.5774, GainRatio(age) = .1950.
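A short sketch of the gain ratio arithmetic, assuming the usual C4.5 definitions, which are not spelled out in this transcript: SplitInfo_A(D) = -sum over j of |Dj|/|D| * log2(|Dj|/|D|), and GainRatio(A) = Gain(A) / SplitInfo_A(D). The function names are illustrative, and the gains are the rounded values from the slides.

```python
import math

def split_info(sizes):
    """SplitInfo_A(D) in bits, from the partition sizes |D1|, ..., |Dv|."""
    total = sum(sizes)
    return sum(-(s / total) * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(sizes)

ss_sizes = [1] * 14      # ss: 14 partitions of one tuple each
age_sizes = [5, 4, 5]    # age: youth, middle_aged, senior

# For 14 singleton partitions SplitInfo reduces to log2(14).
print(round(split_info(ss_sizes), 4), round(split_info(age_sizes), 4))
print(round(gain_ratio(0.9400, ss_sizes), 4), round(gain_ratio(0.3076, age_sizes), 4))
```

Dividing by SplitInfo shrinks the score of ss far more than that of age, which is the intended correction for many-valued attributes, although on this data ss still scores higher than age.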
  • 52. Gini Index • CART uses this heuristic. • Binary splits. • Unlike Info Gain, not biased toward multi-valued attributes. [Diagram: a three-way split of age (youth / middle_aged / senior) next to a binary split of age ({youth, middle_aged} / {senior}).]
  • 53. Gini Index. For the attribute age the possible subsets are: {youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior}, {middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}. We exclude the full set and the empty set, since neither one splits the data. So we have to examine 2^v - 2 subsets, where v is the number of distinct values (here 2^3 - 2 = 6).
  • 54. Gini Index. Calculate the Gini index on each of these subsets.
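The subset enumeration can be sketched as follows, assuming the standard CART definitions, which do not appear in this transcript: Gini(D) = 1 - sum over i of pi^2, and for a binary split into D1 and D2, Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2). The names `AGE`, `binary_splits` and `gini_split` are made up for the example.

```python
from itertools import combinations

# (yes, no) counts for each value of age in the 14 training tuples.
AGE = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}

def gini(counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def binary_splits(values):
    """Yield every subset except the empty set and the full set."""
    values = sorted(values)
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            yield set(subset)

def gini_split(counts_by_value, subset):
    """Weighted Gini of the binary split: values in subset vs. the rest."""
    inside = [sum(c) for c in zip(*(counts_by_value[v] for v in subset))]
    outside = [sum(c) for c in zip(*(cnt for v, cnt in counts_by_value.items()
                                     if v not in subset))]
    total = sum(inside) + sum(outside)
    return (sum(inside) / total * gini(inside)
            + sum(outside) / total * gini(outside))

scores = [(gini_split(AGE, s), s) for s in binary_splits(AGE)]
best_score, best_subset = min(scores, key=lambda pair: pair[0])
print(len(scores), best_subset, round(best_score, 4))
```

Each subset and its complement describe the same binary split, so the six subsets produce only three distinct scores; the split with the lowest weighted Gini is the one to prefer.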
  • 56. Miscellaneous thoughts • Widely applicable to data exploration, classification and scoring tasks • Generate understandable rules • Better at predicting discrete outcomes than continuous ones (predictions are lumpy) • Error-prone when the number of training examples for a class is small • Most business cases try to predict a few broad categories