Decision Trees in the Big Picture
• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)
Example
age          income  veteran  college_educated  support_hillary
youth        low     no       no                no
youth        low     yes      no                no
middle_aged  low     no       no                yes
senior       low     no       no                yes
senior       medium  no       yes               no
senior       medium  yes      no                yes
middle_aged  medium  no       yes               no
youth        low     no       yes               no
youth        low     no       yes               no
senior       high    no       yes               yes
youth        low     no       no                no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
senior       high    no       yes               no
The last column, support_hillary, holds the class labels.
Example
Tuple to classify:
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                ?????
[Figure: decision tree rooted at age. The youth branch ends in the leaf no; the middle_aged branch splits on income (low, medium, high); the senior branch splits on college_educated (no, yes).]
Inner nodes are ATTRIBUTES
Branches are attribute VALUES
Leaves are class-label VALUES
ANSWER
Following the branches age = middle_aged and income = medium reaches a leaf:
age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                yes (predicted)
Example
Induced Rules:
• The youth do not support Hillary.
• All who are middle-aged and low-income support Hillary.
• Seniors support Hillary.
• Etc. A rule is generated for each leaf.
Nested IF-THEN:
IF age == youth
  THEN support_hillary = no
ELSE IF age == middle_aged & income == low
  THEN support_hillary = yes
ELSE IF age == senior
  THEN support_hillary = yes
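The nested IF-THEN above maps directly onto ordinary conditionals. A minimal Python sketch follows; the function name `predict_support` is hypothetical, and only the three rules spelled out on the slide are encoded. The rules the slide elides with "Etc" are not guessed, so the function returns `None` for any tuple they would cover.

```python
from typing import Optional


def predict_support(age: str, income: str) -> Optional[str]:
    """Apply the three induced rules from the slide, in order."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    # The slide does not list the remaining rules, so nothing is predicted here.
    return None


print(predict_support("youth", "low"))
print(predict_support("middle_aged", "low"))
print(predict_support("middle_aged", "high"))
```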
How do you construct one?
1. Select an attribute to place at the root node and make one branch for each possible value.
[Figure: the entire training set (14 tuples) is split at the root on age into youth (5 tuples), middle_aged (4 tuples) and senior (5 tuples).]
How do you construct one?
2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one already used in an ancestor node. If at any time all the training examples in a branch have the same class, stop processing that part of the tree.
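The two construction steps can be sketched as a short recursive function. This is an illustration under stated assumptions, not the deck's implementation: the names `build_tree`, `majority` and `choose` are hypothetical, the default `choose` simply takes the first unused attribute (a heuristic would go there instead), and empty or exhausted branches fall back to a majority vote.

```python
from collections import Counter


def majority(rows, target):
    """Most common class label among rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target, domains,
               choose=lambda rows, attrs: attrs[0]):
    labels = {r[target] for r in rows}
    if len(labels) == 1:              # all examples share one class: stop
        return labels.pop()
    if not attributes:                # nothing left to split on: majority vote
        return majority(rows, target)
    attr = choose(rows, attributes)
    remaining = [a for a in attributes if a != attr]  # not reused in descendants
    node = {"attribute": attr, "branches": {}}
    for value in domains[attr]:       # one branch per possible value
        subset = [r for r in rows if r[attr] == value]
        if subset:
            node["branches"][value] = build_tree(subset, remaining, target,
                                                 domains, choose)
        else:                         # no examples: use the parent's majority
            node["branches"][value] = majority(rows, target)
    return node


# The four middle_aged tuples from the training table.
rows = [
    {"veteran": "no", "college_educated": "no", "support_hillary": "yes"},
    {"veteran": "no", "college_educated": "yes", "support_hillary": "no"},
    {"veteran": "no", "college_educated": "yes", "support_hillary": "no"},
    {"veteran": "yes", "college_educated": "yes", "support_hillary": "yes"},
]
domains = {"veteran": ["yes", "no"], "college_educated": ["yes", "no"]}
tree = build_tree(rows, ["veteran", "college_educated"], "support_hillary", domains)
print(tree)
```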
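How do you construct one?
Subset where age=youth:
age    income  veteran  college_educated  support_hillary
youth  low     no       no                no
youth  low     yes      no                no
youth  low     no       yes               no
youth  low     no       yes               no
youth  low     no       no                no
All five tuples have class no, so the youth branch becomes a leaf:
age
  youth -> no
  middle_aged -> (to be processed)
  senior -> (to be processed)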
How do you construct one?
Subset where age=middle_aged:
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
Split on veteran. The only veteran=yes tuple has class yes, so that branch becomes a leaf.
Subset where age=middle_aged and veteran=no:
age          income  veteran  college_educated  support_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
Split on college_educated. Both college_educated=yes tuples have class no, and the college_educated=no tuple has class yes:
age
  youth -> no
  middle_aged -> veteran
    yes -> yes
    no -> college_educated
      yes -> no
      no -> yes
  senior -> (to be processed)
Subset where age=senior:
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  no       yes               no
senior  medium  yes      no                yes
senior  high    no       yes               yes
senior  high    no       yes               no
The classes are mixed, so split the senior branch on college_educated (yes / no).
Subset where age=senior and college_educated=yes:
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
senior  high    no       yes               yes
senior  high    no       yes               no
Split on income (low, medium, high).
No low-income college-educated seniors… so the low branch is labeled by “Majority Vote” over this subset (2 no, 1 yes): no.
Subset where age=senior, college_educated=yes and income=medium:
age     income  veteran  college_educated  support_hillary
senior  medium  no       yes               no
The single tuple has class no, so the medium branch becomes the leaf no:
age
  youth -> no
  middle_aged -> veteran
    yes -> yes
    no -> college_educated
      yes -> no
      no -> yes
  senior -> college_educated
    yes -> income
      low -> no (“Majority Vote”)
      medium -> no
      high -> (to be processed)
    no -> (to be processed)
Subset where age=senior, college_educated=yes and income=high:
age     income  veteran  college_educated  support_hillary
senior  high    no       yes               yes
senior  high    no       yes               no
The classes differ, so split on veteran (yes / no), the only attribute left.
“Majority Vote” split… There are no veterans in this subset, and the two non-veteran tuples are tied (one yes, one no), so both leaves stay undecided:
      high -> veteran
        yes -> ???
        no -> ???
Subset where age=senior and college_educated=no:
age     income  veteran  college_educated  support_hillary
senior  low     no       no                yes
senior  medium  yes      no                yes
Both tuples have class yes, so this branch becomes the leaf yes. The finished tree:
age
  youth -> no
  middle_aged -> veteran
    yes -> yes
    no -> college_educated
      yes -> no
      no -> yes
  senior -> college_educated
    yes -> income
      low -> no (“Majority Vote”)
      medium -> no
      high -> veteran
        yes -> ???
        no -> ???
    no -> yes
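To show how a finished tree is used for prediction, here is a minimal Python sketch. The nested-dict encoding and the names `TREE` and `classify` are assumptions made for illustration. The tree itself is a reading of the walkthrough above (inner nodes are attributes, keys are attribute values, strings are class labels), with the two undecided "???" leaves stored as `None`.

```python
# Each inner node is a one-key dict: {attribute: {value: subtree_or_label}}.
TREE = {"age": {
    "youth": "no",
    "middle_aged": {"veteran": {
        "yes": "yes",
        "no": {"college_educated": {"yes": "no", "no": "yes"}},
    }},
    "senior": {"college_educated": {
        "no": "yes",
        "yes": {"income": {
            "low": "no",      # no training tuples here; majority vote
            "medium": "no",
            "high": {"veteran": {"yes": None, "no": None}},  # "???" leaves
        }},
    }},
}}


def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while isinstance(tree, dict):
        (attribute, branches), = tree.items()
        tree = branches[record[attribute]]
    return tree


query = {"age": "middle_aged", "income": "medium",
         "veteran": "no", "college_educated": "no"}
print(classify(TREE, query))
```

Note that this tree never tests income on the middle_aged branch, so it reaches its prediction for the query tuple by a different path than the earlier example tree.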
Cost to grow?
n = number of attributes
D = training set of tuples
O( n * |D| * log|D| )
• n * |D| = amount of work at each tree level
• log|D| = max height of the tree
How do we minimize the cost?
• Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)
• Most common approach is “greedy”
• Need a heuristic to pick the “best” attribute to split on.
• The “best” attribute results in the “purest” split
Pure = all tuples belong to the same class
A good split increases the purity of all child nodes
Three Heuristics
1. Information gain
2. Gain Ratio
3. Gini Index
Information Gain
• Ross Quinlan’s ID3 (Iterative Dichotomiser 3) uses info gain as its heuristic.
• Heuristic based on Claude Shannon’s
information theory.
[Figure: a HIGH ENTROPY example next to a LOW ENTROPY example.]
Calculate Entropy for D
D = training set                           |D| = 14
m = number of classes                      m = 2
i = 1,…,m
Ci = distinct class                        C1 = yes, C2 = no
Ci,D = tuples in D of class Ci             C1,D = the yes tuples, C2,D = the no tuples
pi = prob. that a random tuple in D        p1 = 5/14, p2 = 9/14
     belongs to class Ci = |Ci,D|/|D|
Entropy(D) = -SUM(i=1..m) pi * log(pi)     (log is base 2)
= -[ 5/14 * log(5/14) + 9/14 * log(9/14)]
= -[ .3571 * -1.4854 + .6428 * -.6374]
= -[ -.5304 + -.4097] = .9400 bits
Extremes:
= -[ 7/14 * log(7/14) + 7/14 * log(7/14)] = 1 bit
= -[ 1/14 * log(1/14) + 13/14 * log(13/14)] = .3712 bits
= -[ 0/14 * log(0/14) + 14/14 * log(14/14)] = 0 bits     (taking 0 * log(0) = 0)
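The entropy arithmetic above can be checked with a few lines of Python. This is a small sketch; the helper name `entropy` is an assumption. It takes raw class counts, uses base-2 logarithms, and skips zero counts, which matches the convention 0 * log(0) = 0. Results may differ from the slide figures in the fourth decimal place because the slide rounds intermediate values.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    # p * log2(1/p) equals -p * log2(p); zero counts contribute nothing.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


print(round(entropy([5, 9]), 4))    # training set: 5 yes, 9 no
print(round(entropy([7, 7]), 4))    # evenly mixed
print(round(entropy([1, 13]), 4))   # nearly pure
print(round(entropy([0, 14]), 4))   # pure
```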
Entropy for D split by A
A = attribute to split D on                E.g. age
v = number of distinct values of A         E.g. youth, middle_aged, senior (v = 3)
j = 1,…,v
Dj = subset of D where A has its j-th value    E.g. all tuples where age=youth
Entropy_A(D) = SUM(j=1..v) |Dj|/|D| * Entropy(Dj)
Entropy_age(D) = 5/14 * -[0/5*log(0/5) + 5/5*log(5/5)]
               + 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
               + 5/14 * -[3/5*log(3/5) + 2/5*log(2/5)]
               = .6324 bits
Entropy_income(D) = 7/14 * -[2/7*log(2/7) + 5/7*log(5/7)]
                  + 4/14 * -[2/4*log(2/4) + 2/4*log(2/4)]
                  + 3/14 * -[1/3*log(1/3) + 2/3*log(2/3)]
                  = .9140 bits
Entropy_veteran(D) = 3/14 * -[2/3*log(2/3) + 1/3*log(1/3)]
                   + 11/14 * -[3/11*log(3/11) + 8/11*log(8/11)]
                   = .8609 bits
Entropy_college_educated(D) = 8/14 * -[6/8*log(6/8) + 2/8*log(2/8)]
                            + 6/14 * -[3/6*log(3/6) + 3/6*log(3/6)]
                            = .8921 bits
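The four weighted entropies above can be reproduced from the per-value class counts that appear in the formulas. In this sketch the names `split_entropy` and `SPLITS` are assumptions; each partition is a (yes, no) pair of counts read off the 14-tuple training table. Values agree with the slide to about three decimal places.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def split_entropy(partitions):
    """Entropy after a split: each partition's entropy weighted by its size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)


# (yes, no) counts for each value of each attribute.
SPLITS = {
    "age": [(0, 5), (2, 2), (3, 2)],            # youth, middle_aged, senior
    "income": [(2, 5), (2, 2), (1, 2)],         # low, medium, high
    "veteran": [(2, 1), (3, 8)],                # yes, no
    "college_educated": [(2, 6), (3, 3)],       # yes, no
}

for attribute, partitions in SPLITS.items():
    print(attribute, round(split_entropy(partitions), 4))
```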
Information Gain
Gain(A) = Entropy(D) - Entropy_A(D)
  Entropy(D) = entropy of the set of tuples D
  Entropy_A(D) = entropy of D after it is split on attribute A
Choose the A with the highest Gain, i.e. the split that decreases Entropy the most.
Gain(age) = Entropy(D) - Entropy_age(D)
          = .9400 - .6324 = .3076 bits
Gain(income) = .9400 - .9140 = .0260 bits
Gain(veteran) = .9400 - .8609 = .0791 bits
Gain(college_educated) = .9400 - .8921 = .0479 bits
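As a check on the gains above, the sketch below computes Gain(A) for every attribute directly from the 14 training tuples and picks the largest. The names `ROWS`, `COLUMNS` and `gain` are assumptions for illustration; unrounded results differ from the slide figures by a few ten-thousandths.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "veteran", "college_educated", "support_hillary")
ROWS = [
    ("youth", "low", "no", "no", "no"),
    ("youth", "low", "yes", "no", "no"),
    ("middle_aged", "low", "no", "no", "yes"),
    ("senior", "low", "no", "no", "yes"),
    ("senior", "medium", "no", "yes", "no"),
    ("senior", "medium", "yes", "no", "yes"),
    ("middle_aged", "medium", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("youth", "low", "no", "yes", "no"),
    ("senior", "high", "no", "yes", "yes"),
    ("youth", "low", "no", "no", "no"),
    ("middle_aged", "high", "no", "yes", "no"),
    ("middle_aged", "medium", "yes", "yes", "yes"),
    ("senior", "high", "no", "yes", "no"),
]


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())


def gain(rows, col):
    """Entropy(D) minus the size-weighted entropy after splitting on column col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - after


gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
print(best, {name: round(g, 4) for name, g in gains.items()})
```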
Entropy with more than 2 class values
Entropy = -[7/13*log(7/13) + 2/13*log(2/13) + 2/13*log(2/13) + 2/13*log(2/13)]
= 1.7272 bits
Entropy = -[5/13*log(5/13) + 1/13*log(1/13) + 6/13*log(6/13) + 1/13*log(1/13)]
= 1.6143 bits
Added social security number attribute
ss           age          income  veteran  college_educated  support_hillary
215-98-9343  youth        low     no       no                no
238-34-3493  youth        low     yes      no                no
234-28-2434  middle_aged  low     no       no                yes
243-24-2343  senior       low     no       no                yes
634-35-2345  senior       medium  no       yes               no
553-32-2323  senior       medium  yes      no                yes
554-23-4324  middle_aged  medium  no       yes               no
523-43-2343  youth        low     no       yes               no
553-23-1223  youth        low     no       yes               no
344-23-2321  senior       high    no       yes               yes
212-23-1232  youth        low     no       no                no
112-12-4521  middle_aged  high    no       yes               no
423-13-3425  middle_aged  medium  yes      yes               yes
423-53-4817  senior       high    no       yes               no
[Figure: a split on ss produces 14 branches, 215-98-9343 through 423-53-4817, each ending in a single-tuple leaf labeled yes or no.]
Will Information Gain split on ss?
Yes, because Entropy_ss(D) = 0, so Gain(ss) is as large as it can be.
*Entropy_ss(D) = 14 * 1/14 * -[1/1*log(1/1) + 0/1*log(0/1)] = 0
Gain ratio
• C4.5, a successor of ID3, uses this heuristic.
• Attempts to overcome Information Gain’s bias in favor of attributes with a large number of values.
Gain ratio
SplitInfo_A(D) = -SUM(j=1..v) |Dj|/|D| * log(|Dj|/|D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Gain ratio
Gain(ss) = .9400
SplitInfo_ss(D) = log(14) = 3.8074
GainRatio(ss) = .2469
Gain(age) = .3076
SplitInfo_age(D) = 1.5774
GainRatio(age) = .1950
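A short Python sketch of the gain-ratio computation, assuming the usual C4.5 definition: SplitInfo is the entropy of the partition sizes and GainRatio is Gain divided by SplitInfo. The function name `gain_ratio` and the count-based encoding are assumptions. Computed this way, SplitInfo for ss is log2(14) and SplitInfo for age uses the partition sizes 5, 4 and 5.

```python
import math


def entropy(counts):
    """Shannon entropy in bits; zero counts contribute nothing."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def gain_ratio(class_counts, partitions):
    """Gain(A) / SplitInfo_A(D), with partitions given as (yes, no) counts."""
    total = sum(class_counts)
    after = sum(sum(p) / total * entropy(p) for p in partitions)
    gain = entropy(class_counts) - after
    split_info = entropy([sum(p) for p in partitions])  # entropy of partition sizes
    return gain / split_info


CLASSES = (5, 9)                                  # 5 yes, 9 no
AGE = [(0, 5), (2, 2), (3, 2)]                    # youth, middle_aged, senior
SS = [(1, 0)] * 5 + [(0, 1)] * 9                  # one tuple per ss value

print("split info ss: ", round(entropy([sum(p) for p in SS]), 4))
print("split info age:", round(entropy([sum(p) for p in AGE]), 4))
print("gain ratio ss: ", round(gain_ratio(CLASSES, SS), 4))
print("gain ratio age:", round(gain_ratio(CLASSES, AGE), 4))
```

On this data ss still scores higher than age, so the gain ratio reduces the bias toward many-valued attributes without removing it.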
Gini Index
• CART uses this heuristic.
• Binary splits.
• Unlike Info Gain, not biased toward multi-valued attributes.
[Figure: a multiway split of age into youth, middle_aged and senior, next to a binary split of age into {senior} and {youth, middle_aged}.]
Gini Index
For the attribute age the possible subsets are:
{youth, middle_aged, senior},
{youth, middle_aged}, {youth, senior},
{middle_aged, senior}, {youth},
{middle_aged}, {senior} and {}.
We exclude the full set and the empty set, because neither produces a split.
So we have to examine 2^v - 2 subsets, where v is the number of distinct values.
CALCULATE GINI INDEX ON EACH SUBSET
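The subset enumeration can be sketched in Python as follows. This assumes the standard Gini impurity, 1 minus the sum of squared class probabilities, weighted by partition size as in CART; the names `gini`, `binary_splits` and `gini_split` are hypothetical. Because a subset and its complement describe the same binary split, the 2^v - 2 subsets give only 2^(v-1) - 1 distinct splits, which is what the generator yields.

```python
from itertools import combinations


def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def binary_splits(values):
    """Yield (left, right) once per subset/complement pair."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Pinning the first value to the left side avoids mirror-image duplicates;
    # r stops before len(rest) so the left side is never the full set.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            yield left, set(values) - left


def gini_split(counts_by_value, left):
    """Size-weighted Gini impurity of the binary split left vs. everything else."""
    total = sum(sum(c) for c in counts_by_value.values())
    score = 0.0
    for side in (left, set(counts_by_value) - left):
        yes = sum(counts_by_value[v][0] for v in side)
        no = sum(counts_by_value[v][1] for v in side)
        score += (yes + no) / total * gini((yes, no))
    return score


# (yes, no) counts per age value from the training table.
AGE = {"youth": (0, 5), "middle_aged": (2, 2), "senior": (3, 2)}

scores = {frozenset(left): gini_split(AGE, left) for left, _ in binary_splits(AGE)}
best_left = min(scores, key=scores.get)
print(sorted(best_left), "vs", sorted(set(AGE) - best_left), round(scores[best_left], 4))
```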
Miscellaneous thoughts
• Widely applicable to data exploration, classification and scoring tasks
• Generate understandable rules
• Better for predicting discrete outcomes than continuous ones (predictions are lumpy)
• Error-prone when the number of training examples for a class is small
• Most business cases try to predict a few broad categories
Big picture of data mining

More Related Content

Viewers also liked

Polymorphism
PolymorphismPolymorphism
PolymorphismJames Wong
 
Data and assessment
Data and assessmentData and assessment
Data and assessmentJames Wong
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheJames Wong
 
Gm theory
Gm theoryGm theory
Gm theoryJames Wong
 
Api crash
Api crashApi crash
Api crashJames Wong
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningJames Wong
 
Reflection
ReflectionReflection
ReflectionJames Wong
 
How to build a rest api.pptx
How to build a rest api.pptxHow to build a rest api.pptx
How to build a rest api.pptxJames Wong
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonJames Wong
 
Database constraints
Database constraintsDatabase constraints
Database constraintsJames Wong
 
Poo java
Poo javaPoo java
Poo javaJames Wong
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonJames Wong
 

Viewers also liked (13)

Polymorphism
PolymorphismPolymorphism
Polymorphism
 
Data and assessment
Data and assessmentData and assessment
Data and assessment
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Html5
Html5Html5
Html5
 
Gm theory
Gm theoryGm theory
Gm theory
 
Api crash
Api crashApi crash
Api crash
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Reflection
ReflectionReflection
Reflection
 
How to build a rest api.pptx
How to build a rest api.pptxHow to build a rest api.pptx
How to build a rest api.pptx
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Database constraints
Database constraintsDatabase constraints
Database constraints
 
Poo java
Poo javaPoo java
Poo java
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 

Similar to Big picture of data mining

Growth mindset dhs conf
Growth mindset dhs confGrowth mindset dhs conf
Growth mindset dhs confshaunallison
 
Quantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October IiQuantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October IiDr. Trilok Kumar Jain
 
Connect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capableConnect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capableRenee Hoareau
 
Addie Builds a House
Addie Builds a HouseAddie Builds a House
Addie Builds a HouseJim Piechocki
 
Unify engauge practise
Unify engauge practiseUnify engauge practise
Unify engauge practiseplus plus
 
Discipline By Ms Chandla
Discipline By Ms ChandlaDiscipline By Ms Chandla
Discipline By Ms Chandlakulachihansraj
 
Sandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolSandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolElizabethNugent8
 
PT3 English Model 1 (Q)
PT3 English Model 1 (Q)PT3 English Model 1 (Q)
PT3 English Model 1 (Q)Miz Malinz
 
super exercise module english pt3 question with answer
super exercise module english pt3 question with answersuper exercise module english pt3 question with answer
super exercise module english pt3 question with answerSurryaraj Poobalan
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...Aida Cunha
 
S3 profile sheet
S3 profile sheetS3 profile sheet
S3 profile sheetNicholas Green
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxhopeaustin33688
 
Valley Youth Center Presentation
Valley Youth Center PresentationValley Youth Center Presentation
Valley Youth Center PresentationChuck Horton
 
Interview question and answer
Interview question and answerInterview question and answer
Interview question and answerRakibul Islam
 

Similar to Big picture of data mining (19)

Growth mindset dhs conf
Growth mindset dhs confGrowth mindset dhs conf
Growth mindset dhs conf
 
Quantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October IiQuantitative Aptitude And Mathematics 14 October Ii
Quantitative Aptitude And Mathematics 14 October Ii
 
Mindset
MindsetMindset
Mindset
 
Connect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capableConnect with Maths: Advocating for the mathematically highly capable
Connect with Maths: Advocating for the mathematically highly capable
 
Addie Builds a House
Addie Builds a HouseAddie Builds a House
Addie Builds a House
 
Unify engauge practise
Unify engauge practiseUnify engauge practise
Unify engauge practise
 
Discipline By Ms Chandla
Discipline By Ms ChandlaDiscipline By Ms Chandla
Discipline By Ms Chandla
 
Sample kesselus
Sample kesselusSample kesselus
Sample kesselus
 
Manager vs. leader
Manager vs. leaderManager vs. leader
Manager vs. leader
 
Sandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High SchoolSandbox Learning: How To Transition Your Children Into High School
Sandbox Learning: How To Transition Your Children Into High School
 
PT3 English Model 1 (Q)
PT3 English Model 1 (Q)PT3 English Model 1 (Q)
PT3 English Model 1 (Q)
 
super exercise module english pt3 question with answer
super exercise module english pt3 question with answersuper exercise module english pt3 question with answer
super exercise module english pt3 question with answer
 
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
ae-i-t10-test2-recurso-do-professor-testes-20202021-submetido-por-email-para-...
 
Talent Essay
Talent EssayTalent Essay
Talent Essay
 
Udgam Matters March - April 2017
Udgam Matters March - April 2017Udgam Matters March - April 2017
Udgam Matters March - April 2017
 
S3 profile sheet
S3 profile sheetS3 profile sheet
S3 profile sheet
 
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docxO Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
O Z Scores 68 O The Normal Curve 73 O Sample and Popul.docx
 
Valley Youth Center Presentation
Valley Youth Center PresentationValley Youth Center Presentation
Valley Youth Center Presentation
 
Interview question and answer
Interview question and answerInterview question and answer
Interview question and answer
 

More from James Wong

Data race
Data raceData race
Data raceJames Wong
 
Multi threaded rtos
Multi threaded rtosMulti threaded rtos
Multi threaded rtosJames Wong
 
Recursion
RecursionRecursion
RecursionJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Cache recap
Cache recapCache recap
Cache recapJames Wong
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksJames Wong
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceJames Wong
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesJames Wong
 
Abstraction file
Abstraction fileAbstraction file
Abstraction fileJames Wong
 
Object model
Object modelObject model
Object modelJames Wong
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisJames Wong
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaJames Wong
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsJames Wong
 
Inheritance
InheritanceInheritance
InheritanceJames Wong
 
Learning python
Learning pythonLearning python
Learning pythonJames Wong
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonJames Wong
 
Python language data types
Python language data typesPython language data types
Python language data typesJames Wong
 
Smm and caching
Smm and cachingSmm and caching
Smm and cachingJames Wong
 
Python basics
Python basicsPython basics
Python basicsJames Wong
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friendJames Wong
 

More from James Wong (20)

Data race
Data raceData race
Data race
 
Multi threaded rtos
Multi threaded rtosMulti threaded rtos
Multi threaded rtos
 
Recursion
RecursionRecursion
Recursion
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object model
Object modelObject model
Object model
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Inheritance
InheritanceInheritance
Inheritance
 
Learning python
Learning pythonLearning python
Learning python
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Smm and caching
Smm and cachingSmm and caching
Smm and caching
 
Python basics
Python basicsPython basics
Python basics
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friend
 

Recently uploaded

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 

Recently uploaded (20)

Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024

Big picture of data mining

  • 1. Decision Trees in the Big Picture
      • Classification (vs. Rule Pattern Discovery)
      • Supervised Learning (vs. Unsupervised)
      • Inductive
      • Generation (vs. Discrimination)
  • 2. Example
      age          income  veteran  college_educated  support_hillary
      youth        low     no       no                no
      youth        low     yes      no                no
      middle_aged  low     no       no                yes
      senior       low     no       no                yes
      senior       medium  no       yes               no
      senior       medium  yes      no                yes
      middle_aged  medium  no       yes               no
      youth        low     no       yes               no
      youth        low     no       yes               no
      senior       high    no       yes               yes
      youth        low     no       no                no
      middle_aged  high    no       yes               no
      middle_aged  medium  yes      yes               yes
      senior       high    no       yes               no
  • 3. Example: the same 14-tuple table as slide 2, with the last column, support_hillary, marked as the class labels.
  • 4. Example: classify a new tuple
      age          income  veteran  college_educated  support_hillary
      middle_aged  medium  no       no                ?????
    Decision tree: age -> youth: no; middle_aged: income (low: yes, medium: yes, high: no); senior: college_educated (branches no / yes, leaves no and yes).
    Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.
  • 5. Example: ANSWER
      age          income  veteran  college_educated  support_hillary
      middle_aged  medium  no       no                yes (predicted)
    Following the same tree: age = middle_aged leads to the income node, and income = medium leads to the leaf "yes".
    Inner nodes are ATTRIBUTES. Branches are attribute VALUES. Leaves are class-label VALUES.
  • 6. Example: Induced Rules (read off the same tree as slide 4)
      • The youth do not support Hillary.
      • All who are middle-aged and low-income support Hillary.
      • Seniors support Hillary.
      • Etc. A rule is generated for each leaf.
  • 7. Example
    Induced Rules:
      • The youth do not support Hillary.
      • All who are middle-aged and low-income support Hillary.
      • Seniors support Hillary.
    Nested IF-THEN:
      IF age == youth THEN support_hillary = no
      ELSE IF age == middle_aged & income == low THEN support_hillary = yes
      ELSE IF age == senior THEN support_hillary = yes
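The nested IF-THEN on slide 7 translates almost directly into code. The following is a minimal sketch, assuming a hypothetical function name `predict_support` and a `None` return value for tuples that none of the three listed rules cover (the slide does not say what happens in that case).

```python
from typing import Optional


def predict_support(age: str, income: str) -> Optional[str]:
    """Apply the three induced rules from slide 7, in order."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    # Assumption: tuples not covered by the listed rules get no prediction.
    return None


print(predict_support("youth", "high"))        # no
print(predict_support("middle_aged", "low"))   # yes
print(predict_support("middle_aged", "high"))  # None
```

Only three of the tree's leaves are written out as rules on the slide, so a full rule set would need one branch per leaf.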
  • 8. How do you construct one? 1. Select an attribute to place at the root node and make one branch for each possible value. Example: splitting the entire training set (14 tuples) on age gives three branches: youth (5 tuples), middle_aged (4 tuples) and senior (5 tuples).
  • 9. How do you construct one? 2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one used in the ancestor nodes. If at any time all the training examples have the same class, stop processing that part of the tree.
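The two construction steps above can be sketched as a short recursive function. This is an illustrative sketch, not the deck's own code: the names `build_tree`, `majority`, `first_attribute` and the nested-dict tree representation are assumptions, and the attribute chooser simply takes the first unused attribute where a real learner would use a heuristic. Empty branches and exhausted attributes fall back on a majority vote.

```python
from collections import Counter


def majority(rows, target):
    """Most common class label among rows (ties broken by first seen)."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target, domains, choose):
    classes = {r[target] for r in rows}
    if len(classes) == 1:          # all examples share one class: stop
        return classes.pop()
    if not attributes:             # nothing left to split on
        return majority(rows, target)
    attr = choose(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]  # no reuse below this node
    branches = {}
    for value in domains[attr]:    # one branch per possible value
        subset = [r for r in rows if r[attr] == value]
        if subset:
            branches[value] = build_tree(subset, remaining, target, domains, choose)
        else:                      # empty branch: vote among the parent's rows
            branches[value] = majority(rows, target)
    return {attr: branches}


def first_attribute(rows, attributes, target):
    """Placeholder chooser (assumption): take the first unused attribute."""
    return attributes[0]


# The four middle_aged tuples from the training set.
cols = ["income", "veteran", "college_educated", "support_hillary"]
raw = ["low no no yes", "medium no yes no", "high no yes no", "medium yes yes yes"]
rows = [dict(zip(cols, line.split())) for line in raw]
domains = {"income": ["low", "medium", "high"],
           "veteran": ["yes", "no"],
           "college_educated": ["yes", "no"]}

tree = build_tree(rows, ["veteran", "college_educated", "income"],
                  "support_hillary", domains, first_attribute)
print(tree)
```

With this attribute order the sketch splits the middle_aged tuples on veteran first and then on college_educated.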
  • 10. How do you construct one? Subset age = youth:
      income  veteran  college_educated  support_hillary
      low     no       no                no
      low     yes      no                no
      low     no       yes               no
      low     no       yes               no
      low     no       no                no
    All five tuples have class "no", so the youth branch becomes the leaf "no". Tree so far: age -> youth: no; middle_aged and senior still to be processed.
  • 11. How do you construct one? Subset age = middle_aged:
      income  veteran  college_educated  support_hillary
      low     no       no                yes
      medium  no       yes               no
      high    no       yes               no
      medium  yes      yes               yes
    The middle_aged branch is split on veteran (yes / no). Tree so far: age -> youth: no; middle_aged: veteran; senior still to be processed.
  • 12. Same subset (age = middle_aged). The veteran = yes branch holds a single tuple with class "yes", so it becomes the leaf "yes".
  • 13. Same subset (age = middle_aged). The veteran = no branch still mixes classes, so it is split on college_educated (yes / no).
  • 14. Subset age = middle_aged, veteran = no:
      income  veteran  college_educated  support_hillary
      low     no       no                yes
      medium  no       yes               no
      high    no       yes               no
    Both college_educated = yes tuples have class "no", so that branch becomes the leaf "no".
  • 15. Same subset (age = middle_aged, veteran = no). The single college_educated = no tuple has class "yes", so that branch becomes the leaf "yes". The middle_aged subtree is complete: veteran -> yes: yes; no: college_educated (yes: no, no: yes).
  • 16. Subset age = senior:
      income  veteran  college_educated  support_hillary
      low     no       no                yes
      medium  no       yes               no
      medium  yes      no                yes
      high    no       yes               yes
      high    no       yes               no
    The youth and middle_aged branches are finished; the senior branch is processed next.
  • 17. Same subset (age = senior). The senior branch is split on college_educated (yes / no).
  • 18. Subset age = senior, college_educated = yes:
      income  veteran  college_educated  support_hillary
      medium  no       yes               no
      high    no       yes               yes
      high    no       yes               no
    This branch is split on income (low / medium / high).
  • 19. Same subset. There are no low-income college-educated seniors, so the income = low branch is empty.
  • 20. Same subset. The empty income = low branch is labeled by "Majority Vote": the subset holds two "no" tuples and one "yes", so the leaf is "no".
  • 21. Subset age = senior, college_educated = yes, income = medium:
      income  veteran  college_educated  support_hillary
      medium  no       yes               no
  • 22. Same subset. Its single tuple has class "no", so the income = medium branch becomes the leaf "no".
  • 23. Subset age = senior, college_educated = yes, income = high:
      income  veteran  college_educated  support_hillary
      high    no       yes               yes
      high    no       yes               no
  • 24. Same subset. The only attribute left is veteran, so the income = high branch is split on veteran (yes / no).
  • 25. Same subset. There are no veterans in it, so the veteran = yes branch is empty, and the veteran = no branch holds one "yes" and one "no". A "Majority Vote" gives a tie in both cases, so both leaves stay undecided (??? and ???).
  • 26. Subset age = senior, college_educated = no:
      income  veteran  college_educated  support_hillary
      low     no       no                yes
      medium  yes      no                yes
  • 27. Same subset. Both tuples have class "yes", so the college_educated = no branch becomes the leaf "yes".
  • 28. The finished tree:
      age
        youth -> no
        middle_aged -> veteran
          yes -> yes
          no -> college_educated
            yes -> no
            no -> yes
        senior -> college_educated
          yes -> income
            low -> no
            medium -> no
            high -> veteran
              yes -> ???
              no -> ???
          no -> yes
  • 29. Cost to grow? With n = number of attributes and D = the training set of tuples, the cost is O( n * |D| * log|D| ).
  • 30. Cost to grow? In O( n * |D| * log|D| ), the factor n * |D| is the amount of work at each tree level and log|D| is the max height of the tree.
  • 31. How do we minimize the cost? • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)
  • 32. How do we minimize the cost? • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest) • Need a heuristic to pick the “best” attribute to split on.
  • 33. The finished tree from slide 28, shown again (the two leaves under senior, college_educated = yes, income = high, veteran are still undecided).
  • 34. How do we minimize the cost?
      • Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)
      • Most common approach is “greedy”
      • Need a heuristic to pick the “best” attribute to split on.
      • The “best” attribute results in the “purest” split. Pure = all tuples belong to the same class
  • 35. A good split increases the purity of all child nodes.
  • 36. Three Heuristics: 1. Information Gain 2. Gain Ratio 3. Gini Index
  • 37. Information Gain • Ross Quinlan’s ID3 (Iterative Dichotomiser 3) uses information gain as its heuristic. • The heuristic is based on Claude Shannon’s information theory.
  • 39. Calculate Entropy for D
      D = training set                               |D| = 14
      m = number of classes                          m = 2
      i = 1,…,m
      C_i = distinct class                           C_1 = yes, C_2 = no
      C_i,D = tuples in D of class C_i               C_1,D = the "yes" tuples, C_2,D = the "no" tuples
      p_i = probability that a random tuple in D     p_1 = 5/14, p_2 = 9/14
            belongs to class C_i = |C_i,D| / |D|
  • 40. Entropy(D) = -[ 5/14 * log2(5/14) + 9/14 * log2(9/14) ]
                  = -[ .3571 * -1.4854 + .6428 * -.6374 ]
                  = -[ -.5304 + -.4097 ]
                  = .9400 bits
    Extremes (taking 0 * log2(0) as 0):
      7 vs. 7:   -[ 7/14 * log2(7/14) + 7/14 * log2(7/14) ] = 1 bit
      1 vs. 13:  -[ 1/14 * log2(1/14) + 13/14 * log2(13/14) ] = .3712 bits
      0 vs. 14:  -[ 0/14 * log2(0/14) + 14/14 * log2(14/14) ] = 0 bits
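The arithmetic on slide 40 can be checked with a few lines of code. This is a small sketch, assuming a hypothetical helper `entropy` that takes class counts rather than tuples and uses base-2 logarithms; zero counts are skipped, which matches the convention that 0 * log2(0) contributes nothing.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    # p * log2(1/p) equals -p * log2(p); empty classes are skipped.
    return sum(c / total * log2(total / c) for c in counts if c > 0)


print(f"{entropy([5, 9]):.3f}")   # 0.940  (5 yes, 9 no)
print(f"{entropy([7, 7]):.3f}")   # 1.000  (even split)
print(f"{entropy([1, 13]):.3f}")  # 0.371
print(f"{entropy([0, 14]):.3f}")  # 0.000  (pure)
```

Entropy is highest for an even split and falls to zero as the set becomes pure.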
  • 41. Entropy for D split by A
      A = attribute to split D on            E.g. age
      v = number of distinct values of A     E.g. youth, middle_aged, senior (v = 3)
      j = 1,…,v
      D_j = subset of D where A = j          E.g. all tuples where age = youth
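The slides that carried the formulas themselves are missing from this transcript. Using the symbols defined on slides 39 and 41, the standard definitions, which are consistent with the worked numbers on slides 40 and 42, would read as follows (a reconstruction, not a copy of the original slides):

```latex
\[
\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i ,
\qquad p_i = \frac{|C_{i,D}|}{|D|}
\]
\[
\mathrm{Entropy}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Entropy}(D_j)
\]
```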
  • 42. Entropy_age(D) = 5/14 * -[0/5*log2(0/5) + 5/5*log2(5/5)]
                       + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)]
                       + 5/14 * -[3/5*log2(3/5) + 2/5*log2(2/5)]
                       = .6324 bits
    Entropy_income(D) = 7/14 * -[2/7*log2(2/7) + 5/7*log2(5/7)]
                       + 4/14 * -[2/4*log2(2/4) + 2/4*log2(2/4)]
                       + 3/14 * -[1/3*log2(1/3) + 2/3*log2(2/3)]
                       = .9140 bits
    Entropy_veteran(D) = 3/14 * -[2/3*log2(2/3) + 1/3*log2(1/3)]
                       + 11/14 * -[3/11*log2(3/11) + 8/11*log2(8/11)]
                       = .8609 bits
    Entropy_college_educated(D) = 8/14 * -[6/8*log2(6/8) + 2/8*log2(2/8)]
                       + 6/14 * -[3/6*log2(3/6) + 3/6*log2(3/6)]
                       = .8921 bits
  • 43. Information Gain: Gain(A) = Entropy(D) - Entropy_A(D), where Entropy(D) is computed on the whole set of tuples D and Entropy_A(D) on the subsets of D produced by splitting on attribute A. Choose the A with the highest Gain, i.e. the split that decreases entropy the most.
  • 44. Gain(A) = Entropy(D) - Entropy_A(D)
      Gain(age) = Entropy(D) - Entropy_age(D) = .9400 - .6324 = .3076 bits
      Gain(income) = .0259 bits
      Gain(veteran) = .0790 bits
      Gain(college_educated) = .0479 bits
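To see where the four gains come from, the calculation can be run over the 14-tuple training set. This is a sketch under the definitions on slides 39 to 43; the helper names `entropy`, `split_entropy` and `gain` are illustrative choices, not names from the deck.

```python
from collections import Counter
from math import log2

COLS = ["age", "income", "veteran", "college_educated", "support_hillary"]
ROWS = """\
youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no"""
DATA = [dict(zip(COLS, line.split())) for line in ROWS.splitlines()]
TARGET = "support_hillary"


def entropy(rows):
    """Entropy (bits) of the class labels in rows."""
    counts = Counter(r[TARGET] for r in rows)
    return sum(c / len(rows) * log2(len(rows) / c) for c in counts.values())


def split_entropy(rows, attr):
    """Size-weighted entropy of the subsets produced by splitting on attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def gain(rows, attr):
    return entropy(rows) - split_entropy(rows, attr)


gains = {a: gain(DATA, a) for a in COLS[:-1]}
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:17s} {g:.3f}")
best = max(gains, key=gains.get)
print("split on:", best)  # age
```

The printed values agree with the slide to three decimals; differences in the fourth decimal come from the slide rounding Entropy(D) to .9400 before subtracting.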
  • 45. Entropy with more than 2 class values
      Entropy = -[7/13*log2(7/13) + 2/13*log2(2/13) + 2/13*log2(2/13) + 2/13*log2(2/13)] = 1.7272 bits
      Entropy = -[5/13*log2(5/13) + 1/13*log2(1/13) + 6/13*log2(6/13) + 1/13*log2(1/13)] = 1.6143 bits
  • 46. Added social security number (ss) attribute
      ss           age          income  veteran  college_educated  support_hillary
      215-98-9343  youth        low     no       no                no
      238-34-3493  youth        low     yes      no                no
      234-28-2434  middle_aged  low     no       no                yes
      243-24-2343  senior       low     no       no                yes
      634-35-2345  senior       medium  no       yes               no
      553-32-2323  senior       medium  yes      no                yes
      554-23-4324  middle_aged  medium  no       yes               no
      523-43-2343  youth        low     no       yes               no
      553-23-1223  youth        low     no       yes               no
      344-23-2321  senior       high    no       yes               yes
      212-23-1232  youth        low     no       no                no
      112-12-4521  middle_aged  high    no       yes               no
      423-13-3425  middle_aged  medium  yes      yes               yes
      423-53-4817  senior       high    no       yes               no
  • 48. Splitting on ss gives 14 branches (215-98-9343 … 423-53-4817), each leading to a single-tuple leaf. Will Information Gain split on ss? Yes, because Entropy_ss(D) = 0: every one of the 14 subsets is pure, so Entropy_ss(D) = 14 * 1/14 * -[1/1*log2(1/1) + 0/1*log2(0/1)] = 0.
  • 49. Gain ratio • C4.5, a successor of ID3, uses this heuristic. • Attempts to overcome Information Gain’s bias in favor of attributes with a large number of values.
  • 51. Gain ratio
      Gain(ss) = .9400     SplitInfo_ss(D) = log2(14) = 3.8074     GainRatio(ss) = .2469
      Gain(age) = .3076    SplitInfo_age(D) = 1.5774               GainRatio(age) = .1950
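The slide defining SplitInfo is missing from this transcript, so the sketch below assumes the usual C4.5 definition: SplitInfo_A(D) is the entropy of the partition sizes, and GainRatio(A) = Gain(A) / SplitInfo_A(D). The `ss` values are stand-in identifiers (one unique value per tuple) rather than the numbers from slide 46, and the helper names are illustrative.

```python
from collections import Counter
from math import log2

COLS = ["age", "income", "veteran", "college_educated", "support_hillary"]
ROWS = """\
youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no"""
DATA = [dict(zip(COLS, line.split())) for line in ROWS.splitlines()]
for i, row in enumerate(DATA):
    row["ss"] = f"id-{i}"  # stand-in for a unique social security number
TARGET = "support_hillary"


def bits(counts, total):
    """Entropy in bits of a distribution given as counts."""
    return sum(c / total * log2(total / c) for c in counts)


def gain(rows, attr):
    before = bits(Counter(r[TARGET] for r in rows).values(), len(rows))
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[TARGET])
    after = sum(len(g) / len(rows) * bits(Counter(g).values(), len(g))
                for g in groups.values())
    return before - after


def split_info(rows, attr):
    """Entropy of the partition sizes produced by attr (assumed C4.5 form)."""
    return bits(Counter(r[attr] for r in rows).values(), len(rows))


def gain_ratio(rows, attr):
    # Undefined when attr has a single value (split_info would be 0).
    return gain(rows, attr) / split_info(rows, attr)


for attr in ("ss", "age"):
    print(f"{attr:4s} gain={gain(DATA, attr):.3f} "
          f"split_info={split_info(DATA, attr):.3f} "
          f"ratio={gain_ratio(DATA, attr):.3f}")
```

On this tiny data set the penalty narrows the gap between ss and age considerably, but ss still scores slightly higher, which illustrates that gain ratio reduces the bias rather than removing it.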
  • 52. Gini Index • CART uses this heuristic. • Binary splits. • Unlike Info Gain, not biased toward multi-valued attributes. Figure: a three-way split on age (youth / middle_aged / senior) next to a binary split on age (senior vs. youth, middle_aged).
  • 53. Gini Index: For the attribute age the possible subsets are {youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior}, {middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}. We exclude the full set and the empty set, since neither produces a split. So, for an attribute with v values, we have to examine 2^v – 2 subsets.
  • 54. Gini Index: CALCULATE THE GINI INDEX ON EACH of the 2^v – 2 candidate subsets listed on slide 53.
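The slide with the Gini formula is missing from this transcript. The sketch below assumes the standard CART definition, Gini(D) = 1 - sum of p_i squared, with a binary split scored by the size-weighted Gini of its two sides, and enumerates the 2^v - 2 candidate subsets for age. The helper names `gini` and `gini_split` are illustrative.

```python
from collections import Counter
from itertools import combinations

COLS = ["age", "income", "veteran", "college_educated", "support_hillary"]
ROWS = """\
youth low no no no
youth low yes no no
middle_aged low no no yes
senior low no no yes
senior medium no yes no
senior medium yes no yes
middle_aged medium no yes no
youth low no yes no
youth low no yes no
senior high no yes yes
youth low no no no
middle_aged high no yes no
middle_aged medium yes yes yes
senior high no yes no"""
DATA = [dict(zip(COLS, line.split())) for line in ROWS.splitlines()]
TARGET = "support_hillary"


def gini(rows):
    """Gini impurity of the class labels in rows."""
    n = len(rows)
    counts = Counter(r[TARGET] for r in rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())


def gini_split(rows, attr, subset):
    """Weighted Gini of the binary split: attr in subset vs. attr not in subset."""
    left = [r for r in rows if r[attr] in subset]
    right = [r for r in rows if r[attr] not in subset]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


values = sorted({r["age"] for r in DATA})
# Every non-empty proper subset: 2**v - 2 candidates for v values.
candidates = [frozenset(c)
              for k in range(1, len(values))
              for c in combinations(values, k)]
scores = {s: gini_split(DATA, "age", s) for s in candidates}
for subset, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(sorted(subset), f"{score:.3f}")
best = min(scores, key=scores.get)
```

Each subset and its complement describe the same split, so only half of the 2^v - 2 candidates are distinct. Here the lowest weighted Gini belongs to {youth} versus {middle_aged, senior}.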
  • 56. Miscellaneous thoughts
      • Widely applicable to data exploration, classification and scoring tasks
      • Generate understandable rules
      • Better for predicting discrete outcomes than continuous ones (predictions are lumpy)
      • Error-prone when the number of training examples for a class is small
      • Most business cases try to predict a few broad categories