20 Simple CART


Vishal Dutt

Data Mining Lecture Series CART VII


Prof. Neeraj Bhargava
Vishal Dutt
Department of Computer Science, School of
Engineering & System Sciences
MDS University, Ajmer

CART Gains Chart
• How do the three trees compare?
• Use a gains chart on the test data.
  - Outer black line: the best one could do (the perfect model).
  - 45° line: random selection (a monkey throwing darts).
• The bigger trees are about equally good at catching 80% of the spam.
• We do lose something with the simpler tree.
[Figure: Spam Email Detection - Gains Charts. x-axis: Perc.Total.Pop; y-axis: Perc.Spam; curves: perfect model, unpruned tree, pruned tree #1, pruned tree #2.]
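The construction of a gains chart can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the slides: the function name gains_curve and the toy labels and scores are assumptions made for the example. Cases are ranked by predicted score, and the curve tracks the share of all spam captured as more of the ranked population is examined.

```python
def gains_curve(y_true, scores):
    """Return (x, y) points of a gains chart.

    x: fraction of the population examined, highest scores first.
    y: fraction of all positives (spam) captured so far.
    """
    # Rank cases from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(y_true)
    total_pos = sum(y_true)
    xs, ys = [0.0], [0.0]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        xs.append(rank / n)
        ys.append(captured / total_pos)
    return xs, ys


# Hypothetical toy data: 1 = spam, 0 = not spam.
y = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.05, 0.01]
xs, ys = gains_curve(y, s)
for px, py in zip(xs, ys):
    print(f"{px:.1f} of population -> {py:.2f} of spam")
```

A perfect model reaches 100% of the spam once the examined fraction equals the spam rate; a random ranking follows the 45° line on average.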

Other Models
• Fit a purely additive MARS model to the data.
  - No interactions among basis functions
• Fit a neural network with 3 hidden nodes.
• Fit a logistic regression (GLM).
  - Using the 20 strongest variables
• Fit an ordinary multiple regression.
  - A statistical sin: the target is binary, not normal
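One way to see why ordinary regression on a binary target is a "statistical sin" is that its fitted values are not confined to [0, 1]. The sketch below uses made-up data and hypothetical helper names (fit_ols, sigmoid); it is only an illustration, not the models fitted in the slides.

```python
import math


def fit_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy data: a binary target that tends to 1 as x grows.
x = list(range(10))
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_ols(x, y)
ols_low = b0 + b1 * x[0]    # fitted value at the smallest x
ols_high = b0 + b1 * x[-1]  # fitted value at the largest x
print(ols_low, ols_high)    # one falls below 0, the other above 1

# A logistic model passes its linear predictor through the sigmoid,
# so its output is always a valid probability.
print(sigmoid(-10.0), sigmoid(10.0))
```

The linear fit produces "probabilities" below 0 and above 1 at the extremes, which the logistic link rules out by construction.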

GLM model
Logistic regression run on 20 of the most powerful predictive variables

Comparison of Techniques
• All techniques add value.
• MARS/NNET beats GLM.
  - But note: we used all variables for MARS/NNET; only 20 for GLM.
• GLM beats CART.
• In real life we'd probably use the GLM model but refer to the tree for "rules" and intuition.
[Figure: Spam Email Detection - Gains Charts. x-axis: Perc.Total.Pop; y-axis: Perc.Spam; curves: perfect model, mars, neural net, pruned tree #1, glm, regression.]

Parting Shot: Hybrid GLM model
• We can use the simple decision tree (#3) to motivate the creation of two "interaction" terms:
  - "Goodnode": (freq_$ < 0.0565) & (freq_remove < 0.065) & (freq_! < 0.524)
  - "Badnode": (freq_$ > 0.0565) & (freq_hp < 0.16) & (freq_! > 0.375)
• We read these off tree (#3).
• Code them as {0,1} dummy variables.
• Include them in the GLM model.
• At the same time, remove terms that are no longer significant.
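The two node indicators can be coded directly from the rules above. This is a minimal sketch: the thresholds are the ones quoted on the slide, while the function name node_indicators, the dict-per-email layout, and the example values are assumptions made for illustration.

```python
def node_indicators(row):
    """Return {0,1} dummies for the two tree nodes described above.

    `row` maps variable names (freq_$, freq_remove, freq_!, freq_hp)
    to their values for one email.
    """
    good = (
        row["freq_$"] < 0.0565
        and row["freq_remove"] < 0.065
        and row["freq_!"] < 0.524
    )
    bad = (
        row["freq_$"] > 0.0565
        and row["freq_hp"] < 0.16
        and row["freq_!"] > 0.375
    )
    return {"goodnode": int(good), "badnode": int(bad)}


# Hypothetical example emails (values invented for illustration).
quiet_mail = {"freq_$": 0.0, "freq_remove": 0.0, "freq_!": 0.1, "freq_hp": 0.5}
loud_mail = {"freq_$": 0.3, "freq_remove": 0.2, "freq_!": 0.9, "freq_hp": 0.0}
mixed_mail = {"freq_$": 0.3, "freq_remove": 0.0, "freq_!": 0.1, "freq_hp": 0.0}

print(node_indicators(quiet_mail))  # falls in the "good" node
print(node_indicators(loud_mail))   # falls in the "bad" node
print(node_indicators(mixed_mail))  # falls in neither node
```

Because the two rules split on opposite sides of the freq_$ threshold, an email can never be flagged as both. The resulting columns would be appended to the GLM's design matrix alongside the original predictors.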

Hybrid GLM model
• The Goodnode and Badnode indicators are highly significant.
• Note that we also removed 5 variables that were in the original GLM.

Hybrid Model Result
• Slight improvement over the original GLM.
  - See gains chart
  - See confusion matrix
• Improvement not huge in this particular model…
• … but proves the concept
[Figure: Spam Email Detection - Gains Charts. x-axis: Perc.Total.Pop; y-axis: Perc.Spam; curves: perfect model, neural net, decision tree #2, glm, hybrid glm.]
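The confusion matrix mentioned above is not reproduced in this transcript. As a reminder of what it tabulates, here is a minimal sketch in plain Python; the function name confusion_matrix, the 0.5 cutoff, and the toy data are assumptions, not values from the slides.

```python
def confusion_matrix(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives at a score cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1   # spam correctly flagged
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1   # legitimate mail flagged as spam
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1   # legitimate mail correctly passed
        else:
            counts["fn"] += 1   # spam that slipped through
    return counts


# Hypothetical toy data: 1 = spam, 0 = not spam.
labels = [1, 1, 0, 0, 1, 0]
model_scores = [0.9, 0.4, 0.6, 0.1, 0.7, 0.2]
cm = confusion_matrix(labels, model_scores)
print(cm)
```

Comparing two models, such as the original and hybrid GLM, then amounts to computing these counts for each on the same test data and cutoff.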

Concluding Thoughts
• In many cases, CART will likely under-perform tried-and-true techniques like GLM.
  - Poor at handling linear structure
  - Data gets chopped thinner at each split
• BUT: it is highly intuitive and a great way to:
  - Get a feel for your data
  - Select variables
  - Search for interactions
  - Search for "rules"
  - Bin variables
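To make the "bin variables" use concrete, the sketch below finds the single cut point on one variable that minimizes the weighted Gini impurity of the two resulting groups, which is the basic step a CART-style tree repeats at every node. The function names and toy data are assumptions; real CART implementations add pruning, multiple predictors, and other refinements not shown here.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(x, y):
    """Return (threshold, weighted impurity) of the best single cut on x.

    The threshold is None if no cut improves on the unsplit impurity.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_thr, best_imp = None, gini(y)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between identical values
        thr = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp


# Hypothetical toy data: a frequency variable and a spam flag.
freq = [0.0, 0.1, 0.2, 0.6, 0.7, 0.9]
spam = [0, 0, 0, 1, 1, 1]
thr, imp = best_split(freq, spam)
print(thr, imp)
```

The returned threshold defines a two-bin version of the variable. Each side holds fewer cases than the parent, which is the "chopped thinner" effect noted above.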



Neural Net Weights
[Figure: weights of the fitted neural network; image not reproduced.]
