Prof. Neeraj Bhargava
Vishal Dutt
Department of Computer Science, School of
Engineering & System Sciences
MDS University, Ajmer
The Data
• Goal: build a model to predict whether an incoming email is spam.
• Analogous to insurance fraud detection.
• About 21,000 data points, each representing an email message sent to an HP scientist.
• Binary target variable
  • 1 = the message was spam: 8%
  • 0 = the message was not spam: 92%
• Predictive variables created based on frequencies of various words & characters.
The Predictive Variables
• 57 variables created
  • Frequency of “George” (the scientist’s first name)
  • Frequency of “!”, “$”, etc.
  • Frequency of long strings of capital letters
  • Frequency of “receive”, “free”, “credit”, …
  • Etc.
• Variable creation required insight that (as yet) can’t be automated.
  • Analogous to the insurance variables an insightful actuary or underwriter can create.
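To make the idea of frequency-based variables concrete, here is a minimal sketch of how such features might be computed from raw email text. The function name `email_features`, the percentage scaling, and the exact definitions of the capital-letter variables (total and average length of runs of capitals) are assumptions for illustration; the slides do not say how the 57 variables were actually built.

```python
import re

def email_features(text, words=("george", "free", "receive", "credit")):
    """Turn one email into a dict of word/character frequency features.

    Assumed definitions (not taken from the slides):
      freq_<word>     : 100 * occurrences of the word / number of words
      freq_EXCL       : 100 * number of '!' / number of characters
      freq_DOLLARSIGN : 100 * number of '$' / number of characters
      tot.CAPS        : total length of all runs of capital letters
      avg.CAPS        : average length of a run of capital letters
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    n_tokens = max(len(tokens), 1)   # guard against empty messages
    n_chars = max(len(text), 1)

    feats = {}
    for w in words:
        feats[f"freq_{w}"] = 100.0 * lowered.count(w) / n_tokens
    feats["freq_EXCL"] = 100.0 * text.count("!") / n_chars
    feats["freq_DOLLARSIGN"] = 100.0 * text.count("$") / n_chars

    caps_runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    feats["tot.CAPS"] = sum(caps_runs)
    feats["avg.CAPS"] = sum(caps_runs) / len(caps_runs) if caps_runs else 0.0
    return feats

sample = "FREE credit!!! Click NOW to receive $$$"
features = email_features(sample)
for name, value in features.items():
    print(f"{name:16s} {value:.2f}")
```

Each email becomes one row of numbers, which is the form a tree-fitting routine expects.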
Sample Data Points
Methodology
• Divide data 60%-40% into train-test.
• Use multiple techniques to fit models on train data.
• Apply the models to the test data.
• Compare their power using gains charts.
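The split and the gains-chart comparison can be sketched as follows. This is an illustrative outline only: the helper names `train_test_split` and `gains_curve`, the fixed random seed, and the toy scores are assumptions, and the slides do not specify how the gains charts were produced.

```python
import random

def train_test_split(rows, train_frac=0.6, seed=0):
    """Shuffle the rows and cut them into train and test parts (60%-40% by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(train_frac * len(rows)))
    return rows[:cut], rows[cut:]

def gains_curve(scores, labels, n_bins=10):
    """Cumulative share of all positives captured in the top-scored bins.

    Entry b-1 answers: if we flag the top b/n_bins of messages by model
    score, what fraction of all spam have we caught?
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    gains = []
    for b in range(1, n_bins + 1):
        top = order[: (b * len(order)) // n_bins]
        captured = sum(labels[i] for i in top)
        gains.append(captured / total_pos if total_pos else 0.0)
    return gains

# Toy data: 100 row ids for the split, 10 scored messages for the gains curve.
rows = list(range(100))
train, test = train_test_split(rows)

labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]          # 1 = spam
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gains = gains_curve(scores, labels, n_bins=5)
print(len(train), len(test), gains)
```

A model whose curve rises faster toward 1.0 concentrates more of the spam in its highest-scored messages, which is what "power" means in this comparison.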
Un-pruned Tree
• Just let CART keep splitting as long as it can.
• Too big.
• Messy.
• More importantly: this tree over-fits the data.
• Use cross-validation (on the train data) to prune back.
• Select the optimal sub-tree.
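For readers who want to see what one "split" involves, the sketch below searches a single numeric variable for the threshold that minimises weighted Gini impurity, the criterion CART conventionally uses for classification. The function names and the toy data are assumptions for illustration; a full tree repeats this search over every variable and then recurses into each child node.

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best rule `value < threshold`.

    Returns (None, impurity_of_parent) when no split improves on the parent.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot separate identical values
        threshold = (lo + hi) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Toy data: frequency of "$" in six messages and whether each was spam.
freq_dollar = [0.0, 0.01, 0.02, 0.30, 0.45, 0.60]
is_spam = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(freq_dollar, is_spam)
print(threshold, impurity)
```

Because the search always finds some split that lowers training impurity until the leaves are pure or tiny, an unrestricted tree keeps growing, which is why it over-fits.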
Pruning Back
• Plot cross-validated error rate vs. size of tree.
  • Note: error can actually increase if the tree is too big (over-fit).
• Looks like the (approximately) optimal tree has 52 nodes.
• So prune the tree back to 52 nodes.
[Figure: cross-validated relative error (y-axis, 0.2 to 1.0) vs. size of tree (x-axis, 1 to 83); top-axis values run from Inf down to 2.4e-05]
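Choosing the tree size from such a plot amounts to finding the minimum of the cross-validated error. The sketch below uses made-up (size, error) pairs, chosen only so that the minimum falls at 52; they are not read off the slide's plot, and `best_size` is a hypothetical helper name.

```python
# Hypothetical (size_of_tree, cross-validated relative error) pairs.
# Illustrative values only; note the error rising again for the largest trees.
cv_table = [
    (1, 1.00), (3, 0.70), (5, 0.52), (11, 0.40),
    (25, 0.33), (52, 0.29), (66, 0.30), (83, 0.31),
]

def best_size(table):
    """Smallest tree size that attains the minimum cross-validated error."""
    min_err = min(err for _, err in table)
    return min(size for size, err in table if err == min_err)

chosen = best_size(cv_table)
print(chosen)
```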
Pruned Tree #1
• The pruned tree is still pretty big.
• Can we get away with pruning the tree back even further?
• Let’s be radical and prune way back to a tree we actually wouldn’t mind looking at.
[Figure: cross-validated relative error (y-axis, 0.2 to 1.0) vs. size of tree (x-axis, 1 to 83); top axis: cp, from Inf down to 2.4e-05]
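One way to make "prune back even further" systematic is to accept the smallest tree whose cross-validated error is within some tolerance of the minimum. This is a different technique from simply picking a size by eye, as the slide does, and the table values, the tolerance, and the helper name `smallest_within` are all assumptions for illustration.

```python
# Hypothetical (size_of_tree, cross-validated relative error) pairs,
# not read off the slide's plot.
cv_table = [
    (1, 1.00), (3, 0.70), (5, 0.52), (11, 0.40),
    (25, 0.33), (52, 0.29), (66, 0.30), (83, 0.31),
]

def smallest_within(table, tolerance):
    """Smallest tree whose error is at most `tolerance` above the minimum."""
    min_err = min(err for _, err in table)
    return min(size for size, err in table if err <= min_err + tolerance)

strict = smallest_within(cv_table, tolerance=0.0)    # the minimum-error tree
radical = smallest_within(cv_table, tolerance=0.12)  # trade accuracy for readability
print(strict, radical)
```

The larger the tolerance, the smaller and more readable the tree, at the cost of some predictive accuracy.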
Pruned Tree #2
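[Figure: pruned tree with 10 splits and 11 leaves]
Split conditions: freq_DOLLARSIGN < 0.0555; freq_remove < 0.065; freq_EXCL < 0.5235; tot.CAPS < 83.5; freq_free < 0.77; freq_george >= 0.14; freq_hp >= 0.16; freq_EXCL < 0.3765; avg.CAPS < 2.92; freq_remove < 0.025
Leaves (predicted class, counts): 0 (1.061e+04/170); 0 (415/29); 1 (0/13); 1 (20/75); 0 (59/0); 1 (70/178); 0 (285/12); 0 (208/54); 1 (12/51); 1 (46/193); 1 (4/290)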
Suggests rule: many “$” signs, capital letters, and “!”, and few instances of the company name (“HP”) → spam!
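The leaf counts shown on this slide can be summarised with a little arithmetic. The sketch below assumes each pair of counts reads as non-spam/spam, which is consistent with every leaf's predicted class being its majority class, and treats the rounded figure 1.061e+04 as 10,610, so the totals are approximate. The resulting error is the error on the data used to draw the tree, not a cross-validated or test-set error.

```python
# (predicted_class, non_spam_count, spam_count) for each of the 11 leaves.
# Assumption: counts are non-spam/spam; 1.061e+04 is taken as 10610.
leaves = [
    (0, 10610, 170), (0, 415, 29), (1, 0, 13), (1, 20, 75), (0, 59, 0),
    (1, 70, 178), (0, 285, 12), (0, 208, 54), (1, 12, 51), (1, 46, 193),
    (1, 4, 290),
]

total = sum(n0 + n1 for _, n0, n1 in leaves)
total_spam = sum(n1 for _, _, n1 in leaves)

# A message is misclassified when it falls in a leaf predicting the other class.
errors = sum(n1 if pred == 0 else n0 for pred, n0, n1 in leaves)

base_rate = total_spam / total          # share of spam overall
error_rate = errors / total             # share of messages misclassified
relative_error = errors / total_spam    # relative to always predicting "not spam"

for pred, n0, n1 in leaves:
    print(f"predict {pred}: n={n0 + n1:6d}  spam share={n1 / (n0 + n1):.2f}")
print(f"total={total}  spam rate={base_rate:.3f}  "
      f"error={error_rate:.3f}  relative error={relative_error:.3f}")
```

Under these assumptions the leaves hold about 12,800 messages, roughly 60% of the 21,000 data points, with a spam share close to the 8% quoted earlier.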

19 Simple CART
