CART Modeling Strategies Slide 1
CART Modeling Strategies For
Experienced Data Analysts
• CART takes a significant step towards
automated data analysis
– One of CART’s predecessors was called
Automatic Interaction Detector (AID)
• Nevertheless, high quality CART results
require careful planning & expert guidance
• No realistic prospect that CART analyses or
any other sophisticated modeling can be
automated in the near term
CART Modeling Strategies Slide 2
All data analysis, regardless
of methods employed, has
certain prerequisites
• Complete understanding of the data
available
– Correct variable definitions
– Sample sources and relationship to study
population
– Review of conventional summary statistics,
percentiles
– Standard reports that would be generated in the
process of data integrity checks
– Calculations verified: check that totals can be
generated from components
– Consistency checks: related fields do not conflict
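The calculation and consistency checks above can be sketched in a few lines. This is only an illustration: the record layout and the field names (`total`, `parts`, `open_year`, `close_year`) are hypothetical and stand in for whatever totals and related fields the real data contain.

```python
# Hypothetical records: field names are illustrative, not taken from any real file.
records = [
    {"id": 1, "total": 150.0, "parts": [100.0, 50.0], "open_year": 1990, "close_year": 1995},
    {"id": 2, "total": 80.0, "parts": [30.0, 40.0], "open_year": 1992, "close_year": None},
    {"id": 3, "total": 60.0, "parts": [60.0], "open_year": 1996, "close_year": 1994},
]

def total_mismatches(rows, tol=1e-9):
    """Return ids whose stated total differs from the sum of its components."""
    return [r["id"] for r in rows if abs(r["total"] - sum(r["parts"])) > tol]

def date_conflicts(rows):
    """Return ids whose closing year precedes the opening year."""
    return [
        r["id"] for r in rows
        if r["close_year"] is not None and r["close_year"] < r["open_year"]
    ]

bad_totals = total_mismatches(records)
bad_dates = date_conflicts(records)
print("total mismatches:", bad_totals)
print("date conflicts:", bad_dates)
```

Records flagged by either check are candidates for review before any tree is grown.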
CART Modeling Strategies Slide 3
Careful data preparation
• CART is far better suited to dirty data analysis
than conventional statistical modeling or NN tools
– capable of dealing with missing values, outliers
• Nevertheless, considerable benefits to proper
data preparation
– the better the data the better a model can perform
• Includes
– correct identification of missing value codes (998
valid or .)
– uniform data handling when records come from
different entities (branches, regions, behavioral
groups)
– if responder data is processed separately from and
differently than non-responder data, completely
erroneous results will be produced
CART Modeling Strategies Slide 4
Some core preparatory steps
• Identify illegal variables to be excluded from all
models
– ID variables
– post event variables
– variables unlikely to be available in future, or
against which CART model is intended to compete
(eg Bankruptcy scores)
– variables disallowed by regulators (banking,
insurance)
– variables derived in part from dependent variables,
or generated from target variable behavior
– variables too closely connected to target for any
reason
CART Modeling Strategies Slide 5
Exploratory Data Analysis with
CART:
Pre-modeling
• Run a single split tree and report all competitors
– ranks ability of all variables to separate target
variable into homogeneous groups
– command settings
LIMIT DEPTH=1
ERROR EXPLORE
BOPTIONS COMPETITORS=large number
• Run limited depth trees for target using one
predictor at a time (again exploratory--non-tested
trees)
– LIMIT DEPTH=2 (up to 4 nodes) or LIMIT DEPTH=3
(up to 8 nodes) (actual number depends on
redundant node pruning)
– provides optimal binning of variables
– binned versions could be used in parametric models
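To make the single-split ranking concrete, here is a minimal Python sketch of the idea behind a depth-1 tree with a full competitor list. It is not CART's implementation: it handles numeric predictors only, ignores priors and missing values, and the toy variables `age` and `noise` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini diversity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Best threshold on one numeric predictor, scored by Gini improvement."""
    parent = gini(labels)
    n = len(labels)
    best_gain, best_cut = 0.0, None
    distinct = sorted(set(values))
    # Candidate cut points are midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_gain, best_cut

def rank_predictors(data, target):
    """Rank predictors by the improvement of their best single split."""
    scores = {name: best_split(col, target) for name, col in data.items()}
    return sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)

# Toy data (hypothetical): 'age' separates the target cleanly, 'noise' does not.
target = [0, 0, 0, 1, 1, 1]
data = {"age": [21, 25, 30, 45, 50, 62], "noise": [3, 9, 1, 4, 8, 2]}
ranking = rank_predictors(data, target)
for name, (gain, cut) in ranking:
    print(name, round(gain, 4), cut)
```

The cut point returned for each variable is also a crude two-bin version of the "optimal binning" mentioned above; deeper single-predictor trees would give more bins.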
CART Modeling Strategies Slide 6
The CART Non-linear
Correlation Matrix
• Run CART models using every pair of legal
variables
– should be unlimited depth
– could be tested or exploratory
– will detect non-linear dependencies
• Results will be asymmetric
– results can be used to fill out a correlation matrix
• Alternate Procedure
– run simple regressions using all pairs of variables
– use CART to predict residuals
– correlation determined by both linear and CART
components
CART Modeling Strategies Slide 7
Example Pearson and CART
correlation Matrices
• From Kerry
CART Modeling Strategies Slide 8
CART Affiliation Matrices
• Select a group of interesting variables
• Let each variable in turn be the target variable,
all others in group are predictors
• Grow standard trees (not depth limited) with test
procedure to prune
• Each column in matrix is a target variable
• Rows are filled with importance scores (scaled to
0,1)
• Provides a picture of variable interdependencies
• Can highlight surprise relationships between
predictors
– can help in detecting data errors
– when affiliations are stronger or weaker than expected
CART Modeling Strategies Slide 9
Detection of multivariate
outliers
• Grow CART tree for every variable as
predicted by a trimmed down variable list
• Predict each variable in turn from all other
variables
• Restrict trees to moderate to large terminal
nodes
– use ATOM or MINCHILD controls
• For regression: measure deviation of each
data point from predicted
• For classification: check if class value of
data point is rare in predicted terminal node
• Use results to investigate unusual
observations
CART Modeling Strategies Slide 10
Once data QC is complete
serious CART modeling can
begin
• Need to understand nature of problem:
– what would be the appropriate statistical models to
use for problem at hand
– e.g. is problem a simple binary outcome (respond or
not to a direct mail piece)
– alternatively, does it have an inherent time
dimension (how long will customer remain customer
-- telecommunications churn)
latter problem involves censored data
– is study of a fundamentally time series or panel data
type
– then need to allow for lagged variables, etc.
CART Modeling Strategies Slide 11
CART cannot protect you from
using an improper analysis
strategy
• CART will help you execute your analysis strategy
more quickly and often more accurately
• If the modeling strategy you have selected will
produce biased results CART may just exacerbate
the problem
• A definitive modeling approach is not required,
but a defensible approach is
CART Modeling Strategies Slide 12
Example: Targeting model for a
catalog to maximize profit
• Sensible to model in stages
– 1) yes/no response model: use classification tree
– 2) Dollar volume of order for those who do respond
modeled conditional on response=yes
modeled just on subset of responders
regression tree plausible
or classification tree on binned order amounts
– Final model could be an expected profit model
prob(respond)*Expected(Revenue| Respond)
model could be all CART, all logit, or a mixture
such models discussed later
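As a small illustration of the final stage, the sketch below combines the two stage outputs into prob(respond)*Expected(Revenue|Respond) and ranks customers by the result. The customer scores are made up; in practice `p` would come from the response model and `rev` from the order-amount model.

```python
def expected_value(p_respond, revenue_if_respond):
    """prob(respond) * Expected(Revenue | Respond) for one customer."""
    return p_respond * revenue_if_respond

# Hypothetical scores from the two stage models for three customers.
customers = [
    {"id": "A", "p": 0.10, "rev": 80.0},
    {"id": "B", "p": 0.02, "rev": 300.0},
    {"id": "C", "p": 0.30, "rev": 10.0},
]
ranked = sorted(customers, key=lambda c: expected_value(c["p"], c["rev"]), reverse=True)
order = [c["id"] for c in ranked]
print(order)
```

Note that customer C has the highest response probability yet ranks last, which is why a response model alone is not enough for a profit objective.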
CART Modeling Strategies Slide 13
Modeling strategy will also
dictate test strategy
• Suppose we are tracking purchase behavior over
time
• Data organized as one record per purchase
opportunity
• The unit of observation will be a complete case
history
– ideally will want to assign some complete case
histories to training data
– other entire case histories to test data
– important not to allow random assignment between
train and test on a record by record basis
– might want to hold back some records from longer
case histories as an additional source of test data
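One way to keep whole case histories together is to randomize over case identifiers rather than over records. The sketch below assumes a hypothetical `customer` key identifying the case history and a 20% test fraction; both are illustrative choices.

```python
import random

def split_by_case(records, key, test_fraction=0.2, seed=0):
    """Assign every record of a case history to the same partition."""
    case_ids = sorted({r[key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(case_ids)
    n_test = max(1, round(len(case_ids) * test_fraction))
    test_ids = set(case_ids[:n_test])
    learn = [r for r in records if r[key] not in test_ids]
    test = [r for r in records if r[key] in test_ids]
    return learn, test

# Hypothetical data: 10 customers with 3 purchase opportunities each.
records = [{"customer": c, "period": t} for c in range(10) for t in range(3)]
learn, test = split_by_case(records, key="customer")
print(len(learn), len(test))
```

Because assignment happens at the customer level, no customer contributes records to both the learn and the test sample.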
CART Modeling Strategies Slide 14
Initial CART analyses are
strictly exploratory
• Intended to reveal summary and descriptive
information about the data
• Omnibus Model: dependent variable(s) fit to
virtually all legal variables
– Certain obvious exclusions necessary: ID
numbers, clones and transforms of the dependent
variable as discussed above
– Omnibus Model reveals something about the
predictability of the dependent variable
– recall that largest tree has error no more than
twice Bayes rate
CART Modeling Strategies Slide 15
Determine Splitting Rule to
Use
• Gini, Twoing, power modified Twoing for
classification
– possibly ordered twoing
• Least squares (LS) or Least Absolute Deviation
(LAD) for regression
• Best splitting rule can be selected very early in
project and typically does not have to be revisited
CART Modeling Strategies Slide 16
Assess agreement among
different test methods
• If data set is small cross validation is required
• In this case rerun trees several times with
different starting random number seeds
– use to assess stability of size and error rate of best
trees
• With large data sets reassign cases between
learn and test several times
– initial check is on error rates and sizes of best trees
CART Modeling Strategies Slide 17
Run all as batch of startup
CART trees
• Using three or four splitting rules, and three or
four test sets will get some initial feel for
predictability of target variable
• Useful to develop some text processing scripts to
extract the most interesting components of the classic
CART reports
– tree sequence
– misclassification results (which classes are wrong)
– prediction success table
– importance rankings
latter can be aggregated as follows:
add up all importance scores for each variable across
all trees
rescale so that highest score is 100
• LOPTION NOPRINT gives summary tables only
– no tree detail; very helpful when trees tend to be
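The importance aggregation described on this slide (sum each variable's score across all trees, then rescale so the highest is 100) might look like the following once the scores have been extracted from the reports. The variable names and scores are hypothetical, and the sketch assumes at least one positive score.

```python
def aggregate_importance(runs):
    """Sum each variable's importance over all trees, then rescale so max = 100."""
    totals = {}
    for scores in runs:
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value
    top = max(totals.values())
    return {name: 100.0 * value / top for name, value in totals.items()}

# Hypothetical importance scores extracted from three CART reports.
runs = [
    {"income": 100.0, "age": 40.0},
    {"income": 80.0, "age": 100.0, "tenure": 20.0},
    {"income": 100.0, "tenure": 30.0},
]
combined = aggregate_importance(runs)
for name, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {score:6.1f}")
```

A variable missing from a given tree simply contributes nothing to its total for that run.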
CART Modeling Strategies Slide 18
Derived variables almost
certainly need to be created
• Almost impossible to develop high performance
models without analyst creation of derived
variables
• Many derived variables are “obvious” to domain
specialists
– to predict purchase amounts look at customer
lifetime totals
– possibly aggregate previous purchases into
category subtotals
– calculate trend; have orders been increasing or
decreasing over time?
• Consider standard statistical summaries of
groups of variables:
– mean, standard deviation, min, max, trend
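A sketch of such summaries for one group of related fields follows. Two choices here are assumptions rather than anything prescribed above: the standard deviation is the population form, and "trend" is taken to be the least-squares slope of the values against their position in time order.

```python
from statistics import mean, pstdev

def trend(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    x_bar = (n - 1) / 2.0
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    return sxy / sxx if sxx else 0.0

def summarize(values):
    """Derived variables summarizing a group of related fields."""
    return {
        "mean": mean(values),
        "sd": pstdev(values),
        "min": min(values),
        "max": max(values),
        "trend": trend(values),
    }

# Hypothetical order amounts for one customer over four periods.
orders = [10.0, 20.0, 30.0, 40.0]
features = summarize(orders)
print(features)
```

A positive `trend` indicates orders increasing over time; a negative one, decreasing.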
CART Modeling Strategies Slide 19
Use linear combination splits
to search for new derived
variables
• Linear combinations found by CART can suggest
new derived variables
• Recommend that the delete option be set high
and that the required sample size also be
substantial
• LINEAR N=1000 DELETE=.4
– permits linear combination splits only in nodes with
more than 1,000 cases
– the higher the DELETE parameter the fewer terms in
the combination
• E.g.
CART Modeling Strategies Slide 20
Results of first models are
used to generate the first cut
back list of predictors
• List is determined through a combination of
judgment and perusal of initial CART runs
• Purpose is error avoidance, exclusion of
nuisance, pernicious and not believable variables
• Variables that seem odd in the context, and thus
probably should not have predictive value, are also
excluded
– Important not to exclude any variables that prior
knowledge, conventional wisdom would include
– Purpose of this stage is not radical pruning but
elimination of valueless variables
CART Modeling Strategies Slide 21
Can be useful to explore trees
for selected predictor variables
or other variables of interest
• Can think of the CART tree as an extended
non-parametric version of correlation
analysis
• Results simply reveal what variables are in
some way associated in the data
• Could construct a table of variables in the
columns against variables that predict in
the rows
CART Modeling Strategies Slide 22
Same procedure could be
used to impute values
for missing data points
• Actual procedure is complex and will be
discussed in another context
• Our proposed missing value imputation
procedure is iterative
• Also might start selecting complexity values
that restrain growth of trees to reasonable
sizes
– A large data set might allow trees with many
hundreds of terminal nodes
– Yet optimal models might fall into the 20-100
terminal node size
CART Modeling Strategies Slide 23
Next set of models should
explore the impact of
alternative splitting and testing
rules
• Useful to look at GINI, TWOING, and
TWOING POWER=1
• Useful to compare external test data with
cross-validation in smaller data sets
• These runs may suggest which splitting
rules are most promising for further work
• In most problems the default GINI is the
best rule to use
– Definitively better than ENTROPY, often slightly
better than TWOING
CART Modeling Strategies Slide 24
Impact of alternative splitting
and testing rules; continued
• In some problems, usually problems with
poor predictability, TWOING, POWER=1
works well
– e.g. Relative error in best GINI tree is .8 or
higher
– In these cases, the more balanced splitting
strategy seems to yield better trees
CART Modeling Strategies Slide 25
Also want to compare results
from different test procedures
• Compare runs with different subsets of test
data randomly chosen from larger data sets
• e.g., Create two uniform random variables
– %LET TEST20A=urn <0.20
– %LET TEST20B=urn >0.20
– Use TEST20A to pick out test sample in one run
and use TEST20B in another run
CART Modeling Strategies Slide 26
We hope results will be very
similar across test sets
• Approximate size of optimal tree
• Approximate relative error
• Importance ranking of variables — which
variables appear near top of list
• Reasonable overlap of primary splitters in
trees
CART Modeling Strategies Slide 27
Instability of results across test
data sets is a warning sign
• May need to carefully review interdependencies
of predictor variables
• Results may be due to a set of closely competing
predictors with different information content
• If so, will want to consider whether one or more of
these competitors should be dropped
• In this case, a judgment is made concerning
variables to exclude from the model
• Results may be unstable due to inherent variance
of the tree predictor
• In this case, will ultimately want to consider
aggregation of experts discussed below
CART Modeling Strategies Slide 28
Experiments with Linear
Combination Splits
• Linear combinations are occasionally instructive
• Not useful when many variables are involved
• We recommend restriction to 2-variable linear
combinations
• Helpful if there are strictly positive variables
transformed to logs
– 2-variable linear combination might reveal a form
like
c1*log(X1) - c2*log(X2),
which is the log of a ratio of (powers of) the predictors
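The step from the log form to a ratio can be written out explicitly. The split threshold k is a symbol introduced here for illustration; the derivation assumes natural logs and strictly positive X1 and X2.

```latex
c_1 \log X_1 - c_2 \log X_2 \le k
\;\iff\; \log\!\left(\frac{X_1^{c_1}}{X_2^{c_2}}\right) \le k
\;\iff\; \frac{X_1^{c_1}}{X_2^{c_2}} \le e^{k}
% Special case c_1 = c_2 = c > 0: the split is on the plain ratio
\frac{X_1}{X_2} \le e^{k/c}
```

So when the two coefficients are close in size, the ratio X1/X2 itself is a natural candidate for a new derived variable.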
CART Modeling Strategies Slide 29
Reading CART results
• Useful to prepare a series of summary reports
after CART runs are done
• One report should just include the TREE
SEQUENCE
– Reveals the size of the optimal tree, relative error
rate
– Can be used to reject certain runs – too large, too
small, too inaccurate
• Another report extracts just the split variables:
– Contains a listing of the node split variables
– Provides a brief outline of how the tree evolved
CART Modeling Strategies Slide 30
Reports are used to select
trees that appear to be
promising
• It is possible that no promising trees are
found in the early rounds of analysis
• Attractive trees need to be printed to
facilitate absorption of the implicit model
CART Modeling Strategies Slide 31
Currently we use
allCLEAR to print
• Future CART will include its own pretty print but
will still support allCLEAR
• We request the “splits” level of detail in the
output
– Includes split variable, split value, class assignment
– Table of class distribution in the node might be too
voluminous
CART Modeling Strategies Slide 32
Trees need to be read for
the story they tell and
assessed for plausibility
• Particularly at the higher levels of the tree
(lower levels might disappear with pruning)
• Does the predictive model agree with
intuition and prior expectations?
CART Modeling Strategies Slide 33
When troubling patterns
emerge, need to look at the
competitors of a node
• Reveals what other variable would be used to
split the node if the main splitter were not
available
• If the competitor is more acceptable than the
primary in a node can consider dropping the
primary
• Method will only work if analyst is willing to
exclude the variable from anywhere in the tree
• On the basis of these reports and prints can
determine candidate second round models
CART Modeling Strategies Slide 34
Now can move on to tools
for model refinement
• Selection of right-sized trees based on
judgment
• Altering costs of misclassification
• Creation of new variables
CART Modeling Strategies Slide 35
Judgmental Pruning of Trees:
A necessary step in
model development
• When the CART monograph was published in
1984 the authors suggested that the best tree
was the “one-se-rule tree”
• This is the smallest tree within one standard
error of the minimum cost tree
• The reasoning was: all trees within a one
standard error band are statistically
indistinguishable, and small trees are
inherently more comprehensible and preferable
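The one-SE rule is easy to apply mechanically to a tree sequence. The sketch below uses a made-up pruning sequence of (terminal nodes, cost, standard error) tuples; the figures are hypothetical and do not come from any CART report.

```python
def one_se_tree(sequence):
    """Smallest tree whose cost is within one SE of the minimum-cost tree.

    `sequence` holds (terminal_nodes, cost, standard_error) tuples.
    """
    best = min(sequence, key=lambda t: t[1])
    threshold = best[1] + best[2]
    eligible = [t for t in sequence if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])

# Hypothetical pruning sequence: (terminal nodes, test-sample cost, SE).
sequence = [
    (40, 0.310, 0.012),
    (25, 0.305, 0.012),
    (12, 0.314, 0.012),
    (5, 0.350, 0.013),
    (1, 1.000, 0.000),
]
minimum_cost = min(sequence, key=lambda t: t[1])
chosen = one_se_tree(sequence)
print("minimum-cost tree:", minimum_cost[0], "nodes")
print("one-SE tree:", chosen[0], "nodes")
```

In this made-up sequence the rule trades the 25-node minimum-cost tree for a 12-node tree whose cost is statistically indistinguishable from it.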
CART Modeling Strategies Slide 36
Judgmental Pruning of Trees:
continued
• The current view of the CART originators is that
one should accept the literal minimum cost tree
produced by CART
• This view is based on a further dozen years of
experience which has revealed that the “one-
se-rule” may be too conservative
• Nonetheless, compelling reasons exist to prefer
smaller trees in data-mining investigations
CART Modeling Strategies Slide 37
In data-mining exercises
trees can easily grow to
unmanageable depths
• With the prodigious volumes of warehoused data, greedy
analysis tools can develop complex models without
restraint
• Paradoxically, the large quantities of data can serve to
mislead
• The problem is similar to that noted by statisticians who
first analyzed large national probability sample
databases: in regression, t-test, and chi-square tests,
almost every estimated coefficient is “significantly”
different from zero, and every null is rejected
• In the tree-growing context, elaborate trees of great
depth appear to perform extremely well even on
independent hold-out samples
CART Modeling Strategies Slide 38
A way to “discount”
findings based on very
large data sets is needed
• The solution in the conventional modeling context
has been to adjust the significance level required
before placing too much faith in a finding
• For example, a t-statistic of 2.2 for a regression
coefficient based on 30 degrees of freedom
should be considered more compelling than the
same t-statistic based on 100,000 degrees of
freedom
• In the CART context it would be useful to have
optimal tree size selection criteria that adapted to
the volume of data available
CART Modeling Strategies Slide 39
Three tools for adjusting
an analysis to data richness
are available in CART
• The ATOM or minimum node size available
for splitting: as the data set size increases,
ATOM size can also be increased (perhaps
with the log of sample size)
– The thinking is: as data sets increase in size,
require the amount of data needed to support a
split to increase also
CART Modeling Strategies Slide 40
Three tools for adjusting
an analysis; continued
• The minimum child size can also be adjusted.
MINCHILD prevents CART from splitting off nodes too
small to support separate analysis
– For example, we might not want to attempt inferring the
probability of prepay in any node containing less than 100
observations
– MINCHILD and ATOM are closely related but are different
concepts. MINCHILD guarantees that no terminal node will
ever be smaller than its predetermined value. ATOM
determines the minimum size of a node that is eligible to
be split. ATOM must always be at least 2*MINCHILD so
that if the smallest node eligible for splitting is split into
two equal parts, each part will be at least as large as
MINCHILD.
• Trees other than the “optimal” tree can be PICKED from
the tree sequence
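The relationship between ATOM and MINCHILD can be checked before a run is submitted. In the sketch below only the constraint ATOM >= 2*MINCHILD comes from the slides; the scaling rule (ATOM proportional to log10 of the sample size, with a multiplier of 20) is a purely hypothetical illustration of "perhaps with the log of sample size".

```python
import math

def check_node_limits(atom, minchild):
    """Raise if a node of size ATOM could not yield two children of size MINCHILD."""
    if atom < 2 * minchild:
        raise ValueError(f"ATOM={atom} must be at least 2*MINCHILD={2 * minchild}")
    return atom, minchild

def scaled_limits(n_cases, atom_per_log=20):
    """Hypothetical rule: let ATOM grow with log10 of the sample size."""
    atom = max(2, round(atom_per_log * math.log10(n_cases)))
    minchild = atom // 2  # largest MINCHILD consistent with this ATOM
    return check_node_limits(atom, minchild)

small = scaled_limits(1_000)
large = scaled_limits(1_000_000)
print("n=1,000     ATOM, MINCHILD =", small)
print("n=1,000,000 ATOM, MINCHILD =", large)
```

Any other monotone rule would serve equally well; the point is only that the two limits are set together and grow with the data.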
CART Modeling Strategies Slide 41
The third tool is selection of a
tree from the CART sequence
• Analyst intervention in tree selection is both
desirable and unavoidable
• Allows the incorporation of prior knowledge and
domain expertise
• This type of selection is really just pruning: the
analyst decides to prune back further than the CART
algorithms recommend
• Topic is mentioned briefly in the CART monograph
where the authors discuss their decision to eliminate
one or two nodes near the bottom of a medical
diagnosis tree:
– MD’s running the study did not believe that these lower
level splits captured the underlying biology
• This is similar to a statistician deciding to exclude a
borderline significant interaction in a regression
CART Modeling Strategies Slide 42
In the data-mining context,
tree selection can be guided by
the relative error plot
• Each CART run produces a plot of relative error
against number of nodes and the relative error is
printed on the TREE SEQUENCE report
• In data mining these plots have a characteristic
shape: steep declines in the relative error as tree
initially evolves followed by lengthy flat portions in
which further error reduction is extremely small with
each additional node
• Further, the test data support the hypothesis that
many of these error reductions are “statistically
significant.” In the CART context the claim is that the
more complex larger trees will predict well on fresh
data and thus contain valuable information.
CART Modeling Strategies Slide 43
An analyst could defensibly
decide to trade off a large
block
of nodes for a small “increase”
in prediction error
• In one of our CART models the “optimal” tree had
100 terminal nodes and a relative error of 0.333968
+/- 0.00578
• Yet the sub-tree with 63 terminal nodes only has a
relative error of 0.34339, a one-point apparent loss
in accuracy.
• And 29 terminal nodes yield a relative error of
0.38564
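The trade-off can be tabulated directly from the three figures quoted on this slide. Expressing the error increase in units of the reported +/- 0.00578 assumes that figure is the standard error of the optimal tree's relative error.

```python
# (terminal nodes, relative error) as quoted on the slide.
optimal = (100, 0.333968)
optimal_se = 0.00578  # assumed to be the SE of the optimal tree's relative error
candidates = [(63, 0.34339), (29, 0.38564)]

def tradeoff(candidate, reference=optimal, se=optimal_se):
    """Nodes saved and error given up relative to the minimum-cost tree."""
    nodes, error = candidate
    nodes_saved = reference[0] - nodes
    error_increase = error - reference[1]
    return nodes_saved, error_increase, error_increase / se

report = [tradeoff(c) for c in candidates]
for (nodes, _), (saved, inc, in_se) in zip(candidates, report):
    print(f"{nodes} nodes: saves {saved} nodes, error +{inc:.4f} ({in_se:.1f} SE)")
```

The first 37 nodes cost about one point of relative error; the next 34 cost considerably more, which is the kind of pattern an analyst weighs when pruning by judgment.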
CART Modeling Strategies Slide 44
Final tree selection based on
the relative error plot alone
• In many applications it will be difficult to
make a final tree selection based on the
relative error plot alone
• The plot reveals many opportunities for
selection, but rarely serves to single out a
best tree
• In some problems it is possible to find the
tree that exhausts all substantial
improvements and that separates a steeply
sloping section from a flat plateau
CART Modeling Strategies Slide 45
The next step of tree
assessment
• Careful review of a relatively large tree
chosen by CART
• Examination of a large tree node-by-node
will be very instructive
• We are assuming that the early splits of the
tree have already been examined and found
to be convincing and acceptable
CART Modeling Strategies Slide 46
Review of a relatively large
tree chosen by CART
• Purpose of this stage of review is to consider the
lower branches:
– Do any of the splits appear fortuitous or not
particularly believable?
– Are the same variables being used repeatedly to
minutely subdivide a predictor?
– Is it worth pursuing additional refinement of the sub-
sample reached at a particular juncture in the tree?
– Is there any concern for whatever reason that the
splits are not reasonable representations of reality?
CART Modeling Strategies Slide 47
Additional Considerations
• The tree that results when questionable or
low value sections of the CART optimal tree
are dropped should be considered
– Unfortunately, there appears to be no substitute for
the careful and detailed examination of the CART
tree node-by-node
– However, the only contribution of judgment here is
to eliminate nodes that are thought to be the result
of over-fitting
CART Modeling Strategies Slide 48
Goodness-Of-Fit Measures
for Classification Trees
in Classic CART
• CART classification trees automatically generate
diagnostic reports
– Relative Error Rate for all trees in pruned sequence
– Misclassification Rate By Class for Learn and Test
data
– Misclassification Table: Actual vs. Predicted Class
• CART class probability trees display only the
relative error sequence
• Although these reports are helpful in sorting out
the most promising trees early on in CART
analyses, they contain far less information than
needed for proper model assessment
CART Modeling Strategies Slide 49
Characteristics of the CART
GINI Measure
• Measure is zero whenever a node is pure
• Most CART trees are grown and pruned using the
Gini measure of within node diversity
• Gini is largest when distribution of classes in a
node is uniform
• CART trees usually grown with priors EQUAL
– Essential to encourage promising tree evolution
when class distribution is skewed
– Practical impact is to make CART strive for
roughly equal accuracy in all classes
– Priors DATA and priors MIX rarely work well
• CART Gini measure will then be priors adjusted
$i(t) = 1 - \sum_i p_i^2$
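A short sketch of the Gini measure and the effect of priors adjustment follows. The adjustment used here, weighting class j in a node by prior_j * N_j(node) / N_j(root), is the usual CART definition of priors-adjusted node probabilities as best understood here; the counts are hypothetical.

```python
def gini_from_probs(probs):
    """Gini diversity i(t) = 1 - sum_i p_i^2 for within-node class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def node_probs(node_counts, root_counts=None, priors=None):
    """Within-node class probabilities, raw or priors adjusted.

    With priors, class j is weighted by prior_j * N_j(node) / N_j(root).
    """
    if priors is None:
        weights = [float(n) for n in node_counts]
    else:
        weights = [
            prior * n_node / n_root
            for prior, n_node, n_root in zip(priors, node_counts, root_counts)
        ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-class node drawn from a skewed root (900 vs. 100 cases).
root = [900, 100]
node = [45, 5]
pure = gini_from_probs(node_probs([50, 0]))
uniform = gini_from_probs(node_probs([25, 25]))
data_gini = gini_from_probs(node_probs(node))
equal_priors_gini = gini_from_probs(node_probs(node, root, [0.5, 0.5]))
print(pure, uniform, round(data_gini, 4), round(equal_priors_gini, 4))
```

The node looks fairly pure on raw counts, but under equal priors it has the same class mix as the root and so is maximally diverse; this is how priors EQUAL keeps the rare class in play when the class distribution is skewed.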
CART Modeling Strategies Slide 50
One new measure of tree
performance: “Rho-squared”
• Although the growing process is improved
with equal priors, the practical evaluation of
the tree requires using data priors
– Actual node distributions, not priors adjusted
• We therefore compute unadjusted Gini for
entire tree and compare this with the Gini
of the root
• Provides a measure of the improvement
due to splitting
CART Modeling Strategies Slide 51
“Rho-squared”; continued
• Formal definition of Rho-squared
Rho-squared = 1 - Gini(tree)/Gini(root)
– If Gini(tree)=Gini(root) we have no improvement
and rho-squared=0
– If Gini(tree)=0, meaning all terminal nodes are
perfectly pure, then rho-squared=1
– Thus, rho-squared measures how the gap from
Gini(root) to a Gini of 0 is closed by the model
• Can be used to compare competing tree
models
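Rho-squared can be computed from the class counts in the terminal nodes alone. The sketch assumes Gini(tree) is the case-weighted average of the terminal-node Gini values, computed from actual (unadjusted) counts; the slides do not spell out the weighting, so treat that as an assumption. The counts are hypothetical.

```python
def gini(counts):
    """Unadjusted Gini diversity of one node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def rho_squared(terminal_nodes):
    """1 - Gini(tree)/Gini(root), using actual (unadjusted) class counts.

    Assumption: Gini(tree) is the case-weighted average of terminal-node Gini.
    """
    n_classes = len(terminal_nodes[0])
    # The root distribution is the sum of the terminal-node distributions.
    root = [sum(node[k] for node in terminal_nodes) for k in range(n_classes)]
    n_total = sum(root)
    tree_gini = sum(sum(node) / n_total * gini(node) for node in terminal_nodes)
    return 1.0 - tree_gini / gini(root)

no_gain = rho_squared([[30, 10], [30, 10]])   # children mirror the root
perfect = rho_squared([[60, 0], [0, 20]])     # every terminal node is pure
partial = rho_squared([[50, 5], [10, 15]])
print(no_gain, perfect, round(partial, 4))
```

The two boundary cases reproduce the values stated above (0 for no improvement, 1 for perfectly pure terminal nodes), and any real tree falls in between.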
CART Modeling Strategies Slide 52
Second new measure
compares learn vs. test class
distribution
in terminal nodes
• Every classification tree generates a distribution
of the dependent variable in each terminal node
• This learn data distribution can be compared with
the distribution observed in other data:
– The test data used to calibrate relative error rates
and select the optimal tree
– A test data set independent of both learn and test
data used in the tree modeling
– Data from other sources that are not necessarily
expected to be similar to the tree under study
• Might also want to compare the test data with
external data
CART Modeling Strategies Slide 53
Performance comparisons
can be summarized in
a chi-square statistic
– If there are K classes then each terminal node
contributes a chi-square statistic with K-1 df
– With T terminal nodes the overall statistic for the
tree has T*(K-1) degrees of freedom
– Can decompose the statistic by node or by class
– Useful when the statistic is large to determine
source of large deviations
Are we fitting badly in a specific subtree?
Are the deviations concentrated in one class?
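One possible form of this statistic is sketched below: within each terminal node, the test counts are compared with the counts expected under the learn-sample class proportions, and the Pearson contributions are kept by node and by class so the total can be decomposed. The exact form of the statistic is an assumption, as are the hypothetical counts; classes with zero learn count in a node are simply skipped.

```python
def node_chi_square(learn_counts, test_counts):
    """Pearson contributions of test counts against learn proportions in one node."""
    n_learn, n_test = sum(learn_counts), sum(test_counts)
    per_class = []
    for learn_k, test_k in zip(learn_counts, test_counts):
        expected = n_test * learn_k / n_learn
        # Classes absent from the learn node have zero expectation; skip them here.
        per_class.append((test_k - expected) ** 2 / expected if expected > 0 else 0.0)
    return per_class

def tree_chi_square(learn_nodes, test_nodes):
    """Per-node, per-class contributions plus the overall statistic and its df."""
    table = [node_chi_square(l, t) for l, t in zip(learn_nodes, test_nodes)]
    n_classes = len(learn_nodes[0])
    total = sum(sum(row) for row in table)
    df = len(learn_nodes) * (n_classes - 1)
    return table, total, df

# Hypothetical class counts in T=2 terminal nodes for K=2 classes.
learn_nodes = [[80, 20], [30, 70]]
test_nodes = [[40, 10], [25, 25]]
table, total, df = tree_chi_square(learn_nodes, test_nodes)
by_node = [sum(row) for row in table]
by_class = [sum(row[k] for row in table) for k in range(2)]
print("total:", round(total, 3), "df:", df)
print("by node:", [round(v, 3) for v in by_node])
print("by class:", [round(v, 3) for v in by_class])
```

Here the whole deviation comes from the second node, which is the kind of localization the decomposition is meant to reveal.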
CART Modeling Strategies Slide 54
Class Probability Trees
• Technically, project Oracle uses class probability
trees for forecasts and simulation
• Class probability trees use the same GINI method
for growing
• Uses GINI for pruning trees as well
• Nevertheless, we used classification trees
throughout and interpreted the results as class
probability trees
• Several reasons for this approach
– Classification trees produce misclassification
reports
– Can be guided by variable cost of misclassification
– Class probability trees sometimes much smaller
than classification trees
CART Modeling Strategies Slide 55
Class Probability Trees;
continued
• Main problem with class probability trees
– Pruning based on equal priors
– Want pruning based on data priors, not yet possible
in CART
• Hence, use of classification tree to allow
judgmental pruning
• Nonetheless, looking at class probability tree
sizes can be used to bound right sized tree
• Would be desirable to modify CART to allow
different priors in growing and pruning

More Related Content

What's hot

Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
Lippo Group Digital
 
Decision tree
Decision treeDecision tree
CART Classification and Regression Trees Experienced User Guide

  • 1. CART Modeling Strategies Slide 1 CART Modeling Strategies For Experienced Data Analysts CART Modeling Strategies For Experienced Data Analysts • CART takes a significant step towards automated data analysis – One of CART’s predecessors was called AAutomatic IInteraction DDetector (AIDAID) • Nevertheless, high quality CART results require careful planning & expert guidance • No realistic prospect that CART analyses or any other sophisticated modeling can be automated in the near term
  • 2. CART Modeling Strategies Slide 2 All Data analysis, regardless of methods employed, have certain prerequisites All Data analysis, regardless of methods employed, have certain prerequisites • Complete understanding of the data available – Correct variable definitions – Sample sources and relationship to study population – Review of conventional summary statistics, percentiles – Standard reports that would be generated in the process of data integrity checks – Calculations verified: check that totals can be generated from components – Consistency checks: related fields do not conflict
  • 3. Careful data preparation
      • CART is far better suited to dirty data analysis than conventional statistical modeling or NN tools
          – capable of dealing with missing values, outliers
      • Nevertheless, considerable benefits to proper data preparation
          – the better the data the better a model can perform
      • Includes
          – correct identification of missing value codes (998 valid or .)
          – uniform data handling when records come from different entities (branches, regions, behavioral groups)
          – if responder data is processed separately from and differently than non-responder data, completely erroneous results will be produced
  • 4. Some core preparatory steps
      • Identify illegal variables to be excluded from all models
          – ID variables
          – post event variables
          – variables unlikely to be available in future, or against which CART model is intended to compete (e.g. Bankruptcy scores)
          – variables disallowed by regulators (banking, insurance)
          – variables derived in part from dependent variables, or generated from target variable behavior
          – variables too closely connected to target for any reason
  • 5. Exploratory Data Analysis with CART: Pre-modeling
      • Run a single split tree and report all competitors
          – ranks ability of all variables to separate target variable into homogeneous groups
          – command settings: LIMIT DEPTH=1, ERROR EXPLORE, BOPTIONS COMPETITORS=large number
      • Run limited depth trees for target using one predictor at a time (again exploratory, non-tested trees)
          – LIMIT DEPTH=2 (up to 4 nodes) or LIMIT DEPTH=3 (up to 8 nodes) (actual number depends on redundant node pruning)
          – provides optimal binning of variables
          – binned versions could be used in parametric models
  • 6. The CART Non-linear Correlation Matrix
      • Run CART models using every pair of legal variables
          – should be unlimited depth
          – could be tested or exploratory
          – will detect non-linear dependencies
      • Results will be asymmetric
          – results can be used to fill out a correlation matrix
      • Alternate Procedure
          – run simple regressions using all pairs of variables
          – use CART to predict residuals
          – correlation determined by both linear and CART components
  • 7. Example Pearson and CART correlation Matrices
      • From Kerry
  • 8. CART Affiliation Matrices
      • Select a group of interesting variables
      • Let each variable in turn be the target variable, all others in group are predictors
      • Grow standard trees (not depth limited) with test procedure to prune
      • Each column in matrix is a target variable
      • Rows are filled with importance scores (scaled to 0,1)
      • Provides a picture of variable interdependencies
      • Can highlight surprise relationships between predictors
          – can help in detecting data errors
          – when affiliations are stronger or weaker than expected
  • 9. Detection of multivariate outliers
      • Grow CART tree for every variable as predicted by a trimmed down variable list
      • Predict each variable in turn from all other variables
      • Restrict trees to moderate to large terminal nodes
          – use ATOM or MINCHILD controls
      • For regression: measure deviation of each data point from predicted
      • For classification: check if class value of data point is rare in predicted terminal node
      • Use results to investigate unusual observations
  • 10. Once data QC is complete, serious CART modeling can begin
      • Need to understand nature of problem:
          – what would be the appropriate statistical models to use for the problem at hand
          – e.g. is the problem a simple binary outcome (respond or not to a direct mail piece)
          – alternatively, does it have an inherent time dimension (how long will a customer remain a customer, as in telecommunications churn); the latter problem involves censored data
          – is the study of a fundamentally time series or panel data type; if so, need to allow for lagged variables, etc.
  • 11. CART cannot protect you from using an improper analysis strategy
      • CART will help you execute your analysis strategy more quickly and often more accurately
      • If the modeling strategy you have selected will produce biased results CART may just exacerbate the problem
      • A definitive modeling approach is not required, but a defensible approach is
  • 12. Example: Targeting model for a catalog to maximize profit
      • Sensible to model in stages
          – 1) yes/no response model: use classification tree
          – 2) Dollar volume of order for those who do respond: modeled conditional on response=yes; modeled just on subset of responders; regression tree plausible, or classification tree on binned order amounts
          – Final model could be an expected profit model: prob(respond)*Expected(Revenue|Respond); model could be all CART, all logit, or a mixture; such models discussed later
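The staged scoring on this slide can be sketched in a few lines. This is only an illustration, not the deck's implementation. The two inputs stand in for the outputs of the response model and the conditional revenue model, and the per-mailing cost, the customer records, and all names (`expected_profit`, `COST`, `mail_list`) are assumptions introduced here.

```python
def expected_profit(p_respond, revenue_if_respond, cost_per_mailing):
    """prob(respond) * Expected(Revenue | Respond), less an assumed mailing cost."""
    return p_respond * revenue_if_respond - cost_per_mailing


# Hypothetical scores: stage 1 gives p_respond, stage 2 gives revenue_if_respond.
customers = [
    {"id": "A", "p_respond": 0.10, "revenue_if_respond": 80.0},
    {"id": "B", "p_respond": 0.02, "revenue_if_respond": 200.0},
    {"id": "C", "p_respond": 0.01, "revenue_if_respond": 50.0},
]
COST = 1.0  # assumed cost of mailing one catalog

# Mail only customers whose expected profit is positive.
mail_list = [
    c["id"]
    for c in customers
    if expected_profit(c["p_respond"], c["revenue_if_respond"], COST) > 0
]
print(mail_list)
```

Either stage could come from a CART tree or a logit model; the combination step is the same.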
  • 13. Modeling strategy will also dictate test strategy
      • Suppose we are tracking purchase behavior over time
      • Data organized as one record per purchase opportunity
      • The unit of observation will be a complete case history
          – ideally will want to assign some complete case histories to training data
          – other entire case histories to test data
          – important not to allow random assignment between train and test on a record by record basis
          – might want to hold back some records from longer case histories as an additional source of test data
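A minimal sketch of assigning whole case histories, rather than individual records, to train or test. The record layout, the `customer` key, the 20% test share, and the helper name `split_by_case` are illustrative assumptions, not part of the deck.

```python
import random


def split_by_case(records, key, test_fraction=0.2, seed=0):
    """Send every record of a case history to the same side of the split."""
    case_ids = sorted({r[key] for r in records})
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    rng.shuffle(case_ids)
    n_test = max(1, round(len(case_ids) * test_fraction))
    test_ids = set(case_ids[:n_test])
    train = [r for r in records if r[key] not in test_ids]
    test = [r for r in records if r[key] in test_ids]
    return train, test


# Hypothetical data: 10 customers, 3 purchase opportunities each.
records = [{"customer": c, "opportunity": i} for c in range(10) for i in range(3)]
train, test = split_by_case(records, key="customer")

train_customers = {r["customer"] for r in train}
test_customers = {r["customer"] for r in test}
print(len(train), len(test), train_customers & test_customers)
```

Because the shuffle is over case identifiers, no customer contributes records to both samples, which is the property the slide asks for.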
  • 14. Initial CART analyses are strictly exploratory
      • Intended to reveal summary and descriptive information about the data
      • Omnibus Model: dependent variable(s) fit to virtually all legal variables
          – Certain obvious exclusions necessary: ID numbers, clones and transforms of the dependent variable as discussed above
          – Omnibus Model reveals something about the predictability of the dependent variable
          – recall that largest tree has error no more than twice Bayes rate
  • 15. Determine Splitting Rule to Use
      • Gini, Twoing, power modified Twoing for classification
          – possibly ordered twoing
      • Least squares (LS) or Least Absolute Deviation (LAD) for regression
      • Best splitting rule can be selected very early in project and typically does not have to be revisited
  • 16. Assess agreement among different test methods
      • If data set is small cross validation is required
      • In this case rerun trees several times with different starting random number seeds
          – use to assess stability of size and error rate of best trees
      • With large data sets reassign cases between learn and test several times
          – initial check is on error rates and sizes of best trees
  • 17. Run all as batch of startup CART trees
      • Using three or four splitting rules, and three or four test sets, will give some initial feel for predictability of target variable
      • Useful to develop some text processing scripts to extract the most interesting components of the classic CART reports
          – tree sequence
          – misclassification results (which classes are wrong)
          – prediction success table
          – importance rankings; the latter can be aggregated as follows: add up all importance scores for each variable across all trees, then rescale so that highest score is 100
      • LOPTION NOPRINT gives summary tables only
          – no tree detail; very helpful when trees tend to be
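The importance aggregation described on this slide (sum each variable's scores across all trees, then rescale so the top score is 100) might look like the sketch below. The variable names and scores are made up for illustration. Treating a variable that is absent from a run as contributing zero is an assumption.

```python
def aggregate_importance(runs):
    """Sum importance scores per variable across runs; rescale so the max is 100."""
    totals = {}
    for scores in runs:
        for var, score in scores.items():
            totals[var] = totals.get(var, 0.0) + score
    if not totals:
        return {}
    top = max(totals.values())
    if top == 0:
        return {var: 0.0 for var in totals}
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {var: 100.0 * total / top for var, total in ranked}


# Hypothetical importance tables from three startup trees.
runs = [
    {"INCOME": 100.0, "AGE": 60.0, "REGION": 10.0},
    {"INCOME": 80.0, "AGE": 100.0},
    {"INCOME": 100.0, "AGE": 40.0, "REGION": 30.0},
]
combined = aggregate_importance(runs)
for var, score in combined.items():
    print(f"{var:8s} {score:6.1f}")
```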
  • 18. Derived variables almost certainly need to be created
      • Almost impossible to develop high performance models without analyst creation of derived variables
      • Many derived variables are “obvious” to domain specialists
          – to predict purchase amounts look at customer lifetime totals
          – possibly aggregate previous purchases into category subtotals
          – calculate trend; have orders been increasing or decreasing over time?
      • Consider standard statistical summaries of groups of variables:
          – mean, standard deviation, min, max, trend
  • 19. Use linear combination splits to search for new derived variables
      • Linear combinations found by CART can suggest new derived variables
      • Recommend that the delete option be set high and that the required sample size also be substantial
      • LINEAR N=1000 DELETE=.4
          – permits linear combination splits only in nodes with more than 1,000 cases
          – the higher the DELETE parameter the fewer terms in the combination
      • E.g.
  • 20. Results of first models are used to generate the first cut back list of predictors
      • List is determined through a combination of judgment and perusal of initial CART runs
      • Purpose is error avoidance, exclusion of nuisance, pernicious and not believable variables
      • Variables that seem odd in the context, and thus probably should not have predictive value, are also excluded
          – Important not to exclude any variables that prior knowledge, conventional wisdom would include
          – Purpose of this stage is not radical pruning but elimination of valueless variables
  • 21. Can be useful to explore trees for selected predictor variables or other variables of interest
      • Can think of the CART tree as an extended non-parametric version of correlation analysis
      • Results simply reveal what variables are in some way associated in the data
      • Could construct a table of variables in the columns against variables that predict in the rows
  • 22. Same procedure could be used to impute values for missing data points
      • Actual procedure is complex and will be discussed in another context
      • Our proposed missing value imputation procedure is iterative
      • Also might start selecting complexity values that restrain growth of trees to reasonable sizes
          – A large data set might allow trees with many hundreds of terminal nodes
          – Yet optimal models might fall into the 20-100 terminal node size
  • 23. Next set of models should explore the impact of alternative splitting and testing rules
      • Useful to look at GINI, TWOING, and TWOING POWER=1
      • Useful to compare external test data with cross-validation in smaller data sets
      • These runs may suggest which splitting rules are most promising for further work
      • In most problems the default GINI is the best rule to use
          – Definitively better than ENTROPY, often slightly better than TWOING
  • 24. Impact of alternative splitting and testing rules; continued
      • In some problems, usually problems with poor predictability, TWOING, POWER=1 works well
          – e.g. Relative error in best GINI tree is .8 or higher
          – In these cases, the more balanced splitting strategy seems to yield better trees
  • 25. Also want to compare results from different test procedures
      • Compare runs with different subsets of test data randomly chosen from larger data sets
      • e.g., Create two uniform random variables
          – %LET TEST20A=urn <0.20
          – %LET TEST20B=urn >0.20
          – Use TEST20A to pick out test sample in one run and use TEST20B in another run
  • 26. We hope results will be very similar across test sets
      • Approximate size of optimal tree
      • Approximate relative error
      • Importance ranking of variables: which variables appear near top of list
      • Reasonable overlap of primary splitters in trees
  • 27. Instability of results across test data sets is a warning sign
      • May need to carefully review interdependencies of predictor variables
      • Results may be due to a set of closely competing predictors with different information content
      • If so, will want to consider whether one or more of these competitors should be dropped
      • In this case, a judgment is made concerning variables to exclude from the model
      • Results may be unstable due to inherent variance of the tree predictor
      • In this case, will ultimately want to consider aggregation of experts discussed below
  • 28. Experiments with Linear Combination Splits
      • Linear combinations are occasionally instructive
      • Not useful when many variables are involved
      • We recommend restriction to 2-variable linear combinations
      • Helpful if there are strictly positive variables transformed to logs
          – 2-variable linear combination might reveal a form like c1*log(X1) - c2*log(X2), which is a ratio of the predictors
  • 29. Reading CART results
      • Useful to prepare a series of summary reports after CART runs are done
      • One report should just include the TREE SEQUENCE
          – Reveals the size of the optimal tree, relative error rate
          – Can be used to reject certain runs: too large, too small, too inaccurate
      • Another report extracts just the split variables:
          – Contains a listing of the node split variables
          – Provides a brief outline of how the tree evolved
  • 30. Reports are used to select trees that appear to be promising
      • It is possible that no promising trees are found in the early rounds of analysis
      • Attractive trees need to be printed to facilitate absorption of the implicit model
  • 31. Currently we use allCLEAR to print
      • Future CART will include its own pretty print but will still support allCLEAR
      • We request the “splits” level of detail in the output
          – Includes split variable, split value, class assignment
          – Table of class distribution in the node might be too voluminous
  • 32. Trees need to be read for the story they tell and assessed for plausibility
      • Particularly at the higher levels of the tree (lower levels might disappear with pruning)
      • Does the predictive model agree with intuition and prior expectations?
  • 33. When troubling patterns emerge, need to look at the competitors of a node
      • Reveals what other variable would be used to split the node if the main splitter were not available
      • If the competitor is more acceptable than the primary in a node, can consider dropping the primary
      • Method will only work if analyst is willing to exclude the variable from anywhere in the tree
      • On the basis of these reports and prints can determine candidate second round models
  • 34. Now can move on to tools for model refinement
      • Selection of right-sized trees based on judgment
      • Altering costs of misclassification
      • Creation of new variables
  • 35. Judgmental Pruning of Trees: A necessary step in model development
      • When the CART monograph was published in 1984 the authors suggested that the best tree was the “one-se-rule tree”
      • This is the smallest tree within one standard error of the minimum cost tree
      • The reasoning was: all trees within a one standard error band are statistically indistinguishable, and small trees are inherently more comprehensible and preferable
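The one-se rule stated on this slide can be sketched as a selection over a pruned tree sequence. The tuple layout `(terminal_nodes, cost, standard_error)` and the numbers below are hypothetical and are not read from any CART report. The sketch takes the standard error from the minimum cost tree, which is the usual reading of the rule.

```python
def one_se_tree(sequence):
    """Return the smallest tree whose cost is within one SE of the minimum cost.

    sequence: list of (terminal_nodes, cost, standard_error) tuples.
    """
    best = min(sequence, key=lambda t: t[1])      # minimum cost tree
    threshold = best[1] + best[2]                 # min cost + its standard error
    eligible = [t for t in sequence if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])      # smallest eligible tree


# Hypothetical tree sequence: (terminal nodes, test cost, standard error).
sequence = [
    (2, 0.62, 0.020),
    (5, 0.45, 0.018),
    (9, 0.39, 0.017),
    (14, 0.37, 0.016),
    (25, 0.36, 0.016),
    (40, 0.38, 0.017),
]
min_cost_tree = min(sequence, key=lambda t: t[1])
chosen = one_se_tree(sequence)
print("minimum cost tree:", min_cost_tree[0], "nodes")
print("one-se-rule tree:", chosen[0], "nodes")
```

In this made-up sequence the minimum cost tree has 25 terminal nodes, while the 14-node tree falls inside the one standard error band and is the one the rule selects.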
  • 36. Judgmental Pruning of Trees: continued
      • The current view of the CART originators is that one should accept the literal minimum cost tree produced by CART
      • This view is based on a further dozen years of experience which has revealed that the “one-se-rule” may be too conservative
      • Nonetheless, compelling reasons exist to prefer smaller trees in data-mining investigations
  • 37. In data-mining exercises trees can easily grow to unmanageable depths
      • With the prodigious volumes of warehoused data, greedy analysis tools can develop complex models without restraint
      • Paradoxically, the large quantities of data can serve to mislead
      • The problem is similar to that noted by statisticians who first analyzed large national probability sample databases: in regression, t-test, and chi-square tests, almost every estimated coefficient is “significantly” different from zero, and every null is rejected
      • In the tree-growing context, elaborate trees of great depth appear to perform extremely well even on independent hold-out samples
  • 38. A way to “discount” findings based on very large data sets is needed
      • The solution in the conventional modeling context has been to adjust the significance level required before placing too much faith in a finding
      • For example, a t-statistic of 2.2 for a regression coefficient based on 30 degrees of freedom should be considered more compelling than the same t-statistic based on 100,000 degrees of freedom
      • In the CART context it would be useful to have optimal tree size selection criteria that adapted to the volume of data available
  • 39. Three tools for adjusting an analysis to data richness are available in CART
      • The ATOM or minimum node size available for splitting: as the data set size increases, ATOM size can also be increased (perhaps with the log of sample size)
          – The thinking is: as data sets increase in size, require the amount of data needed to support a split to increase also
  • 40. Three tools for adjusting an analysis; continued
      • The minimum child size can also be adjusted. MINCHILD prevents CART from splitting off nodes too small to support separate analysis
          – For example, we might not want to attempt inferring the probability of prepay in any node containing fewer than 100 observations
          – MINCHILD and ATOM are closely related but are different concepts. MINCHILD guarantees that no terminal node will ever be smaller than its predetermined value. ATOM determines the minimum size of a node that is eligible to be split. ATOM must always be at least 2*MINCHILD so that if the smallest node eligible for splitting is split into two equal parts, each part will be at least as large as MINCHILD.
      • Trees other than the “optimal” tree can be PICKED from the tree sequence
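The relationship between the two controls can be made concrete with a small sketch. The function names are hypothetical and this is not CART's own code. Treating a node of exactly ATOM cases as eligible for splitting is an assumption about the boundary.

```python
def check_node_controls(atom, minchild):
    """Enforce the constraint from the slide: ATOM must be at least 2*MINCHILD."""
    if minchild < 1:
        raise ValueError("MINCHILD must be at least 1")
    if atom < 2 * minchild:
        raise ValueError(f"ATOM={atom} is smaller than 2*MINCHILD={2 * minchild}")
    return True


def split_is_allowed(left_size, right_size, atom, minchild):
    """A split needs a parent of at least ATOM cases and children of at least MINCHILD."""
    parent_size = left_size + right_size
    return parent_size >= atom and min(left_size, right_size) >= minchild


check_node_controls(atom=200, minchild=100)
print(split_is_allowed(100, 100, atom=200, minchild=100))  # equal halves of the smallest eligible node
print(split_is_allowed(150, 90, atom=200, minchild=100))   # one child falls below MINCHILD
print(split_is_allowed(95, 95, atom=200, minchild=100))    # parent falls below ATOM
```

The first call shows why the factor of two matters: the smallest node eligible for splitting, cut into equal halves, still yields children that meet MINCHILD.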
  • 41. The third tool is selection of a tree from the CART sequence
      • Analyst intervention in tree selection is both desirable and unavoidable
      • Allows the incorporation of prior knowledge and domain expertise
      • This type of selection is really just pruning: the analyst decides to prune back further than the CART algorithms recommend
      • Topic is mentioned briefly in the CART monograph where the authors discuss their decision to eliminate one or two nodes near the bottom of a medical diagnosis tree:
          – MDs running the study did not believe that these lower level splits captured the underlying biology
      • This is similar to a statistician deciding to exclude a borderline significant interaction in a regression
  • 42. In the data-mining context, tree selection can be guided by the relative error plot
      • Each CART run produces a plot of relative error against number of nodes and the relative error is printed on the TREE SEQUENCE report
      • In data mining these plots have a characteristic shape: steep declines in the relative error as tree initially evolves followed by lengthy flat portions in which further error reduction is extremely small with each additional node
      • Further, the test data support the hypothesis that many of these error reductions are “statistically significant.” In the CART context the claim is that the more complex larger trees will predict well on fresh data and thus contain valuable information.
  • 43. An analyst could defensibly decide to trade off a large block of nodes for a small “increase” in prediction error
      • In one of our CART models the “optimal” tree had 100 terminal nodes and a relative error of 0.333968 +/- 0.00578
      • Yet the sub-tree with 63 terminal nodes has a relative error of only 0.34339, a one-point apparent loss in accuracy
      • And 29 terminal nodes yield a relative error of 0.38564
  • 44. Final tree selection based on the relative error plot alone
      • In many applications it will be difficult to make a final tree selection based on the relative error plot alone
      • The plot reveals many opportunities for selection, but rarely serves to single out a best tree
      • In some problems it is possible to find the tree that exhausts all substantial improvements and that separates a steeply sloping section from a flat plateau
  • 45. The next step of tree assessment
      • Careful review of a relatively large tree chosen by CART
      • Examination of a large tree node-by-node will be very instructive
      • We are assuming that the early splits of the tree have already been examined and found to be convincing and acceptable
  • 46. Review of a relatively large tree chosen by CART
      • Purpose of this stage of review is to consider the lower branches:
          – Do any of the splits appear fortuitous or not particularly believable?
          – Are the same variables being used repeatedly to minutely subdivide a predictor?
          – Is it worth pursuing additional refinement of the sub-sample reached at a particular juncture in the tree?
          – Is there any concern for whatever reason that the splits are not reasonable representations of reality?
  • 47. Additional Considerations
      • The tree that results when questionable or low value sections of the CART optimal tree are dropped should be considered
          – Unfortunately, there appears to be no substitute for the careful and detailed examination of the CART tree node-by-node
          – However, the only contribution of judgment here is to eliminate nodes that are thought to be the result of over-fitting
  • 48. Goodness-Of-Fit Measures for Classification Trees in Classic CART
      • CART classification trees automatically generate diagnostic reports
          – Relative Error Rate for all trees in pruned sequence
          – Misclassification Rate By Class for Learn and Test data
          – Misclassification Table: Actual vs. Predicted Class
      • CART class probability trees display only the relative error sequence
      • Although these reports are helpful in sorting out the most promising trees early on in CART analyses, they contain far less information than needed for proper model assessment
  • 49. Characteristics of the CART GINI Measure
      • Measure is zero whenever a node is pure
      • Most CART trees are grown and pruned using the Gini measure of within node diversity
      • Gini is largest when distribution of classes in a node is uniform
      • CART trees usually grown with priors EQUAL
          – Essential to encourage promising tree evolution when class distribution is skewed
          – Practical impact is to make CART strive for roughly equal accuracy in all classes
          – Priors DATA and priors MIX rarely work well
      • CART Gini measure will then be priors adjusted: $i(t) = 1 - \sum_i p_i^2$
  • 50. One new measure of tree performance: “Rho-squared”
      • Although the growing process is improved with equal priors, the practical evaluation of the tree requires using data priors
          – Actual node distributions, not priors adjusted
      • We therefore compute unadjusted Gini for entire tree and compare this with the Gini of the root
      • Provides a measure of the improvement due to splitting
  • 51. “Rho-squared”; continued
      • Formal definition of Rho-squared: Rho-squared = 1 - Gini(tree)/Gini(root)
          – If Gini(tree)=Gini(root) we have no improvement and rho-squared=0
          – If Gini(tree)=0, meaning all terminal nodes are perfectly pure, then rho-squared=1
          – Thus, rho-squared measures how the gap from Gini(root) to a Gini of 0 is closed by the model
      • Can be used to compare competing tree models
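A minimal sketch of the rho-squared calculation from raw terminal node class counts. The slides do not spell out how Gini(tree) is formed from the nodes. This sketch assumes it is the average of the terminal node Gini values weighted by each node's share of the cases, which is the conventional unadjusted (data priors) choice. The function names and counts are illustrative.

```python
def gini(counts):
    """Unadjusted Gini diversity of one node: 1 - sum of squared class shares."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def tree_gini(terminal_nodes):
    """Assumed definition: node Gini values weighted by node share of all cases."""
    total = sum(sum(node) for node in terminal_nodes)
    return sum(sum(node) / total * gini(node) for node in terminal_nodes)


def rho_squared(terminal_nodes):
    """Rho-squared = 1 - Gini(tree) / Gini(root)."""
    root = [sum(col) for col in zip(*terminal_nodes)]  # class totals at the root
    g_root = gini(root)
    if g_root == 0:
        raise ValueError("root node is already pure; rho-squared is undefined")
    return 1.0 - tree_gini(terminal_nodes) / g_root


# Each inner list holds class counts [class 0, class 1] in one terminal node.
print(rho_squared([[25, 25], [25, 25]]))   # splits change nothing
print(rho_squared([[50, 0], [0, 50]]))     # all terminal nodes pure
print(rho_squared([[40, 10], [10, 40]]))   # partial improvement
```

The first two calls reproduce the boundary cases listed on the slide (no improvement and perfectly pure nodes); the third shows an intermediate tree.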
  • 52. Second new measure compares learn vs. test class distribution in terminal nodes
      • Every classification tree generates a distribution of the dependent variable in each terminal node
      • This learn data distribution can be compared with the distribution observed in other data:
          – The test data used to calibrate relative error rates and select the optimal tree
          – A test data set independent of both learn and test data used in the tree modeling
          – Data from other sources that are not necessarily expected to be similar to the tree under study
      • Might also want to compare the test data with external data
  • 53. Performance comparisons can be summarized in a chi-square statistic
          – If there are K classes then each terminal node contributes a chi-square statistic with K-1 df
          – With T terminal nodes the overall statistic for the tree has T*(K-1) degrees of freedom
          – Can decompose the statistic by node or by class
          – Useful when the statistic is large to determine source of large deviations: Are we fitting badly in a specific subtree? Are the deviations concentrated in one class?
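One way the node-by-node comparison could be computed is sketched below. The slides give only the degrees of freedom, not the formula. This sketch assumes a Pearson goodness-of-fit statistic in which the learn class shares of each terminal node supply the expected test counts. Cells with zero expected count are skipped, which is a simplification that understates the deviation. All names and counts are hypothetical.

```python
def node_chi_square(learn_counts, test_counts):
    """Pearson statistic for one terminal node: test counts vs learn shares."""
    n_learn = sum(learn_counts)
    n_test = sum(test_counts)
    stat = 0.0
    for learn, observed in zip(learn_counts, test_counts):
        expected = n_test * learn / n_learn
        if expected > 0:               # simplification: skip empty learn cells
            stat += (observed - expected) ** 2 / expected
    return stat


def tree_chi_square(learn_nodes, test_nodes):
    """Return (overall statistic, degrees of freedom T*(K-1), per-node parts)."""
    per_node = [node_chi_square(l, t) for l, t in zip(learn_nodes, test_nodes)]
    n_classes = len(learn_nodes[0])
    df = len(learn_nodes) * (n_classes - 1)
    return sum(per_node), df, per_node


# Hypothetical class counts [class 0, class 1] in two terminal nodes.
learn_nodes = [[80, 20], [30, 70]]
test_nodes = [[40, 10], [20, 30]]
total, df, per_node = tree_chi_square(learn_nodes, test_nodes)
print(f"chi-square = {total:.3f} on {df} df; by node: {per_node}")
```

Keeping the per-node contributions makes it possible to see whether a large overall statistic comes from one subtree, as the slide suggests; the same loop could be regrouped to decompose by class.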
  • 54. Class Probability Trees
      • Technically, project Oracle uses class probability trees for forecasts and simulation
      • Class probability trees use the same GINI method for growing
      • Uses GINI for pruning trees as well
      • Nevertheless, we used classification trees throughout and interpreted the results as class probability trees
      • Several reasons for this approach
          – Classification trees produce misclassification reports
          – Can be guided by variable cost of misclassification
          – Class probability trees sometimes much smaller than classification trees
  • 55. Class Probability Trees; continued
      • Main problem with class probability trees
          – Pruning based on equal priors
          – Want pruning based on data priors, not yet possible in CART
      • Hence, use of classification tree to allow judgmental pruning
      • Nonetheless, looking at class probability tree sizes can be used to bound right sized tree
      • Would be desirable to modify CART to allow different priors in growing and pruning