CART
Not only Classification and Regression Trees
Marc Garcia
PyData Amsterdam - March 12th, 2016
Introduction
Talk goals
How decision trees work
Common problems and advantages
What they can be used for
Focus on classification
About me
@datapythonista - http://datapythonista.github.io/
Python user since 2006
Active Django developer, 2007 - 2012
GSoC 2009: Django localization
Master in Artificial Intelligence, 2012
Master in Finance, 2014
Currently working at Bank of America Merrill Lynch in London
Owner of Quantitative Mining
Machine learning applied to digital marketing
Warm up example
Example: The geek party
We want to predict attendance to the geek party
We know for each possible attendee:
Their age
The distance from their home to the event location
Data set
import pandas as pd
data = {'age': [38, 49, 27, 19, 54, 29, 19, 42, 34, 64,
                19, 62, 27, 77, 55, 41, 56, 32, 59, 35],
        'distance': [6169.98, 7598.87, 3276.07, 1570.43, 951.76,
                     139.97, 4476.89, 8958.77, 1336.44, 6138.85,
                     2298.68, 1167.92, 676.30, 736.85, 1326.52,
                     712.13, 3083.07, 1382.64, 2267.55, 2844.18],
        'attended': [False, False, False, True, True, True, False,
                     True, True, True, False, True, True, True,
                     False, True, True, True, True, False]}
df = pd.DataFrame(data)
age distance attended
0 38 6169.98 False
1 49 7598.87 False
2 27 3276.07 False
3 19 1570.43 True
4 54 951.76 True
5 29 139.97 True
6 19 4476.89 False
7 42 8958.77 True
8 34 1336.44 True
9 64 6138.85 True
10 19 2298.68 False
11 62 1167.92 True
Data set visualization
from bokeh.plotting import figure, show
p = figure(title='Event attendance')
p.xaxis.axis_label = 'Distance'
p.yaxis.axis_label = 'Age'
p.circle(df[df.attended]['distance'],
         df[df.attended]['age'],
         color='red',
         legend='Attended',
         fill_alpha=0.2,
         size=10)
p.circle(df[~df.attended]['distance'],
         df[~df.attended]['age'],
         color='blue',
         legend="Didn't attend",
         fill_alpha=0.2,
         size=10)
show(p)
Using a linear model
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit(df[['age', 'distance']], df['attended'])

def get_y(x):
    return -(logit.intercept_[0] + logit.coef_[0, 1] * x) / logit.coef_[0, 0]

plot = base_plot()
min_x, max_x = df['distance'].min(), df['distance'].max()
plot.line(x=[min_x, max_x],
          y=[get_y(min_x), get_y(max_x)],
          line_color='black',
          line_width=2)
_ = show(plot)
Linear model
How is the model?
\[ \theta_{intercept} + \theta_{age} \cdot age + \theta_{distance} \cdot distance \geq 0 \]
Using a decision tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(df[['age', 'distance']], df['attended'])
cart_plot(dtree)
Decision tree
How is the model?
def decision_tree_model(age, distance):
if distance >= 2283.11:
if age >= 40.00:
if distance >= 6868.86:
if distance >= 8278.82:
return True
else:
return False
else:
return True
else:
return False
else:
if age >= 54.50:
if age >= 57.00:
return True
else:
return False
else:
return True
Properties
Overfitting
Decision trees will overfit
But they can be made to generalize using these parameters (see the sketch below):
min_samples_leaf
min_samples_split
max_depth
max_leaf_nodes
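A minimal sketch of how these parameters could be set in scikit-learn; the values are illustrative assumptions, not recommendations, and should be tuned (e.g. with cross-validation):

from sklearn.tree import DecisionTreeClassifier

# Illustrative values only: tune them on your own data
pruned_tree = DecisionTreeClassifier(min_samples_leaf=5,
                                     min_samples_split=10,
                                     max_depth=4,
                                     max_leaf_nodes=20)
pruned_tree.fit(df[['age', 'distance']], df['attended'])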
Performance
Compared to a plain logistic regression, decision trees can be:
Training: one order of magnitude slower
Prediction: one order of magnitude slower
This obviously depends on many factors (data size, tree depth, etc.)
Standardization / Normalization
Not required
Using the original units makes the tree easier to understand
Feature selection
We get feature selection for free
If a feature is not relevant, it is not used in the tree
sklearn gives you the feature importances:
>>> list(zip(['age', 'distance'],
...          dtree.feature_importances_))
[('age', 0.5844155844155845), ('distance', 0.41558441558441556)]
Feature extraction and data cleaning
We need to take care of it
Most of the time, better data improves results more than better models
We can capture the correlation between variables in new variables (e.g. PCA)
Remember that decision boundaries are always orthogonal to the axes
Categorical variables and missing values
Decision trees can deal with them
But this depends on the implementation
sklearn requires numeric (float) features and no missing values (see the sketch below)
Ordinal categorical variables are treated nicely
They have an intrinsic order, and get grouped with their neighbours
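A minimal sketch (an assumed example, not from the slides) of encoding a hypothetical ordinal column into integer codes before fitting a sklearn tree:

import pandas as pd

# Hypothetical ordinal feature: sklearn trees need numeric values,
# so we map the ordered categories to integer codes
size = pd.Categorical(['small', 'large', 'medium', 'small'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)
size_codes = size.codes  # array([0, 2, 1, 0], dtype=int8)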
Binning
We can add domain knowledge to the classifier (see the sketch below)
Avoids the bias towards highly-branching attributes
Two types:
Unify values. e.g. 20, 20, 30, 30, 30, 40, 50, 50
Binary variables. e.g. lives_in_the_city (yes/no)
Where to cut cyclic variables? e.g. hour of the day
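A minimal sketch of both types on the toy data set; the bin edges and the distance threshold are illustrative assumptions:

import pandas as pd

# Unify values: bin ages into coarse groups (edges are illustrative)
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=False)

# Binary variable from domain knowledge (threshold is illustrative)
df['lives_in_the_city'] = df['distance'] < 2000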
Unbalanced data
According to the literature, decision trees are biased towards the dominant class
In my experience, they behave better than linear models without any treatment
But classical techniques like weighting can be applied (see the sketch below)
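A minimal sketch of class weighting with scikit-learn, using the toy data set:

from sklearn.tree import DecisionTreeClassifier

# 'balanced' weights each class inversely to its frequency in the training data;
# an explicit mapping like {True: 1, False: 3} can also be passed
weighted_tree = DecisionTreeClassifier(class_weight='balanced')
weighted_tree.fit(df[['age', 'distance']], df['attended'])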
Model debugging
We can know why we got a prediction
Contrast with common sense and domain knowledge
Apply changes:
To the data (useful to other models)
To the model parameters
To the model itself
Stability
Small changes in the data can cause a big change in the model
Random Forests help prevent this
Training
Basic algorithm
def train_decision_tree(x, y):
feature, value = get_best_split(x, y)
x_left, y_left = x[x[feature] < value], y[x[feature] < value]
if len(y_left.unique()) > 1:
left_node = train_decision_tree(x_left, y_left)
else:
left_node = None
x_right, y_right = x[x[feature] >= value], y[x[feature] >= value]
if len(y_right.unique()) > 1:
right_node = train_decision_tree(x_right, y_right)
else:
right_node = None
return Node(feature, value, left_node, right_node)
Best split
Candidate split 1
age 18 19 21 27 29 34 38 42 49 54 62 64
attended F F T F T T F T F T T T
Split True False
Left 0 1
Right 7 4
Candidate split 2
age 18 19 21 27 29 34 38 42 49 54 62 64
attended F F T F T T F T F T T T
Split True False
Left 0 2
Right 7 3
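A rough worked check (not from the original slides) of why the second candidate is the better split; the entropy helper below anticipates the one defined a couple of slides later, and the numbers are approximate:

import math

def entropy(a, b):
    # Entropy of a node with a items of one class and b of the other;
    # pure nodes (a == 0 or b == 0) have an entropy of 0
    if a == 0 or b == 0:
        return 0.
    total = a + b
    prob_a, prob_b = a / total, b / total
    return -prob_a * math.log(prob_a, 2) - prob_b * math.log(prob_b, 2)

# Candidate split 1: left (0 True, 1 False), right (7 True, 4 False)
split_1 = (1 / 12) * entropy(0, 1) + (11 / 12) * entropy(7, 4)  # ~0.87

# Candidate split 2: left (0 True, 2 False), right (7 True, 3 False)
split_2 = (2 / 12) * entropy(0, 2) + (10 / 12) * entropy(7, 3)  # ~0.73

# Split 2 has the lower weighted entropy, so it is preferred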
Best split algorithm
def get_best_split(x, y):
    # class_a_value and class_b_value are the two class labels (True and False here)
    best_split = None
    best_entropy = 1.
    for feature in x.columns.values:
        column = x[feature]
        for value in column.unique():
            # Items of each class on the left side of the candidate split
            a = (y[column < value] == class_a_value).sum()
            b = (y[column < value] == class_b_value).sum()
            left_weight = (a + b) / len(y.index)
            left_entropy = entropy(a, b)
            # Items of each class on the right side of the candidate split
            a = (y[column >= value] == class_a_value).sum()
            b = (y[column >= value] == class_b_value).sum()
            right_weight = (a + b) / len(y.index)
            right_entropy = entropy(a, b)
            split_entropy = left_weight * left_entropy + right_weight * right_entropy
            if split_entropy < best_entropy:
                best_split = (feature, value)
                best_entropy = split_entropy
    return best_split
Entropy
For a given subset:
\[ \mathrm{entropy} = -\Pr_{attending} \cdot \log_2 \Pr_{attending} - \Pr_{\neg attending} \cdot \log_2 \Pr_{\neg attending} \]
Note that pure splits have an entropy of 0, taking \(0 \cdot \log_2 0 = 0\).
Entropy algorithm
import math

def entropy(a, b):
    # Pure nodes (a == 0 or b == 0) have an entropy of 0
    if a == 0 or b == 0:
        return 0.
    total = a + b
    prob_a = a / total
    prob_b = b / total
    return (- prob_a * math.log(prob_a, 2)
            - prob_b * math.log(prob_b, 2))
Information gain
For a given split:
\[ \mathrm{information\_gain} = \mathrm{entropy}_{parent} - \left( \frac{\mathrm{items}_{left}}{\mathrm{items}_{total}} \cdot \mathrm{entropy}_{left} + \frac{\mathrm{items}_{right}}{\mathrm{items}_{total}} \cdot \mathrm{entropy}_{right} \right) \]
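A minimal sketch of the same formula in Python, assuming the entropy(a, b) helper from the previous slide and class counts for the parent and both children:

def information_gain(parent_a, parent_b, left_a, left_b, right_a, right_b):
    # Entropy of the parent minus the weighted entropy of its children
    total = parent_a + parent_b
    left_weight = (left_a + left_b) / total
    right_weight = (right_a + right_b) / total
    return (entropy(parent_a, parent_b)
            - (left_weight * entropy(left_a, left_b)
               + right_weight * entropy(right_a, right_b)))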
Applications
Besides classification
Exploratory analysis
Visualize your model to see if it makes sense
Detect problems in your data
Probability estimation (see the sketch below)
Uses frequencies
Linear models use distance to the decision boundary, which IMHO is a worse heuristic
Regression
Simple method: Constant value for each split
Advanced methods: Linear regression (or other) for each split
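A minimal sketch of the probability estimation and regression points on the toy data set; the parameter values and the use of distance as a regression target are illustrative assumptions:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Probability estimation: leaf class frequencies become probabilities
clf = DecisionTreeClassifier(min_samples_leaf=5)
clf.fit(df[['age', 'distance']], df['attended'])
probabilities = clf.predict_proba(df[['age', 'distance']])

# Regression (simple method): each leaf predicts a constant value
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(df[['age']], df['distance'])
predictions = reg.predict(df[['age']])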
Summary
Different approach than linear models
Straightforward algorithm theory
Multiple uses: classification, regression, etc.
Strong points:
Not a black box: visualize and debug
Implicit preprocessing: feature selection, normalization
Weak points:
Need to control overfitting
Performance compared to logit
Feature extraction and binning to improve results
Thank you
QUESTIONS?
Appendix
Tree to nodes
from collections import namedtuple
from itertools import starmap

def tree_to_nodes(dtree):
    nodes = starmap(namedtuple('Node', 'feature,threshold,left,right'),
                    zip(map(lambda x: {0: 'age', 1: 'distance'}.get(x),
                            dtree.tree_.feature),
                        dtree.tree_.threshold,
                        dtree.tree_.children_left,
                        dtree.tree_.children_right))
    return list(nodes)
CART plot decision boundaries (I)
from collections import namedtuple, deque
from functools import partial

class NodeRanges(namedtuple('NodeRanges', 'node,max_x,min_x,max_y,min_y')):
    pass

def cart_plot(dtree):
    nodes = tree_to_nodes(dtree)
    plot = base_plot()
    add_line = partial(plot.line, line_color='black', line_width=2)
    stack = deque()
    stack.append(NodeRanges(node=nodes[0],
                            max_x=df['distance'].max(),
                            min_x=df['distance'].min(),
                            max_y=df['age'].max(),
                            min_y=df['age'].min()))
    # (continues)
CART plot decision boundaries (II)
    while len(stack):
        node, max_x, min_x, max_y, min_y = stack.pop()
        feature, threshold, left, right = node
        if feature == 'distance':
            add_line(x=[threshold, threshold], y=[min_y, max_y])
        elif feature == 'age':
            add_line(x=[min_x, max_x], y=[threshold, threshold])
        else:
            continue
        stack.append(NodeRanges(node=nodes[left],
                                max_x=threshold if feature == 'distance' else max_x,
                                min_x=min_x,
                                max_y=threshold if feature == 'age' else max_y,
                                min_y=min_y))
        stack.append(NodeRanges(node=nodes[right],
                                max_x=max_x,
                                min_x=threshold if feature == 'distance' else min_x,
                                max_y=max_y,
                                min_y=threshold if feature == 'age' else min_y))
    show(plot)
CART tree in a Jupyter notebook
import pydot
from io import StringIO
from IPython.display import Image
from sklearn import tree

def print_cart_notebook(clf, features):
    dot_data = StringIO()
    tree.export_graphviz(clf, feature_names=features, out_file=dot_data)
    # Note: recent pydot versions return a list of graphs here
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())