Enlarging the Margins in Perceptron Decision Trees

Kristin P. Bennett, Donghui Wu
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
bennek@rpi.edu, wud2@rpi.edu

Nello Cristianini
Dept of Engineering Mathematics
University of Bristol
nello.cristianini@bristol.ac.uk

John Shawe-Taylor
Dept of Computer Science
Royal Holloway, University of London
jst@dcs.rhbnc.ac.uk
Perceptron Decision Trees

Definition 0.1 Perceptron decision trees (PDTs) are decision trees in which each internal node is associated with a hyperplane in general position in the input space, i.e. the decision at a node is based on a linear combination of attributes instead of a single attribute.

[Figure: a PDT with root node w1, child decision nodes w2 and w3, and leaves labelled x, o, o, x (left), together with the corresponding piecewise-linear partition of the x and o examples in input space by the hyperplanes w1, w2, w3 (right).]
Which Decision Is Better?
               – Capacity Control of Linear Classifier




Construct a linear classifier $f(x) = w^T x - b$, i.e. find a vector $w \in \mathbb{R}^n$ and a scalar $b$ such that

$$f(x_i) = w^T x_i - b \ge 1 \quad \text{if } y_i = 1,$$
$$f(x_i) = w^T x_i - b \le -1 \quad \text{if } y_i = -1, \qquad i = 1, \ldots, m.$$
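A minimal numpy sketch (not from the slides) that checks these canonical constraints for a candidate pair (w, b) and reports the resulting geometric margin 2/||w||; the arrays X, y are hypothetical training data.

```python
import numpy as np

def margin_info(w, b, X, y):
    """Check the constraints y_i (w^T x_i - b) >= 1 and return the geometric margin."""
    scores = X @ w - b
    feasible = bool(np.all(y * scores >= 1))   # every point correctly classified, outside the margin band
    return feasible, 2.0 / np.linalg.norm(w)   # distance between the planes w^T x - b = +1 and w^T x - b = -1
```

Under this convention, comparing two separating hyperplanes amounts to comparing 2/||w||: the one with the smaller ||w|| has the wider margin.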
Large Margin PDTs Generalize Better

Theorem 0.1 Suppose we are able to classify a sample of m labeled examples using a perceptron decision tree, and suppose that the tree obtained contains K decision nodes with margin $\gamma_i$ at node i. Then with probability greater than $1 - \delta$ the generalization error is bounded by

$$\frac{130 R^2}{m}\left( D \log(4em)\log(4m) + \log\frac{(4m)^{K+1}\,2^K}{(K+1)\,\delta} \right),$$

where $D = \sum_{i=1}^{K} \frac{1}{\gamma_i^2}$ and R is the radius of a ball containing the input points.
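As a reading aid (not part of the original slides), a small Python function that evaluates this bound for a given set of node margins; R, m, and delta are the data radius, sample size, and confidence parameter of Theorem 0.1.

```python
import numpy as np

def pdt_generalization_bound(margins, R, m, delta):
    """Evaluate the Theorem 0.1 bound for a PDT with K = len(margins) decision nodes."""
    K = len(margins)
    D = sum(1.0 / g**2 for g in margins)                     # D = sum_i 1 / gamma_i^2
    log_term = np.log((4 * m) ** (K + 1) * 2 ** K / ((K + 1) * delta))
    return (130 * R**2 / m) * (D * np.log(4 * np.e * m) * np.log(4 * m) + log_term)
```

The bound shrinks as the node margins grow (D gets smaller) and tightens as m grows, which is the motivation for enlarging the margins at each node.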
The Baseline Algorithms for Comparison
Algorithm 0.1 (Basic OC1 Algorithm) Start with the root node; while a node remains to split:

 • Optimize the decision at the node with respect to a splitting criterion.
    – Randomized search for a local minimum.
    – Restarts to escape local minima.
 • Partition the node into two or more child nodes based on the decision.

Prune the tree if necessary.
Splitting criterion: Twoing Rule (goodness measure)

$$\text{TwoingValue} = \frac{|T_L|}{n} \cdot \frac{|T_R|}{n} \cdot \left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|}\right|\right)^{2}$$
Twoing Rule

Splitting criterion - Twoing Rule (Breiman et al., 1984):

$$\text{TwoingValue} = \frac{|T_L|}{n} \cdot \frac{|T_R|}{n} \cdot \left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|}\right|\right)^{2}$$

where
n = |T_L| + |T_R|, the total number of instances at the current node
k - number of classes (k = 2 for two-class problems)
|T_L| - number of instances on the left of the split, i.e. with $w^T x - b \ge 0$
|T_R| - number of instances on the right of the split, i.e. with $w^T x - b < 0$
|L_i| - number of instances of class i on the left of the split
|R_i| - number of instances of class i on the right of the split
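A short Python sketch of this computation (a paraphrase of the rule above, not code from the authors); left_counts[i] and right_counts[i] play the roles of |L_i| and |R_i|.

```python
def twoing_value(left_counts, right_counts):
    """Twoing value of a split, given per-class instance counts on each side."""
    TL, TR = sum(left_counts), sum(right_counts)
    n = TL + TR
    if TL == 0 or TR == 0:
        return 0.0                                 # degenerate split: all instances on one side
    spread = sum(abs(l / TL - r / TR) for l, r in zip(left_counts, right_counts))
    return (TL / n) * (TR / n) * spread ** 2
```

For a balanced two-class node, a split that separates the classes perfectly scores (1/2)(1/2)(2)^2 = 1, while a split that leaves identical class proportions on both sides scores 0.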
Enlarging the Margins of PDT

Algorithms for producing large-margin PDTs:

 • Post-process existing trees (FAT).
   Find the optimal separating hyperplane at each node of the existing tree.

 • Incorporate a large margin into the splitting criterion (MOC1):
   max TwoingValue + C * CurrentMargin

 • Incorporate a large margin into the goodness measure (MOC2):
   the modified twoing rule.
The OC1-PDT and FAT-PDT
IDEA: Post-process the existing OC1-PDT, replacing the OC1 decision with the optimal separating hyperplane at each decision node.


[Figure: the training points (x and o) reaching one decision node, shown with the original OC1 hyperplane and the replacement FAT hyperplane; the FAT boundary separates the same two groups with a visibly larger margin.]
The FAT Algorithm

1. Construct a decision tree using OC1; call it the OC1-PDT.

2. Starting from the root of the OC1-PDT, traverse all the non-leaf nodes. At each node,

    • Relabel the points at the node with $\omega^T x - b \ge 0$ as superclass right and the remaining points at the node as superclass left.
    • Find the perceptron (optimal separating hyperplane) $f(x) = \omega^{*T} x - b^*$ that separates superclasses right and left perfectly with maximal margin.
    • Replace the original perceptron with the new one.
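A hedged sketch of this post-processing step at a single node, using a hard-margin linear SVM (scikit-learn's SVC with a very large C) as the maximal-margin solver; the node object with attributes w and b is a hypothetical representation, not OC1's actual data structure.

```python
import numpy as np
from sklearn.svm import SVC

def fatten_node(node, X):
    """Replace the hyperplane (node.w, node.b) with the maximal-margin hyperplane
    separating the two superclasses that the original hyperplane induces on X."""
    labels = np.where(X @ node.w - node.b >= 0, 1, -1)   # superclass right = +1, left = -1
    if len(np.unique(labels)) < 2:
        return                                            # nothing to separate at this node
    svm = SVC(kernel="linear", C=1e6).fit(X, labels)      # large C approximates a hard margin
    node.w = svm.coef_[0]
    node.b = -svm.intercept_[0]                           # keep the f(x) = w^T x - b convention
```

Because the superclass labels are defined by a hyperplane, they are linearly separable by construction, so a hard-margin solution always exists; the full FAT algorithm applies this to every non-leaf node, each node seeing only the training points that reach it.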
FAT Generalizes Better Than OC1
[Figure: 10-fold cross-validation results, FAT vs OC1. Each point plots a dataset's OC1 10-CV average accuracy (x-axis, 65 to 100) against its FAT 10-CV average accuracy (y-axis, 65 to 100); points above the x = y line favor FAT, and statistically significant differences are marked.]
MOC1: Margin OC1

IDEA: Use a multi-objective splitting criterion that maximizes both the TwoingValue and the margin:

            max TwoingValue + C * CurrentMargin


[Figure: a two-class point set (x and o) with a candidate split that both separates the classes and leaves a wide margin between them.]
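A minimal sketch of the MOC1 objective for one candidate hyperplane, reusing the twoing_value helper sketched earlier; the trade-off constant C and the definition of CurrentMargin as the distance from the hyperplane to the closest point are illustrative assumptions, since the slides do not pin them down.

```python
import numpy as np

def moc1_score(w, b, X, y, C=1.0):
    """TwoingValue + C * CurrentMargin for the split w^T x - b >= 0 (illustrative)."""
    left = X @ w - b >= 0
    classes = np.unique(y)
    left_counts = [int(np.sum((y == c) & left)) for c in classes]
    right_counts = [int(np.sum((y == c) & ~left)) for c in classes]
    margin = np.min(np.abs(X @ w - b)) / np.linalg.norm(w)   # distance of the closest point to the plane
    return twoing_value(left_counts, right_counts) + C * margin
```

MOC1 then uses this combined score as the splitting criterion in place of the plain twoing value.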
MOC1 Generalizes Better Than OC1
[Figure: 10-fold cross-validation results, MOC1 vs OC1. Each point plots a dataset's OC1 10-CV average accuracy (x-axis, 65 to 100) against its MOC1 10-CV average accuracy (y-axis, 65 to 100); points above the x = y line favor MOC1, with significant and non-significant differences marked.]
The MOC2 Algorithm
IDEA: Modify the twoing criterion to allow a "soft margin": we want both high accuracy and strong separation (a wide margin).

$$\text{TwoingValue} = \frac{|MT_L|}{n} \cdot \frac{|MT_R|}{n} \cdot \left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|}\right|\right) \cdot \left(\sum_{i=1}^{k}\left|\frac{|ML_i|}{|MT_L|} - \frac{|MR_i|}{|MT_R|}\right|\right)$$

[Figure: a two-class point set with the three parallel hyperplanes $w^T x = b + 1$, $w^T x = b$, and $w^T x = b - 1$; points between the outer planes lie inside the soft margin.]
The Modified Twoing Rule

$$\text{TwoingValue} = \frac{|MT_L|}{n} \cdot \frac{|MT_R|}{n} \cdot \left(\sum_{i=1}^{k}\left|\frac{|L_i|}{|T_L|} - \frac{|R_i|}{|T_R|}\right|\right) \cdot \left(\sum_{i=1}^{k}\left|\frac{|ML_i|}{|MT_L|} - \frac{|MR_i|}{|MT_R|}\right|\right)$$

where
n = |T_L| + |T_R|, the total number of instances at the current node
k - number of classes (k = 2 for two-class problems)
|T_L| - number of instances on the left of the split, i.e. with $w^T x - b \ge 0$
|T_R| - number of instances on the right of the split, i.e. with $w^T x - b < 0$
|L_i| - number of instances of class i on the left of the split
|R_i| - number of instances of class i on the right of the split
|MT_L| - number of instances on the left of the split with $w^T x - b \ge 1$
|MT_R| - number of instances on the right of the split with $w^T x - b \le -1$
|ML_i| - number of instances of class i with $w^T x - b \ge 1$
|MR_i| - number of instances of class i with $w^T x - b \le -1$
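A sketch of the modified twoing rule in the same style as the twoing_value helper above; it is a paraphrase of the definitions on this slide, not the authors' code, and the handling of degenerate splits is an assumption.

```python
import numpy as np

def modified_twoing_value(w, b, X, y):
    """MOC2 soft-margin twoing value of the split w^T x - b (illustrative sketch)."""
    scores = X @ w - b
    classes = np.unique(y)
    TL, TR = np.sum(scores >= 0), np.sum(scores < 0)       # all points, split by the hyperplane
    MTL, MTR = np.sum(scores >= 1), np.sum(scores <= -1)   # only points outside the margin band
    n = TL + TR
    if min(TL, TR, MTL, MTR) == 0:
        return 0.0                                         # degenerate split or empty margin side
    class_spread = sum(abs(np.sum((y == c) & (scores >= 0)) / TL -
                           np.sum((y == c) & (scores < 0)) / TR) for c in classes)
    margin_spread = sum(abs(np.sum((y == c) & (scores >= 1)) / MTL -
                            np.sum((y == c) & (scores <= -1)) / MTR) for c in classes)
    return (MTL / n) * (MTR / n) * class_spread * margin_spread
```

Points falling strictly inside the margin band (-1 < w^T x - b < 1) still count toward |T_L| and |T_R| but not toward the margin counts, so splits that push points out of the band score higher: this is the soft-margin effect.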
MOC2 Generalizes Better Than OC1
[Figure: 10-fold cross-validation results, MOC2 vs OC1. Each point plots a dataset's OC1 10-CV average accuracy (x-axis, 65 to 100) against its MOC2 10-CV average accuracy (y-axis, 65 to 100); points above the x = y line favor MOC2, with significant and non-significant differences marked.]
Performances of OC1, FAT, MOC1 and MOC2
Dataset     OC1 x̄    FAT x̄ (p value)   MOC1 x̄ (p value)   MOC2 x̄ (p value)   Best classifier
Bright      98.46     98.62 (.05)       98.94 (.10)        98.82 (.10)        MOC1
Liver       65.22     66.09 (.10)       68.41 (.20)        70.14 (.04)        MOC2
Cancer      95.89     96.48 (.05)       95.60 (*)          95.89 (*)          FAT
Dim         94.82     94.92 (.20)       95.23 (.09)        94.90 (*)          MOC1
Heart       73.40     76.43 (.12)       75.76 (.21)        77.78 (.10)        MOC2
Housing     81.03     83.20 (.05)       82.02 (*)          80.23 (*)          FAT
Iris        95.33     96.00 (.17)       95.33 (*)          96.00 (*)          FAT
Diabetes    71.09     71.48 (.04)       73.18 (.08)        72.53 (.23)        MOC1
Prognosis   78.91     74.15 (**)        78.91 (*)          79.59 (*)          MOC2
Sonar       67.79     74.04 (.01)       72.12 (.19)        73.21 (.16)        FAT

x̄ is the 10-fold cross-validation average accuracy; p values are for the comparison with OC1.
Conclusions


• The generalization error of a PDT is bounded by a function of its margins, tree size, and training set size.

• Three algorithms for controlling the capacity of PDTs were investigated:

  – Post-processing existing trees (FAT)
  – Incorporating margins into the splitting criterion:
     ∗ Multicriteria splitting rule (MOC1)
     ∗ Soft-margin modified twoing rule (MOC2)

• Both theoretically and empirically, enlarged-margin PDTs performed better.
