This document presents structured support vector machines (SSVMs), a method for learning the parameters of structured prediction models with the goal of minimizing expected loss. Because the expected loss is unavailable and discontinuous in the parameters, SSVMs minimize an empirical estimate over a training set and replace the loss with a convex upper bound (the structured hinge loss), which can then be optimized numerically, e.g. by subgradient descent or quadratic programming. Worked examples include multiclass and hierarchical classification.
04 structured support vector machine
1. Part 5: Structured Support Vector Machines

Sebastian Nowozin and Christoph H. Lampert
Colorado Springs, 25th June 2011
2–3. Problem (Loss-Minimizing Parameter Learning)

Let $d(x,y)$ be the (unknown) true data distribution.
Let $\mathcal{D} = \{(x^1,y^1),\dots,(x^N,y^N)\}$ be i.i.d. samples from $d(x,y)$.
Let $\varphi : \mathcal{X}\times\mathcal{Y}\to\mathbb{R}^D$ be a feature function.
Let $\Delta : \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ be a loss function.

Find a weight vector $w^*$ that leads to minimal expected loss
$$\mathbb{E}_{(x,y)\sim d(x,y)}\{\Delta(y, f(x))\}$$
for $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w, \varphi(x,y)\rangle$.

Pro:
- We directly optimize for the quantity of interest: the expected loss.
- No expensive-to-compute partition function $Z$ will show up.
Con:
- We need to know the loss function already at training time.
- We can't use probabilistic reasoning to find $w^*$.
4. Reminder: learning by regularized risk minimization

For the compatibility function $g(x,y;w) := \langle w,\varphi(x,y)\rangle$, find $w^*$ that minimizes
$$\mathbb{E}_{(x,y)\sim d(x,y)}\,\Delta\big(y,\ \operatorname{argmax}_y g(x,y;w)\big).$$

Two major problems:
- $d(x,y)$ is unknown
- $\operatorname{argmax}_y g(x,y;w)$ maps into a discrete space
  → $\Delta\big(y, \operatorname{argmax}_y g(x,y;w)\big)$ is discontinuous and piecewise constant in $w$
5. Task:
$$\min_w\ \mathbb{E}_{(x,y)\sim d(x,y)}\,\Delta\big(y,\ \operatorname{argmax}_y g(x,y;w)\big).$$

Problem 1: $d(x,y)$ is unknown.
Solution:
- Replace the expectation $\mathbb{E}_{(x,y)\sim d(x,y)}[\cdot]$ by the empirical estimate $\frac{1}{N}\sum_{(x^n,y^n)}[\cdot]$.
- To avoid overfitting, add a regularizer, e.g. $\lambda\|w\|^2$.

New task:
$$\min_w\ \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^N \Delta\big(y^n,\ \operatorname{argmax}_y g(x^n,y;w)\big).$$
6. Task:
$$\min_w\ \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^N \Delta\big(y^n,\ \operatorname{argmax}_y g(x^n,y;w)\big).$$

Problem 2: $\Delta\big(y,\ \operatorname{argmax}_y g(x,y;w)\big)$ is discontinuous with respect to $w$.
Solution:
- Replace $\Delta(y,y')$ by a well-behaved surrogate $\ell(x,y,w)$.
- Typically: an upper bound to $\Delta$, continuous and convex with respect to $w$.

New task:
$$\min_w\ \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^N \ell(x^n,y^n,w).$$
7–10. Regularized Risk Minimization
$$\min_w\ \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^N \ell(x^n,y^n,w)$$
= Regularization + Loss on training data

Hinge loss: maximum-margin training
$$\ell(x^n,y^n,w) := \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big]$$

- $\ell$ is a maximum over functions that are linear in $w$ → continuous, convex.
- $\ell$ bounds $\Delta$ from above.
  Proof: let $\bar{y} = \operatorname{argmax}_y g(x^n,y,w)$. Then
  $$\Delta(y^n,\bar{y}) \le \Delta(y^n,\bar{y}) + g(x^n,\bar{y},w) - g(x^n,y^n,w) \le \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + g(x^n,y,w) - g(x^n,y^n,w)\big].$$
  (The first inequality holds because $g(x^n,\bar{y},w)\ge g(x^n,y^n,w)$ by the definition of $\bar{y}$.)

Alternative – logistic loss: probabilistic training
$$\ell(x^n,y^n,w) := \log\sum_{y\in\mathcal{Y}}\exp\big(\langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big)$$
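To make the surrogate concrete, below is a minimal NumPy sketch of the structured hinge loss (not from the slides). It assumes user-supplied functions `phi(x, y)` and `delta(y, y_prime)` and a label set small enough to enumerate; in real structured problems the maximization over $\mathcal{Y}$ is done by a loss-augmented argmax solver instead of brute force.

```python
import numpy as np

def structured_hinge_loss(w, x, y_true, phi, delta, label_set):
    """Structured hinge loss:
    max_y [ delta(y_true, y) + <w, phi(x, y)> - <w, phi(x, y_true)> ].
    Brute-force enumeration of label_set; a sketch, not production code."""
    score_true = np.dot(w, phi(x, y_true))
    return max(delta(y_true, y) + np.dot(w, phi(x, y)) - score_true
               for y in label_set)
```

Because it is a pointwise maximum of functions that are linear in `w`, the returned value is convex in `w`, and it upper-bounds $\Delta(y^n, f(x^n))$ as shown in the proof above.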
11. Structured Output Support Vector Machine:
$$\min_w\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big]$$

Conditional Random Field:
$$\min_w\ \frac{\|w\|^2}{2\sigma^2} + \sum_{n=1}^N \log\sum_{y\in\mathcal{Y}}\exp\big(\langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big)$$

CRFs and SSVMs have more in common than usually assumed:
- both do regularized risk minimization
- $\log\sum_y \exp(\cdot)$ can be interpreted as a soft-max
12. Solving the Training Optimization Problem Numerically

Structured Output Support Vector Machine:
$$\min_w\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big]$$

Unconstrained optimization, convex, non-differentiable objective.
13. Structured Output SVM (equivalent formulation):
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$:
$$\max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big] \le \xi^n$$
$N$ non-linear constraints, convex, differentiable objective.
14. Structured Output SVM (also equivalent formulation):
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$:
$$\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle \le \xi^n,\quad\text{for all } y\in\mathcal{Y}$$
$N\,|\mathcal{Y}|$ linear constraints, convex, differentiable objective.
15. Example: Multiclass SVM
$$\mathcal{Y} = \{1,2,\dots,K\},\qquad \Delta(y,y') = \begin{cases}1 & \text{for } y\ne y'\\ 0 & \text{otherwise}\end{cases}$$
$$\varphi(x,y) = \big([y{=}1]\,\varphi(x),\ [y{=}2]\,\varphi(x),\ \dots,\ [y{=}K]\,\varphi(x)\big)\quad\text{(Iverson brackets } [\cdot])$$

Solve:
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$:
$$\langle w,\varphi(x^n,y^n)\rangle - \langle w,\varphi(x^n,y)\rangle \ge 1 - \xi^n\quad\text{for all } y\in\mathcal{Y}\setminus\{y^n\}.$$
Classification: $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w,\varphi(x,y)\rangle$.

This is the Crammer–Singer multiclass SVM.
[K. Crammer, Y. Singer: "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR, 2001]
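The stacked feature map above can be written out directly. The following sketch (hypothetical helper names, assuming integer labels $0,\dots,K{-}1$ and a precomputed input feature vector `phi_x`) also implements the loss-augmented argmax needed later for training, which for multiclass is a simple enumeration:

```python
import numpy as np

def joint_feature(phi_x, y, K):
    """Crammer-Singer joint feature map: phi_x is copied into the y-th
    of K stacked blocks, all other blocks stay zero."""
    D = phi_x.shape[0]
    out = np.zeros(K * D)
    out[y * D:(y + 1) * D] = phi_x
    return out

def zero_one_loss(y, y_prime):
    return 0.0 if y == y_prime else 1.0

def loss_augmented_argmax(w, phi_x, y_true, K):
    """argmax_y [ Delta(y_true, y) + <w, phi(x, y)> ] by enumerating K labels."""
    scores = [zero_one_loss(y_true, y) + w @ joint_feature(phi_x, y, K)
              for y in range(K)]
    return int(np.argmax(scores))
```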
16. Example: Hierarchical SVM

Hierarchical multiclass loss:
$$\Delta(y,y') := \frac{1}{2}\,(\text{distance in the class tree})$$
e.g. $\Delta(\text{cat},\text{cat}) = 0$, $\Delta(\text{cat},\text{dog}) = 1$, $\Delta(\text{cat},\text{bus}) = 2$, etc.

Solve:
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$:
$$\langle w,\varphi(x^n,y^n)\rangle - \langle w,\varphi(x^n,y)\rangle \ge \Delta(y^n,y) - \xi^n\quad\text{for all } y\in\mathcal{Y}.$$

[L. Cai, T. Hofmann: "Hierarchical Document Categorization with Support Vector Machines", ACM CIKM, 2004]
[A. Binder, K.-R. Müller, M. Kawanabe: "On taxonomies for multi-class image categorization", IJCV, 2011]
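The tree loss itself is straightforward to compute. A small sketch, assuming the class hierarchy is given as a hypothetical child-to-parent dictionary (the root maps to None):

```python
def tree_distance(a, b, parent):
    """Number of edges between nodes a and b in a tree given as a
    child -> parent map (the root maps to None)."""
    def ancestors(n):
        path = []
        while n is not None:
            path.append(n)
            n = parent.get(n)
        return path
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    # depth of the lowest common ancestor as seen from each side
    da = next(i for i, n in enumerate(pa) if n in common)
    db = next(i for i, n in enumerate(pb) if n in common)
    return da + db

def hierarchical_loss(y, y_prime, parent):
    """Delta(y, y') = 1/2 * (tree distance), as on the slide."""
    return 0.5 * tree_distance(y, y_prime, parent)

# e.g. parent = {"cat": "animal", "dog": "animal", "bus": "vehicle",
#                "animal": "root", "vehicle": "root", "root": None}
# hierarchical_loss("cat", "dog", parent) == 1.0
# hierarchical_loss("cat", "bus", parent) == 2.0
```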
17. Solving the Training Optimization Problem Numerically

We can solve S-SVM training like CRF training:
$$\min_w\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big]$$
The objective is continuous, unconstrained and convex, but non-differentiable
→ we can't use gradient descent directly
→ we'll have to use subgradients.
18–21. Definition (Subgradient)

Let $f:\mathbb{R}^D\to\mathbb{R}$ be a convex, not necessarily differentiable, function.
A vector $v\in\mathbb{R}^D$ is called a subgradient of $f$ at $w_0$ if
$$f(w) \ge f(w_0) + \langle v,\ w - w_0\rangle\quad\text{for all } w.$$

[Figure: the affine function $f(w_0) + \langle v, w - w_0\rangle$ supports $f$ from below and touches it at $w_0$; at a kink of $f$, several such supporting lines, i.e. several subgradients, exist.]

For differentiable $f$, the gradient $v = \nabla f(w_0)$ is the only subgradient.
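A standard one-dimensional illustration (not from the slides): for $f(w) = |w|$, every $v\in[-1,1]$ is a subgradient at $w_0 = 0$, since $|w| \ge v\cdot w$ holds for all $w$ exactly when $|v|\le 1$. At the kink the subgradient is a whole set, $\partial f(0) = [-1,1]$, while for $w_0 \ne 0$ it reduces to the single gradient $\operatorname{sign}(w_0)$.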
22. Subgradient descent works basically like gradient descent:

Subgradient Descent Minimization – minimize $F(w)$
require: tolerance $\epsilon > 0$, stepsizes $\eta_t$
$w_{\mathrm{cur}} \leftarrow 0$
repeat
  pick any $v \in \partial_w F(w_{\mathrm{cur}})$
  $w_{\mathrm{cur}} \leftarrow w_{\mathrm{cur}} - \eta_t v$
until $F$ changed less than $\epsilon$
return $w_{\mathrm{cur}}$

Converges to the global minimum, but rather inefficient if $F$ is non-differentiable.
[Shor, "Minimization methods for non-differentiable functions", Springer, 1985.]
23–30. Computing a subgradient:
$$\min_w\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \ell^n(w)$$
with $\ell^n(w) = \max_y \ell^n_y(w)$, and
$$\ell^n_y(w) := \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle.$$

For each $y\in\mathcal{Y}$, $\ell^n_y(w)$ is a linear function of $w$, so $\ell^n(w) = \max_y \ell^n_y(w)$ is the maximum over all $y\in\mathcal{Y}$ of a family of linear functions.
[Figure: several lines $\ell^n_y(w)$; their upper envelope is the piecewise-linear, convex function $\ell^n(w)$.]

Subgradient of $\ell^n$ at $w_0$: find the maximal (active) $y$, then use $v = \nabla\ell^n_y(w_0)$.
31. Subgradient Descent S-SVM Training

input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\}\subset\mathcal{X}\times\mathcal{Y}$,
input: feature map $\varphi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$,
input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1,\dots,T$
1: $w \leftarrow 0$
2: for $t = 1,\dots,T$ do
3:   for $n = 1,\dots,N$ do
4:     $\hat{y} \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\ \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle$
5:     $v^n \leftarrow \varphi(x^n,y^n) - \varphi(x^n,\hat{y})$
6:   end for
7:   $w \leftarrow w - \eta_t\big(w - \frac{C}{N}\sum_n v^n\big)$
8: end for
output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w,\varphi(x,y)\rangle$.

Observation: each update of $w$ needs one argmax-prediction per example.
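A minimal NumPy sketch of this training loop for the multiclass case, reusing the hypothetical `joint_feature`, `zero_one_loss`, and `loss_augmented_argmax` helpers from the multiclass example above; `eta` is a step-size schedule $t\mapsto\eta_t$:

```python
import numpy as np

def subgradient_ssvm_train(X, Y, K, C, T, eta):
    """Subgradient Descent S-SVM Training (slide 31) for the multiclass
    case. X: list of input feature vectors, Y: integer labels in
    {0, ..., K-1}, eta: step-size schedule t -> eta_t."""
    N, D = len(X), X[0].shape[0]
    w = np.zeros(K * D)
    for t in range(T):
        v_sum = np.zeros_like(w)
        for n in range(N):
            # loss-augmented prediction: one argmax per example and pass
            y_hat = loss_augmented_argmax(w, X[n], Y[n], K)
            v_sum += joint_feature(X[n], Y[n], K) - joint_feature(X[n], y_hat, K)
        # subgradient step on 0.5*||w||^2 + (C/N) * sum_n loss_n
        w -= eta(t) * (w - (C / N) * v_sum)
    return w

# usage sketch:
# w = subgradient_ssvm_train(X, Y, K=3, C=1.0, T=100, eta=lambda t: 1.0 / (t + 1))
```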
32. We can use the same tricks as for CRFs, e.g. stochastic updates:

Stochastic Subgradient Descent S-SVM Training
input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\}\subset\mathcal{X}\times\mathcal{Y}$,
input: feature map $\varphi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$,
input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1,\dots,T$
1: $w \leftarrow 0$
2: for $t = 1,\dots,T$ do
3:   $(x^n,y^n) \leftarrow$ randomly chosen training example pair
4:   $\hat{y} \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\ \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle - \langle w,\varphi(x^n,y^n)\rangle$
5:   $w \leftarrow w - \eta_t\big(w - \frac{C}{N}\,[\varphi(x^n,y^n) - \varphi(x^n,\hat{y})]\big)$
6: end for
output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w,\varphi(x,y)\rangle$.

Observation: each update of $w$ needs only one argmax-prediction
(but we'll need many iterations until convergence).
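The corresponding one-example update, as a sketch under the same assumptions (the slide's $C/N$ scaling is kept as written):

```python
import numpy as np

def stochastic_ssvm_step(w, X, Y, K, C, eta_t, rng):
    """One update of Stochastic Subgradient Descent S-SVM Training
    (slide 32): a single random example stands in for the full sum."""
    n = rng.integers(len(X))  # e.g. rng = np.random.default_rng()
    y_hat = loss_augmented_argmax(w, X[n], Y[n], K)
    v = joint_feature(X[n], Y[n], K) - joint_feature(X[n], y_hat, K)
    return w - eta_t * (w - (C / len(X)) * v)
```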
33–36. Solving the Training Optimization Problem Numerically

We can solve an S-SVM like a linear SVM. One of the equivalent formulations was:
$$\min_{w\in\mathbb{R}^D,\ \xi\in\mathbb{R}^N_+}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$:
$$\langle w,\varphi(x^n,y^n)\rangle - \langle w,\varphi(x^n,y)\rangle \ge \Delta(y^n,y) - \xi^n,\quad\text{for all } y\in\mathcal{Y}.$$

Introduce feature vectors $\delta\varphi(x^n,y^n,y) := \varphi(x^n,y^n) - \varphi(x^n,y)$; the constraints become
$$\langle w,\ \delta\varphi(x^n,y^n,y)\rangle \ge \Delta(y^n,y) - \xi^n.$$

This has the same structure as an ordinary SVM:
- quadratic objective
- linear constraints

Question: can't we use an ordinary SVM/QP solver?
Answer: almost! We could, if there weren't $N\,|\mathcal{Y}|$ constraints.
E.g. for 100 binary $16\times 16$ images: $N\,|\mathcal{Y}| = 100\cdot 2^{256} \approx 10^{79}$ constraints.
37–39. Solution: working set training

- It's enough if we enforce the active constraints; the others will be fulfilled automatically.
- We don't know which constraints are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized).

Keep a set of potentially active constraints and update it iteratively:

Working Set Training
- Start with the working set $S = \emptyset$ (no constraints).
- Repeat until convergence:
  - Solve the S-SVM training problem with only the constraints in $S$.
  - Check whether the solution violates any constraint of the full set.
    - If no: we found the optimal solution; terminate.
    - If yes: add the most violated constraints to $S$ and iterate.

Good practical performance and theoretical guarantees:
polynomial-time convergence to within $\epsilon$ of the global optimum.
[Tsochantaridis et al.: "Large Margin Methods for Structured and Interdependent Output Variables", JMLR, 2005.]
40. Working Set S-SVM Training

input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\}\subset\mathcal{X}\times\mathcal{Y}$,
input: feature map $\varphi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$
1: $S \leftarrow \emptyset$
2: repeat
3:   $(w,\xi) \leftarrow$ solution of the QP with only the constraints in $S$
4:   for $n = 1,\dots,N$ do
5:     $\hat{y} \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\ \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle$
6:     if $\hat{y} \ne y^n$ then
7:       $S \leftarrow S \cup \{(x^n,\hat{y})\}$
8:     end if
9:   end for
10: until $S$ doesn't change anymore
output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w,\varphi(x,y)\rangle$.

Observation: each update of $w$ needs one argmax-prediction per example
(but we solve globally for the next $w$, not by local steps).
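Below is a sketch of this loop for the multiclass case, again reusing the hypothetical helpers from above. The restricted QP is handed to SciPy's generic SLSQP solver purely for illustration; real implementations use dedicated QP solvers. The working set stores (example index, violating label) pairs:

```python
import numpy as np
from scipy.optimize import minimize

def solve_restricted_qp(S, X, Y, K, C):
    """Solve the S-SVM QP using only the constraints in S, a list of
    (n, y_hat) pairs. Variables are z = [w, xi]; xi >= 0 via bounds."""
    N, D = len(X), X[0].shape[0]
    dim = K * D
    def objective(z):
        w, xi = z[:dim], z[dim:]
        return 0.5 * w @ w + (C / N) * xi.sum()
    constraints = []
    for n, y_hat in S:
        dphi = joint_feature(X[n], Y[n], K) - joint_feature(X[n], y_hat, K)
        margin = zero_one_loss(Y[n], y_hat)
        # <w, dphi> >= margin - xi_n  <=>  <w, dphi> - margin + xi_n >= 0
        constraints.append({"type": "ineq",
                            "fun": lambda z, d=dphi, m=margin, i=n:
                                z[:dim] @ d - m + z[dim + i]})
    bounds = [(None, None)] * dim + [(0, None)] * N
    result = minimize(objective, np.zeros(dim + N), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x[:dim]

def working_set_train(X, Y, K, C):
    """Working Set S-SVM Training (slide 40): alternate between solving
    the restricted QP and adding currently violated labels."""
    S = []
    while True:
        w = solve_restricted_qp(S, X, Y, K, C)
        changed = False
        for n in range(len(X)):
            y_hat = loss_augmented_argmax(w, X[n], Y[n], K)
            if y_hat != Y[n] and (n, y_hat) not in S:
                S.append((n, y_hat))
                changed = True
        if not changed:
            return w
```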
41–42. One-Slack Formulation of S-SVM

(equivalent to the ordinary S-SVM formulation via $\xi = \frac{1}{N}\sum_n \xi^n$)
$$\min_{w\in\mathbb{R}^D,\ \xi\in\mathbb{R}_+}\ \frac{1}{2}\|w\|^2 + C\xi$$
subject to, for all $(\hat{y}^1,\dots,\hat{y}^N)\in\mathcal{Y}\times\dots\times\mathcal{Y}$:
$$\sum_{n=1}^N\big[\Delta(y^n,\hat{y}^n) + \langle w,\varphi(x^n,\hat{y}^n)\rangle - \langle w,\varphi(x^n,y^n)\rangle\big] \le N\xi.$$

$|\mathcal{Y}|^N$ linear constraints, convex, differentiable objective.
We blew up the constraint set even further; for 100 binary $16\times16$ images: $|\mathcal{Y}|^N = (2^{256})^{100} \approx 10^{7700}$ constraints (instead of $10^{79}$).
43. Working Set One-Slack S-SVM Training

input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\}\subset\mathcal{X}\times\mathcal{Y}$,
input: feature map $\varphi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$
1: $S \leftarrow \emptyset$
2: repeat
3:   $(w,\xi) \leftarrow$ solution of the QP with only the constraints in $S$
4:   for $n = 1,\dots,N$ do
5:     $\hat{y}^n \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\ \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle$
6:   end for
7:   $S \leftarrow S \cup \big\{\big((x^1,\dots,x^N),\ (\hat{y}^1,\dots,\hat{y}^N)\big)\big\}$
8: until $S$ doesn't change anymore
output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\langle w,\varphi(x,y)\rangle$.

Often faster convergence: we add one strong constraint per iteration instead of $N$ weak ones.
44. We can solve an S-SVM like a non-linear SVM: compute the Lagrangian dual.
- min becomes max,
- the original (primal) variables $w,\xi$ disappear,
- new (dual) variables $\alpha_{ny}$: one per constraint of the original problem.

Dual S-SVM problem
$$\max_{\alpha\in\mathbb{R}^{N|\mathcal{Y}|}_+}\ \sum_{\substack{n=1,\dots,N\\ y\in\mathcal{Y}}}\alpha_{ny}\,\Delta(y^n,y)\ -\ \frac{1}{2}\sum_{\substack{n,\bar{n}=1,\dots,N\\ y,\bar{y}\in\mathcal{Y}}}\alpha_{ny}\,\alpha_{\bar{n}\bar{y}}\,\big\langle\delta\varphi(x^n,y^n,y),\ \delta\varphi(x^{\bar{n}},y^{\bar{n}},\bar{y})\big\rangle$$
subject to, for $n = 1,\dots,N$:
$$\sum_{y\in\mathcal{Y}}\alpha_{ny} \le \frac{C}{N}.$$

$N$ linear constraints, convex, differentiable objective, $N\,|\mathcal{Y}|$ variables.
45. We can kernelize:

Define a joint kernel function $k : (\mathcal{X}\times\mathcal{Y})\times(\mathcal{X}\times\mathcal{Y})\to\mathbb{R}$,
$$k\big((x,y),(\bar{x},\bar{y})\big) = \big\langle\varphi(x,y),\ \varphi(\bar{x},\bar{y})\big\rangle.$$
$k$ measures the similarity between two (input, output) pairs.

We can express the optimization in terms of $k$:
$$\begin{aligned}
\big\langle\delta\varphi(x^n,y^n,y),\ \delta\varphi(x^{\bar{n}},y^{\bar{n}},\bar{y})\big\rangle
&= \big\langle\varphi(x^n,y^n)-\varphi(x^n,y),\ \varphi(x^{\bar{n}},y^{\bar{n}})-\varphi(x^{\bar{n}},\bar{y})\big\rangle\\
&= k\big((x^n,y^n),(x^{\bar{n}},y^{\bar{n}})\big) - k\big((x^n,y^n),(x^{\bar{n}},\bar{y})\big)\\
&\quad - k\big((x^n,y),(x^{\bar{n}},y^{\bar{n}})\big) + k\big((x^n,y),(x^{\bar{n}},\bar{y})\big)\\
&=: K_{n\bar{n}y\bar{y}}
\end{aligned}$$
46. Kernelized S-SVM problem:
$$\max_{\alpha\in\mathbb{R}^{N|\mathcal{Y}|}_+}\ \sum_{\substack{n=1,\dots,N\\ y\in\mathcal{Y}}}\alpha_{ny}\,\Delta(y^n,y)\ -\ \frac{1}{2}\sum_{\substack{n,\bar{n}=1,\dots,N\\ y,\bar{y}\in\mathcal{Y}}}\alpha_{ny}\,\alpha_{\bar{n}\bar{y}}\,K_{n\bar{n}y\bar{y}}$$
subject to, for $n = 1,\dots,N$:
$$\sum_{y\in\mathcal{Y}}\alpha_{ny} \le \frac{C}{N}.$$

Too many variables: train with a working set of the $\alpha_{ny}$.

Kernelized prediction function:
$$f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\ \sum_{n,y'}\alpha_{ny'}\,k\big((x^n,y'),\ (x,y)\big)$$
47. What do "joint kernel functions" look like?
$$k\big((x,y),(\bar{x},\bar{y})\big) = \big\langle\varphi(x,y),\ \varphi(\bar{x},\bar{y})\big\rangle.$$
As with graphical models, things are easier if $\varphi$ decomposes with respect to factors:
$$\varphi(x,y) = \big(\varphi_F(x,y_F)\big)_{F\in\mathcal{F}}$$
Then the kernel decomposes into a sum over factors:
$$k\big((x,y),(\bar{x},\bar{y})\big) = \Big\langle\big(\varphi_F(x,y_F)\big)_{F\in\mathcal{F}},\ \big(\varphi_F(\bar{x},\bar{y}_F)\big)_{F\in\mathcal{F}}\Big\rangle = \sum_{F\in\mathcal{F}}\big\langle\varphi_F(x,y_F),\ \varphi_F(\bar{x},\bar{y}_F)\big\rangle = \sum_{F\in\mathcal{F}}k_F\big((x,y_F),(\bar{x},\bar{y}_F)\big)$$
We can define kernels for each factor (e.g. nonlinear).
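As a sketch, with a hypothetical interface in which each factor $F$ is a tuple of output-variable indices and `factor_kernels` maps it to its own kernel function:

```python
def joint_kernel(x, y, x2, y2, factors, factor_kernels):
    """Joint kernel decomposed over factors (slide 47):
    k((x,y),(x2,y2)) = sum_F k_F((x, y_F), (x2, y2_F)).
    factors: list of index tuples F; factor_kernels: dict F -> kernel."""
    return sum(factor_kernels[F](x, tuple(y[i] for i in F),
                                 x2, tuple(y2[i] for i in F))
               for F in factors)
```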
48. Example: figure-ground segmentation with grid structure
[Figure: $(x,y) =$ an input image $x$ and its binary segmentation mask $y$.]

Typical kernels: arbitrary in $x$, linear (or at least simple) with respect to $y$:
Unary factors:
$$k_p\big((x_p,y_p),(x'_p,y'_p)\big) = k(x_p,x'_p)\,[y_p{=}y'_p]$$
with $k(x_p,x'_p)$ a local image kernel, e.g. $\chi^2$ or histogram intersection.
Pairwise factors:
$$k_{pq}\big((y_p,y_q),(y'_p,y'_q)\big) = [y_p{=}y'_p]\,[y_q{=}y'_q]$$
More powerful than all-linear, and argmax-prediction is still possible.
49. Example: object localization
[Figure: $(x,y) =$ an image $x$ and a bounding box $y$ given by left/top/right/bottom coordinates.]

Only one factor, which includes all of $x$ and $y$:
$$k\big((x,y),(x',y')\big) = k_{\mathrm{image}}\big(x|_y,\ x'|_{y'}\big)$$
with $k_{\mathrm{image}}$ an image kernel and $x|_y$ the image region within box $y$.
argmax-prediction is as difficult as object localization with a $k_{\mathrm{image}}$-SVM.
50–51. Summary – S-SVM Learning

Given:
- training set $\{(x^1,y^1),\dots,(x^N,y^N)\}\subset\mathcal{X}\times\mathcal{Y}$
- loss function $\Delta : \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$

Task: learn a parameter $w$ for $f(x) := \operatorname{argmax}_y\langle w,\varphi(x,y)\rangle$ that minimizes the expected loss on future data.

S-SVM solution derived from the maximum-margin framework:
- enforce that the correct output is better than all others by a margin:
  $$\langle w,\varphi(x^n,y^n)\rangle \ge \Delta(y^n,y) + \langle w,\varphi(x^n,y)\rangle\quad\text{for all } y\in\mathcal{Y}.$$
- convex optimization problem, but non-differentiable
- many equivalent formulations → different training algorithms
- training needs repeated argmax prediction, but no probabilistic inference
52. Extra I: Beyond Fully Supervised Learning

So far, training was fully supervised: all variables were observed.
In real life, some variables are unobserved even during training:
- missing labels in the training data
- latent variables, e.g. part location, part occlusion, viewpoint
53–54. Three types of variables:
- $x\in\mathcal{X}$: always observed,
- $y\in\mathcal{Y}$: observed only during training,
- $z\in\mathcal{Z}$: never observed (latent).

Decision function: $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}}\max_{z\in\mathcal{Z}}\langle w,\varphi(x,y,z)\rangle$

Maximum-Margin Training with Maximization over Latent Variables

Solve:
$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \xi^n$$
subject to, for $n = 1,\dots,N$, for all $y\in\mathcal{Y}$:
$$\Delta(y^n,y) + \max_{z\in\mathcal{Z}}\langle w,\varphi(x^n,y,z)\rangle - \max_{z\in\mathcal{Z}}\langle w,\varphi(x^n,y^n,z)\rangle \le \xi^n$$

Problem: not a convex problem → can have local minima.
[C. Yu, T. Joachims: "Learning Structural SVMs with Latent Variables", ICML, 2009]
Similar idea: [P. Felzenszwalb, D. McAllester, D. Ramanan: "A Discriminatively Trained, Multiscale, Deformable Part Model", CVPR, 2008]
55. Structured Learning is Full of Open Research Questions

How to train faster?
- CRFs need many runs of probabilistic inference,
- SSVMs need many runs of argmax-prediction.
How to reduce the necessary amount of training data?
- semi-supervised learning? transfer learning?
How can we better understand different loss functions?
- when to use probabilistic training, when maximum margin?
- CRFs are "consistent", SSVMs are not. Is this relevant?
Can we understand structured learning with approximate inference?
- often, computing $L(w)$ or $\operatorname{argmax}_y\langle w,\varphi(x,y)\rangle$ exactly is infeasible.
- can we guarantee good results even with approximate inference?
More and new applications!
56. Lunch Break

Continuing at 13:30.
Slides available at http://www.nowozin.net/sebastian/cvpr2011tutorial/