Christoph Lampert, Talk part 2
Transcript

  • 1. Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria Microsoft Computer Vision School 2011
  • 2. Kernel Methods in Computer Vision: Overview. Monday: 9:00–10:00 Introduction to Kernel Classifiers; 10:00–10:10 mini break; 10:10–10:50 More About Kernel Methods. Tuesday: 9:00–9:50 Conditional Random Fields; 9:50–10:00 mini break; 10:00–10:50 Structured Support Vector Machines. Slides available on my home page: http://www.ist.ac.at/~chl
  • 3. Extended versions of the lectures are available in book form: Foundations and Trends in Computer Graphics and Vision, now publishers, www.nowpublishers.com/. Available as PDFs on my homepage.
  • 4. "Normal" Machine Learning: f : X → R.Structured Output Learning: f : X → Y.
  • 5. "Normal" Machine Learning: f : X → R. inputs X can be any kind of objects output y is a real number classification, regression, density estimation, . . .Structured Output Learning: f : X → Y. inputs X can be any kind of objects outputs y ∈ Y are complex (structured) objects images, text, audio, folds of a protein
  • 6. What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. (Examples: text, molecules / chemical structures, documents/hypertext, images.)
  • 7. What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression) Natural Language Processing: Automatic Translation (output: sentences) Sentence Parsing (output: parse trees) Bioinformatics: Secondary Structure Prediction (output: bipartite graphs) Enzyme Function Prediction (output: path in a tree) Speech Processing: Automatic Transcription (output: sentences) Text-to-Speech (output: audio signal) Robotics: Planning (output: sequence of actions) This lecture: Structured Prediction in Computer Vision
  • 8. Computer Vision Example: Semantic Image Segmentation. input: images; output: segmentation masks. input space X = {images} ≅ [0,255]^(3·M·N); output space Y = {segmentation masks} ≅ {0,1}^(M·N) (structured output). prediction function f : X → Y, f(x) := argmin_{y∈Y} E(x,y); energy function E(x,y) = Σ_i ⟨w_i, ϕ_u(x_i, y_i)⟩ + Σ_{i,j} ⟨w_ij, ϕ_p(y_i, y_j)⟩. Images: [M. Everingham et al. "The PASCAL Visual Object Classes (VOC) challenge", IJCV 2010]
  • 9. Computer Vision Example: Human Pose Estimation. input: image + body model; output: model fit. input space X = {images}; output space Y = {positions/angles of K body parts} ≅ R^(4K). prediction function f : X → Y, f(x) := argmin_{y∈Y} E(x,y); energy E(x,y) = Σ_i ⟨w_i, ϕ_fit(x_i, y_i)⟩ + Σ_{i,j} ⟨w_ij, ϕ_pose(y_i, y_j)⟩. Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008]
  • 10. Computer Vision Example: Point Matching. input: image pairs; output: mapping y : x_i ↔ y(x_i). prediction function f : X → Y, f(x) := argmax_{y∈Y} F(x,y); scoring function F(x,y) = Σ_i ⟨w_i, ϕ_sim(x_i, y(x_i))⟩ + Σ_{i,j} ⟨w_ij, ϕ_dist(x_i, x_j, y(x_i), y(x_j))⟩ + Σ_{i,j,k} ⟨w_ijk, ϕ_angle(x_i, x_j, x_k, y(x_i), y(x_j), y(x_k))⟩. [J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
  • 11. Computer Vision Example: Object Localization. input: image; output: object position (left, top, right, bottom). input space X = {images}; output space Y = R^4, bounding box coordinates. prediction function f : X → Y, f(x) := argmax_{y∈Y} F(x,y); scoring function F(x,y) = ⟨w, ϕ(x,y)⟩, where ϕ(x,y) = h(x|_y) is a feature vector for an image region, e.g. bag-of-visual-words. [M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV, 2008]
  • 12. Computer Vision Examples: Summary. Image Segmentation: y* = argmin_{y∈{0,1}^N} E(x,y) with E(x,y) = Σ_i ⟨w_i, ϕ(x_i,y_i)⟩ + Σ_{i,j} ⟨w_ij, ϕ(y_i,y_j)⟩. Pose Estimation: y* = argmin_{y∈R^(4K)} E(x,y), same form. Point Matching: y* = argmax_{y∈Π_n} F(x,y) with F(x,y) = Σ_i ⟨w_i, ϕ(x_i,y_i)⟩ + Σ_{i,j} ⟨w_ij, ϕ(y_i,y_j)⟩ + Σ_{i,j,k} ⟨w_ijk, ϕ(y_i,y_j,y_k)⟩. Object Localization: y* = argmax_{y∈R^4} F(x,y) with F(x,y) = ⟨w, ϕ(x,y)⟩.
  • 13. Grand Unified View. Predict structured output by maximization, y* = argmax_{y∈Y} F(x,y), of a compatibility function F(x,y) = ⟨w, ϕ(x,y)⟩ that is linear in a parameter vector w.
  • 14. A generic structured prediction problem. X: arbitrary input domain; Y: structured output domain, decompose y = (y_1, ..., y_k). Prediction function f : X → Y by f(x) = argmax_{y∈Y} F(x,y). Compatibility function (or negative of "energy"): F(x,y) = ⟨w, ϕ(x,y)⟩ = Σ_{i=1}^k ⟨w_i, ϕ_i(y_i, x)⟩ (unary terms) + Σ_{i,j=1}^k ⟨w_ij, ϕ_ij(y_i, y_j, x)⟩ (binary terms) + ... (higher order terms, sometimes).
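To make the unary-plus-pairwise form concrete, here is a minimal sketch (not from the talk; all sizes and scores are invented) that evaluates such a compatibility function on a tiny chain and predicts by exhaustive search over all labelings. The later slides replace this brute-force argmax with dynamic programming.

    # Minimal sketch: F(x,y) = sum_i unary(i, y_i) + sum_i pairwise(y_i, y_{i+1})
    # on a chain, maximized by exhaustive search (only feasible because K^M is tiny).
    import itertools
    import numpy as np

    K, M = 3, 4                              # K labels per node, M nodes
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(M, K))          # unary[i, a]    = <w_i, phi_i(a, x)>
    pairwise = rng.normal(size=(K, K))       # pairwise[a, b] = <w_ij, phi_ij(a, b)>

    def F(y):
        """Compatibility of one labeling y = (y_1, ..., y_M)."""
        score = sum(unary[i, y[i]] for i in range(M))
        score += sum(pairwise[y[i], y[i + 1]] for i in range(M - 1))
        return score

    # f(x) = argmax_y F(x, y)
    y_star = max(itertools.product(range(K), repeat=M), key=F)
    print(y_star, F(y_star))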
  • 15. Example: Sequence Prediction – Handwriting Recognition. X = {5-letter word images}, x = (x_1, ..., x_5), x_j ∈ {0,1}^(300×80); Y = {ASCII translations}, y = (y_1, ..., y_5), y_j ∈ {A, ..., Z}. The feature function has only unary terms: ϕ(x,y) = (ϕ_1(x, y_1), ..., ϕ_5(x, y_5)), so F(x,y) = ⟨w_1, ϕ_1(x, y_1)⟩ + · · · + ⟨w_5, ϕ_5(x, y_5)⟩. (Example: input word image, output Q V E S T.) Advantage: computing y* = argmax_y F(x,y) is easy; we can find each y*_i independently, checking 5 · 26 = 130 values. Problem: only local information, we can't correct errors.
  • 16. Example: Sequence Prediction – Handwriting Recognition. X = {5-letter word images}, x = (x_1, ..., x_5), x_j ∈ {0,1}^(300×80); Y = {ASCII translations}, y = (y_1, ..., y_5), y_j ∈ {A, ..., Z}. One global feature function: ϕ(x,y) = (0, ..., 0, Φ(x), 0, ..., 0) with Φ(x) in the y-th position if y ∈ D (a dictionary), and ϕ(x,y) = (0, ..., 0) otherwise. (Example: input word image, output QUEST.) Advantage: access to global information, e.g. from the dictionary D. Problem: argmax_y ⟨w, ϕ(x,y)⟩ has to check 26^5 = 11,881,376 values, and we need separate training data for each word.
  • 17. Example: Sequence Prediction – Handwriting Recognition X = 5-letter word images , x = (x1 , . . . , x5 ), xj ∈ {0, 1}300×80 Y = ASCII translation , y = (y1 , . . . , y5 ), yj ∈ {A, . . . , Z }. feature function with unary and pairwise terms ϕ(x, y) = ϕ1 (y1 , x), ϕ2 (y2 , x), . . . , ϕ5 (y5 , x), ϕ1,2 (y1 , y2 ), . . . , ϕ4,5 (y4 , y5 ) Q U E S T Output InputCompromise: computing y ∗ is still efficient (Viterbi best path)Compromise: neighbor information allows correction of local errors.
  • 18. Example: Sequence Prediction – Handwriting Recognition. ϕ(x,y) = (ϕ_1(y_1,x), ..., ϕ_5(y_5,x), ϕ_{1,2}(y_1,y_2), ..., ϕ_{4,5}(y_4,y_5)); w = (w_1, ..., w_5, w_{1,2}, ..., w_{4,5}); F(x,y) = ⟨w, ϕ(x,y)⟩ = F_1(x_1,y_1) + · · · + F_5(x_5,y_5) + F_{1,2}(y_1,y_2) + · · · + F_{4,5}(y_4,y_5). Example score tables (pairwise: rows = previous letter, columns = next letter):

        F_{i,i+1}   Q    U    V    E        F_1        F_2        F_3
        Q          0.0  0.9  0.1  0.1      Q 1.8      Q 0.2      Q 0.3
        U          0.1  0.1  0.4  0.6      U 0.2      U 1.1      U 0.2
        V          0.0  0.1  0.2  0.5      V 0.1      V 1.2      V 0.1
        E          0.3  0.5  0.5  1.0      E 0.4      E 0.3      E 1.8
  • 19–20. Every y ∈ Y corresponds to a path; argmax_{y∈Y} F is the best path. Example: F(VUQ, x) = F_1(V, x_1) + F_{12}(V, U) + F_2(U, x_2) + F_{23}(U, Q) + F_3(Q, x_3) = 0.1 + 0.1 + 1.1 + 0.1 + 0.3 = 1.7
  • 21–22. Maximal per-letter scores. Total: 1.8 + 0.1 + 1.2 + 0.5 + 1.8 = 5.4
  • 23. Best (=Viterbi) path. Total: 1.8 + 0.9 + 1.1 + 0.6 + 1.8 = 6.2
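As a sketch of the Viterbi computation (assuming the example score tables from slide 18, which cover positions 1–3), the following dynamic program finds the best path in O(M·K²) time instead of enumerating all K^M sequences, and reproduces the 6.2 total above:

    import numpy as np

    letters = ["Q", "U", "V", "E"]
    F_unary = np.array([[1.8, 0.2, 0.1, 0.4],     # F1
                        [0.2, 1.1, 1.2, 0.3],     # F2
                        [0.3, 0.2, 0.1, 1.8]])    # F3
    F_pair = np.array([[0.0, 0.9, 0.1, 0.1],      # F_{i,i+1}[prev][next]
                       [0.1, 0.1, 0.4, 0.6],
                       [0.0, 0.1, 0.2, 0.5],
                       [0.3, 0.5, 0.5, 1.0]])

    M, K = F_unary.shape
    best = F_unary[0].copy()               # best score of a path ending in each label
    back = np.zeros((M, K), dtype=int)     # backpointers
    for i in range(1, M):
        cand = best[:, None] + F_pair + F_unary[i][None, :]   # K x K candidates
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0)

    path = [int(best.argmax())]            # trace the best path backwards
    for i in range(M - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    path.reverse()
    print("".join(letters[j] for j in path), best.max())      # prints: QUE 6.2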
  • 24. Example: Human Pose Estimation. Pictorial Structures for Articulated Pose Estimation. (Figure: tree-structured model over body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot, with (1) unary factors such as F_top(X, Y_top) and (2) pairwise factors such as F_{top,head}.) Many popular models have unary and pairwise terms. Certain higher order terms are also doable, see C. Rother's lectures. [Felzenszwalb and Huttenlocher, 2000], [Fischler and Elschlager, 1973]
  • 25. Thursday/Friday's lecture: how to evaluate argmax_y F(x,y). Loop-free graphs (chain, tree): Viterbi decoding / dynamic programming. Loopy graphs (grid, arbitrary graph): approximate inference (e.g. loopy BP). Today: how to learn a good function F(x,y) from training data.
  • 26–28. Parameter Learning in Structured Models. Given: a parametric model (family) F(x,y) = ⟨w, ϕ(x,y)⟩ and a prediction method f(x) = argmax_{y∈Y} F(x,y). Not given: the parameter vector w (high-dimensional). Supervised training: given example pairs {(x^1, y^1), ..., (x^n, y^n)} ⊂ X × Y, i.e. typical inputs with "the right" outputs for them, determine a "good" w. now–9:50: Probabilistic Training, inspired by probabilistic discriminative classifiers: Logistic Regression → Conditional Random Fields. 10:00–10:50: Maximum Margin Training, inspired by maximum-margin discriminative classifiers: Support Vector Machines → Structured Output SVMs.
  • 29. Probabilistic Training of Structured Models (Conditional Random Fields)
  • 30–34. Probabilistic Models for Structured Prediction. Idea: establish a one-to-one correspondence between the compatibility function F(x,y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x,y) = argmax_{y∈Y} p(y|x) = maximum of the posterior probability = Bayes optimal decision. Gibbs distribution: p(y|x) = (1/Z(x)) e^{F(x,y)} with Z(x) = Σ_{y∈Y} e^{F(x,y)}. Then argmax_y p(y|x) = argmax_y e^{F(x,y)}/Z(x) = argmax_y e^{F(x,y)} = argmax_y F(x,y).
  • 35. Probabilistic Models for Structured Prediction. For learning, we make the dependence on w explicit: F(x,y,w) = ⟨w, ϕ(x,y)⟩. Parametrized Gibbs distribution (loglinear model): p(y|x,w) = (1/Z(x,w)) e^{F(x,y,w)} with Z(x,w) = Σ_{y∈Y} e^{F(x,y,w)}.
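A minimal sketch of the Gibbs construction (toy numbers, a hypothetical random feature map): for a small enough Y we can compute Z(x,w) by explicit enumeration and verify that the probabilistic and the compatibility-based predictions agree.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    d, K, M = 6, 2, 5                                  # feature dim, labels, nodes
    w = rng.normal(size=d)
    labelings = list(itertools.product(range(K), repeat=M))
    phi = {y: rng.normal(size=d) for y in labelings}   # hypothetical feature map

    F = {y: float(w @ phi[y]) for y in labelings}      # F(x,y,w) = <w, phi(x,y)>
    Z = sum(np.exp(f) for f in F.values())             # partition function Z(x,w)
    p = {y: np.exp(F[y]) / Z for y in labelings}       # Gibbs: p(y|x,w) = e^F / Z

    assert abs(sum(p.values()) - 1.0) < 1e-12          # p is a distribution
    assert max(F, key=F.get) == max(p, key=p.get)      # same maximizer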
  • 36–39. Probabilistic Models for Structured Prediction. Supervised training: given i.i.d. example pairs D = {(x^1,y^1), ..., (x^n,y^n)}, determine w. Idea: maximize the posterior probability p(w|D). Use Bayes' rule to infer the posterior distribution for w:
p(w|D) = p(D|w) p(w) / p(D) = p((x^1,y^1), ..., (x^n,y^n)|w) p(w) / p(D)
(i.i.d.) = p(w) Π_{i=1}^n p(x^i, y^i|w) / p(x^i, y^i) = p(w) Π_{i=1}^n [p(y^i|x^i,w) p(x^i|w)] / [p(y^i|x^i) p(x^i)]
assuming x does not depend on w, i.e. p(x|w) ≡ p(x):
= p(w) Π_{i=1}^n p(y^i|x^i,w) / p(y^i|x^i)
  • 40. Posterior probability: p(w|D) = p(w) Π_i p(y^i|x^i,w)/p(y^i|x^i), i.e. prior times (loglinear) data term. The prior p(w) expresses prior knowledge; we cannot infer it from data. Gaussian: p(w) := const · e^{−‖w‖²/(2σ²)}. Gaussianity is often a good guess if you know nothing else, and it is easy to handle analytically and numerically. Less common: Laplacian, p(w) := const · e^{−‖w‖/σ}, if we expect sparsity in w; but the properties of the optimization problem change.
  • 41. Learning w from D ≡ maximizing the posterior probability for w: w* = argmax_{w∈R^d} p(w|D) = argmin_{w∈R^d} −log p(w|D).
−log p(w|D) = −log p(w) − Σ_{i=1}^n log p(y^i|x^i,w) + Σ_{i=1}^n log p(y^i|x^i)   [last term indep. of w]
= ‖w‖²/(2σ²) − Σ_{i=1}^n [F(x^i,y^i,w) − log Z(x^i,w)] + const.   [using p(y|x,w) = (1/Z) e^F]
= ‖w‖²/(2σ²) − Σ_{i=1}^n [⟨w, ϕ(x^i,y^i)⟩ − log Σ_{y∈Y} e^{⟨w, ϕ(x^i,y)⟩}] + c  =: L(w)
  • 42. (Negative) (regularized) conditional log-likelihood (of D):
L(w) = ‖w‖²/(2σ²) − Σ_{i=1}^n [⟨w, ϕ(x^i,y^i)⟩ − log Σ_{y∈Y} e^{⟨w, ϕ(x^i,y)⟩}]
Probabilistic parameter estimation or training means solving w* = argmin_{w∈R^d} L(w). This is the same optimization problem as for multi-class logistic regression:
L_logreg(w) = ‖w‖²/(2σ²) − Σ_{i=1}^n [⟨w_{y^i}, ϕ(x^i)⟩ − log Σ_{y∈Y} e^{⟨w_y, ϕ(x^i)⟩}]
with w = (w_1, ..., w_K) for K classes.
  • 43–49. (Figure: contour plots of the negative conditional log-likelihood for w ∈ R², for σ² = 0.01, 0.10, 1.00, and σ² → ∞.) w* = argmin_{w∈R^d} L(w): w ∈ R^d is continuous (not discrete); w ∈ R^d is high-dimensional (often d ≫ 1000); the optimization has no side-constraints; L(w) is differentiable → we can apply gradient descent; L(w) is convex! → gradient descent will find the global optimum.
  • 50. Steepest Descent Minimization: minimize L(w).
    require: tolerance ε > 0
    w_cur ← 0
    repeat
        v ← −∇_w L(w_cur)                      (descent direction)
        η ← argmin_{η∈R} L(w_cur + η v)        (stepsize by line search)
        w_cur ← w_cur + η v                    (update)
    until ‖v‖ < ε
    return w_cur
Alternatives: L-BFGS (second-order descent without explicit Hessian), Conjugate Gradient.
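A runnable sketch of this training procedure for the special case where Y is small enough to enumerate: here ϕ(x,y) stacks Φ(x) into slot y, so L(w) is exactly the multi-class logistic regression objective from slide 42. The toy data, σ², the fixed stepsize (standing in for the line search), and the iteration count are all assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, K, sigma2 = 100, 5, 3, 1.0
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=(K, d))
    Y = np.array([np.argmax(true_w @ x) for x in X])    # separable toy labels

    def neg_log_lik(w):
        scores = X @ w.T                                 # n x K, <w_y, Phi(x)>
        logZ = np.log(np.exp(scores).sum(axis=1))        # log Z(x^i, w)
        return (w ** 2).sum() / (2 * sigma2) - (scores[np.arange(n), Y] - logZ).sum()

    def grad(w):
        scores = X @ w.T
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)                # p(y|x^i, w) for every y
        g = w / sigma2                                   # gradient of the regularizer
        for y in range(K):                               # phi(x,y^i) - E_{y~p} phi(x,y)
            g[y] -= ((Y == y).astype(float) - p[:, y]) @ X
        return g

    w = np.zeros((K, d))
    for t in range(500):
        w -= 0.01 * grad(w)                              # fixed stepsize, no line search
    print(neg_log_lik(w), (np.argmax(X @ w.T, axis=1) == Y).mean())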
  • 51–56. (Figures: successive steepest-descent iterates w_0 → w_1 → w_2 on the contour plot of the objective.)
  • 57–58. Summary I: Probabilistic Training (Conditional Random Fields). Well-defined probabilistic setting; p(y|x,w) is log-linear in w ∈ R^d. Training: maximize p(w|D) for the data set D. Differentiable, convex optimization with the same structure as logistic regression ⇒ gradient descent will find the global optimum. For logistic regression, this is where the textbook ends: we're done. For conditional random fields, we're not in safe waters yet!
  • 59–60. Problem: computing ∇_w L(w_cur) and evaluating L(w_cur + η v):
∇_w L(w) = (1/σ²) w − Σ_{i=1}^n [ϕ(x^i,y^i) − Σ_{y∈Y} p(y|x^i,w) ϕ(x^i,y)]
L(w) = ‖w‖²/(2σ²) − Σ_{i=1}^n [⟨w, ϕ(x^i,y^i)⟩ − log Σ_{y∈Y} e^{⟨w, ϕ(x^i,y)⟩}]
Y typically is very (exponentially) large: N parts, each with a label from a set L: |Y| = |L|^N; binary image segmentation: |Y| = 2^(640×480) ≈ 10^92475; ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^2568. Without structure in Y, we're lost.
  • 61–62. ∇_w L(w) = (1/σ²) w − Σ_{i=1}^n [ϕ(x^i,y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i,y)], where the expectation requires log Z(x^i,w). Computing the gradient (naive): O(K^M · n · d). Line search (naive): O(K^M · n · d) per evaluation of L. n: number of samples (≈ 10s to 1,000,000s); d: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node (≈ 2 to 100s).
  • 63. How does structure in Y help? Computing Z(x,w) = Σ_{y∈Y} e^{⟨w, ϕ(x,y)⟩} with ϕ and w decomposing over the factors: ϕ(x,y) = (ϕ_F(x_F,y_F))_{F∈ℱ} and w = (w_F)_{F∈ℱ}:
Z(x,w) = Σ_{y∈Y} Π_{F∈ℱ} e^{⟨w_F, ϕ_F(x_F,y_F)⟩} = Σ_{y∈Y} Π_{F∈ℱ} Ψ_F(y_F)  with  Ψ_F(y_F) := e^{⟨w_F, ϕ_F(x_F,y_F)⟩}
A lot of research goes into analyzing this expression.
  • 64–69. Case I: only unary factors, ℱ = {{x_1,y_1}, ..., {x_M,y_M}}:
Z = Σ_{y∈Y} Π_{F∈ℱ} Ψ_F(y_F) = Σ_{y∈Y} Π_{i=1}^M Ψ_i(y_i)
  = Σ_{y_1∈L} Σ_{y_2∈L} · · · Σ_{y_M∈L} Ψ_1(y_1) · · · Ψ_M(y_M)
  = (Σ_{y_1∈L} Ψ_1(y_1)) · (Σ_{y_2∈L} Ψ_2(y_2)) · · · (Σ_{y_M∈L} Ψ_M(y_M))
Case I: O(K^M n d) → O(M K n d)
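A small numeric check of this factorization (toy factor values): the sum over all K^M labelings equals the product of M per-node sums.

    import itertools
    import numpy as np

    rng = np.random.default_rng(3)
    M, K = 6, 4
    Psi = rng.uniform(0.5, 2.0, size=(M, K))   # Psi_i(y_i) = exp(<w_i, phi_i(x_i,y_i)>)

    Z_naive = sum(np.prod([Psi[i, y[i]] for i in range(M)])
                  for y in itertools.product(range(K), repeat=M))   # O(K^M)
    Z_fast = np.prod(Psi.sum(axis=1))                               # O(M K)
    assert np.isclose(Z_naive, Z_fast)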
  • 70. (Examples: pose estimation and image segmentation, original image vs. result with unary factors only.)
  • 71–75. Case II: chain/tree with unary and pairwise factors, ℱ = {{x_1,y_1}, {y_1,y_2}, ..., {y_{M−1},y_M}, {x_M,y_M}}:
Z = Σ_{y∈Y} Π_{F∈ℱ} Ψ_F(y_F) = Σ_{y∈Y} Π_{i=1}^M Ψ_i(y_i) Π_{i=2}^M Ψ_{i−1,i}(y_{i−1}, y_i)
  = Σ_{y_1∈L} Ψ_1 Σ_{y_2∈L} Ψ_{1,2} Ψ_2 · · · Σ_{y_{M−1}∈L} Ψ_{M−2,M−1} Ψ_{M−1} Σ_{y_M∈L} Ψ_{M−1,M} Ψ_M,
i.e. a sequence of vector-matrix products.
Case II: O(M K² n d); independent was O(M K n d), naive O(K^M n d)
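The same check for the chain case (toy numbers): the forward pass below is the sum-product analogue of Viterbi, computing Z in O(M K²) instead of O(K^M).

    import itertools
    import numpy as np

    rng = np.random.default_rng(4)
    M, K = 5, 3
    Psi_u = rng.uniform(0.5, 2.0, size=(M, K))         # unary factors Psi_i(y_i)
    Psi_p = rng.uniform(0.5, 2.0, size=(M - 1, K, K))  # pairwise Psi_{i-1,i}

    Z_naive = 0.0                                      # brute-force enumeration
    for y in itertools.product(range(K), repeat=M):
        v = np.prod([Psi_u[i, y[i]] for i in range(M)])
        v *= np.prod([Psi_p[i, y[i], y[i + 1]] for i in range(M - 1)])
        Z_naive += v

    msg = Psi_u[0]                                     # forward message passing
    for i in range(1, M):
        msg = (msg @ Psi_p[i - 1]) * Psi_u[i]
    Z_fast = msg.sum()
    assert np.isclose(Z_naive, Z_fast)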
  • 76. (Examples: pose estimation and image segmentation, original image vs. unary only vs. unary + pairwise.)
  • 77. Message Passing in Trees: rephrase the matrix multiplications as sending messages from node to node: Belief Propagation (BP). This can also compute expectations E_{y∼p(y|x^i,w)} ϕ(x^i,y). Message Passing in Loopy Graphs: for a loopy graph, send the same messages and iterate: Loopy Belief Propagation (LBP). This yields only an approximation to Z(x,w) and E_{y∼p(y|x^i,w)} ϕ(x^i,y); there is no convergence guarantee, but variants are often used in practice. More: our booklet, and Pawan Kumar's thesis "Combinatorial and Convex Optimization for Probabilistic Models in Computer Vision", http://ai.stanford.edu/~pawan/kumar-thesis.pdf
  • 78–79. With BP on tree-structured models the costs drop: computing the gradient: O(M K² n d) instead of O(K^M n d) (if BP is possible); line search: O(M K² n d) instead of O(K^M n d) per evaluation of L. n: number of samples (≈ 1,000s to 1,000,000s); d: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node.
  • 80–86. What if our training set D is too large (e.g. millions of examples)? Switch to a simpler classifier: train independent per-node classifiers; we lose all label dependences, and it is still slow. Subsampling: create a random subset D' ⊂ D and perform gradient descent to maximize p(w|D'); this ignores all information in D ∖ D'. Parallelize: train several smaller models on different computers and merge the models; this follows the multi-core trend, but it is unclear how to merge the models (the partition functions do not combine), and it doesn't reduce computation, it just distributes it.
  • 87–91. What if our training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): keep maximizing p(w|D), but in each gradient descent step create a random subset D' ⊂ D (often just 1–3 elements!) and follow the approximate gradient
∇̃_w L(w) = (1/σ²) w − Σ_{(x^i,y^i)∈D'} [ϕ(x^i,y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i,y)]
Line search is no longer possible; extra parameter: stepsize η_t. SGD converges to argmin_w L(w), i.e. argmax_w p(w|D) (with η_t chosen right). SGD needs more iterations, but each one is much faster.
more: L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS 2008. also: http://leon.bottou.org/research/largescale
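A sketch of SGD on the same toy objective as in the gradient-descent example (|D'| = 1, η_t = 1/t; scaling the regularizer by 1/n to form an unbiased per-example gradient is an assumption here, one common way to do it):

    import numpy as np

    rng = np.random.default_rng(5)
    n, d, K, sigma2 = 100, 5, 3, 1.0
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=(K, d))
    Y = np.array([np.argmax(true_w @ x) for x in X])

    w = np.zeros((K, d))
    for t in range(1, 5001):
        i = rng.integers(n)                     # random subset D' of size 1
        s = X[i] @ w.T
        p = np.exp(s - s.max()); p /= p.sum()   # p(y | x^i, w)
        g = w / (sigma2 * n)                    # per-example share of the regularizer
        for y in range(K):                      # phi(x,y^i) - E_{y~p} phi(x,y)
            g[y] -= ((Y[i] == y) - p[y]) * X[i]
        w -= (1.0 / t) * g                      # stepsize eta_t = 1/t
    print((np.argmax(X @ w.T, axis=1) == Y).mean())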
  • 92. With BP and mini-batches, the gradient and each line-search evaluation of L cost O(M K² n d) instead of O(K^M n d). The remaining bottleneck can be d, the dimension of the feature space: ϕ_{i,j} ≈ 1–10s, ϕ_i ≈ 100s to 10,000s. n: number of samples; M: number of output nodes; K: number of possible labels of each output node.
  • 93. Typical feature functions in image segmentation: ϕ_i(y_i, x) ∈ R^≈1000: local image features, e.g. bag-of-words → ⟨w_i, ϕ_i(y_i, x)⟩: local classifier (like logistic regression). ϕ_{i,j}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R^1: test for same label → ⟨w_ij, ϕ_ij(y_i, y_j)⟩: penalizer for label changes (if w_ij > 0). Combined: argmax_y p(y|x) is a smoothed version of the local cues. (Figure: original image, local classification, local + smoothness.)
  • 94. Typical feature functions in handwriting recognition: ϕ_i(y_i, x) ∈ R^≈1000: image representation (pixels, gradients) → ⟨w_i, ϕ_i(y_i, x)⟩: local classifier for whether x_i is letter y_i. ϕ_{i,j}(y_i, y_j) = e_{y_i} ⊗ e_{y_j} ∈ R^(26·26): letter/letter indicator → ⟨w_ij, ϕ_ij(y_i, y_j)⟩: encourage/suppress letter combinations. Combined: argmax_y p(y|x) is a corrected version of the local cues (e.g. local classification Q V E S T, corrected output Q U E S T).
  • 95. Typical feature functions in pose estimation: ϕ_i(y_i, x) ∈ R^≈1000: local image representation, e.g. HoG → ⟨w_i, ϕ_i(y_i, x)⟩: local confidence map. ϕ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit → ⟨w_ij, ϕ_ij(y_i, y_j)⟩: penalizer for unrealistic poses. Together: argmax_y p(y|x) is a sanitized version of the local cues. (Figure: original, local classification, local + geometry.) Images: [Ferrari, Marin-Jimenez, Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008]
  • 96. Typical feature functions for CRFs in Computer Vision: ϕ_i(y_i, x): local representation, high-dimensional → ⟨w_i, ϕ_i(y_i, x)⟩: local classifier. ϕ_{i,j}(y_i, y_j): prior knowledge, low-dimensional → ⟨w_ij, ϕ_ij(y_i, y_j)⟩: penalize outliers. Learning adjusts the parameters: unary w_i: learn local classifiers and their importance; pairwise w_ij: learn the importance of smoothing/penalization. argmax_y p(y|x) is a cleaned-up version of the local prediction.
  • 97–98. Idea: split learning of unary potentials into two parts: local classifiers, and their importance. Two-Stage Training: pre-train f_i^y(x) = log p̂(y_i|x); use ϕ̃_i(y_i, x) := f_i^y(x) ∈ R^K (low-dimensional); keep ϕ_ij(y_i, y_j) as before; perform CRF learning with ϕ̃_i and ϕ_ij. Advantage: the CRF feature space is much lower dimensional → faster training; the f_i^y(x) can be stronger classifiers, e.g. non-linear SVMs. Disadvantage: if the local classifiers are really bad, CRF training cannot fix that.
  • 99. Summary II: Numeric Training. (All?) current CRF training methods rely on gradient descent:
∇_w L(w) = w/σ² − Σ_{i=1}^n [ϕ(x^i,y^i) − Σ_{y∈Y} p(y|x^i,w) ϕ(x^i,y)] ∈ R^d
The faster we can do it, the better (more realistic) models we can use. A lot of research goes into speeding up CRF training:

        problem         solution                 method(s)
        |Y| too large   exploit structure        efficient inference (e.g. LBP)
                                                 smart sampling, contrastive divergence
        n too large     mini-batches             stochastic gradient descent
        d too large     discriminative ϕ_unary   two-stage training
  • 100. Conditional Random Fields Summary. CRFs model the conditional probability distribution p(y|x) = (1/Z) e^{⟨w, ϕ(x,y)⟩}. x ∈ X is (almost) arbitrary input: we need a feature function ϕ : X × Y → R^d, but we don't need to know/estimate p(x) or p(ϕ(x))! y ∈ Y is output with a structure: we need structure in Y that makes training/inference feasible, e.g. graph labeling: local properties plus neighborhood relations. w ∈ R^d is the parameter vector. CRF training is solving w* = argmax_w p(w|D) for the training set D; after learning, we can predict: f(x) := argmax_y p(y|x). Many many many more details in the Machine Learning literature, e.g.: • [M. Wainwright, M. Jordan: Graphical Models, Exponential Families, and Variational Inference, now publishers, 2008]. Free PDF at http://www.nowpublishers.com • [D. Koller, N. Friedman: Probabilistic Graphical Models, MIT Press, 2010]
  • 101. Extra I: Beyond Fully Supervised Learning So far, training was fully supervised, all variables were observed. In real life, some variables can be unobserved even during training. missing labels in training data latent variables, e.g. part location latent variables, e.g. part occlusion latent variables, e.g. viewpoint
  • 102. Graphical model: three types of variables: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Marginalization over Latent Variables: construct the conditional likelihood as usual, p(y,z|x,w) = (1/Z(x,w)) exp(⟨w, ϕ(x,y,z)⟩), and derive p(y|x,w) by marginalizing over z: p(y|x,w) = Σ_{z∈Z} p(y,z|x,w) = (1/Z(x,w)) Σ_{z∈Z} exp(⟨w, ϕ(x,y,z)⟩).
  • 103. Negative regularized conditional log-likelihood:
L(w) = λ‖w‖² − Σ_{n=1}^N log p(y^n|x^n, w)
     = λ‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} p(y^n, z|x^n, w)
     = λ‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} exp(⟨w, ϕ(x^n, y^n, z)⟩) + Σ_{n=1}^N log Σ_{z∈Z} Σ_{y∈Y} exp(⟨w, ϕ(x^n, y, z)⟩)
L is not convex in w → it can have local minima; there is no agreed-on best way of treating latent variables.
  • 104. Maximum Margin Training of Structured Models (Structured Output SVMs)
  • 105–106. Example: Object Localization. Learn a localization function f : X → Y with X = {images}, Y = {bounding boxes}. Compatibility function: F : X × Y → R with F(x,y) = ⟨w, ϕ(x,y)⟩, with ϕ(x,y), e.g., the BoW histogram of the region y in x. f(x) = argmax_{y∈Y} F(x,y)
  • 107–108. Learn a localization function f(x) = argmax_{y∈Y} F(x,y) with F(x,y) = ⟨w, ϕ(x,y)⟩; ϕ(x,y) = BoW histogram of the region y = [left, top, right, bottom] in x. p(y|x,w) = (1/Z) exp(⟨w, ϕ(x,y)⟩) with Z = Σ_{y'∈Y} exp(⟨w, ϕ(x,y')⟩). We can't train probabilistically: ϕ doesn't decompose, and Z is too hard. We can do MAP prediction: sliding window or branch-and-bound. Can we learn structured prediction functions without Z?
  • 109–110. Loss-Minimizing Parameter Learning. Let d(x,y) be the (unknown) true data distribution; let D = {(x^1,y^1), ..., (x^n,y^n)} be i.i.d. samples from d(x,y); let ϕ : X × Y → R^D be a feature function and ∆ : Y × Y → R a loss function. Find a weight vector w* that leads to minimal expected loss E_{(x,y)∼d(x,y)} {∆(y, f(x))} (∗) for f(x) = argmax_{y∈Y} ⟨w, ϕ(x,y)⟩. Pro: we directly optimize for the quantity of interest, the expected loss; no expensive-to-compute partition function Z will show up. Con: 1) we need to know the loss function already at training time; 2) we don't know d(x,y), so we can't compute (∗); 3) even if we did, minimizing (∗) would be NP-hard.
  • 111. 1) Loss function. How to judge if a prediction is good? Define a loss function ∆ : Y × Y → R_+, where ∆(y*, y) measures the loss incurred by predicting y* when y is correct. The loss function is application dependent.
  • 112. Example 1: 0/1 loss. Loss is 0 for a perfect prediction, 1 otherwise: ∆_{0/1}(y*, y) = ⟦y* ≠ y⟧, i.e. 0 if y* = y and 1 otherwise. Every mistake is equally bad. Usually not very useful in structured prediction.
  • 113. Example 2: Hamming loss. Count the number of mislabeled variables: ∆_H(y*, y) = (1/|V|) Σ_{i∈V} ⟦y*_i ≠ y_i⟧. Used, e.g., in image segmentation.
  • 114. Example 3: squared error. If we can add elements in Y_i (pixel intensities, optical flow vectors, etc.): sum of squared errors ∆_Q(y*, y) = (1/|V|) Σ_{i∈V} ‖y*_i − y_i‖². Used, e.g., in stereo reconstruction and part-based object detection.
  • 115. Example 4: task-specific losses. Object detection: detection bounding boxes, or arbitrary regions, compared to ground truth. Area overlap loss: ∆_AO(y*, y) = 1 − area(y* ∩ y)/area(y* ∪ y). Used, e.g., in the PASCAL VOC challenges for object detection, because it is scale-invariant (no bias for or against big objects).
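The four example losses written out as code (a sketch; the box format (left, top, right, bottom) and the normalizations are assumptions):

    import numpy as np

    def loss_01(y_star, y):                          # 0/1 loss
        return 0.0 if np.array_equal(y_star, y) else 1.0

    def loss_hamming(y_star, y):                     # fraction of mislabeled nodes
        y_star, y = np.asarray(y_star), np.asarray(y)
        return float((y_star != y).mean())

    def loss_squared(y_star, y):                     # mean squared error per node
        y_star, y = np.asarray(y_star, float), np.asarray(y, float)
        return float(((y_star - y) ** 2).mean())

    def loss_area_overlap(b1, b2):                   # 1 - intersection / union
        l = max(b1[0], b2[0]); t = max(b1[1], b2[1])
        r = min(b1[2], b2[2]); b = min(b1[3], b2[3])
        inter = max(0, r - l) * max(0, b - t)
        area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
        return 1.0 - inter / (area(b1) + area(b2) - inter)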
  • 116–117. 2)+3) Learning by regularized risk minimization. Instead of minimizing E_{(x,y)∼d(x,y)} {∆(y, f(x))} directly, solve a regularized risk minimization task: find w for the decision function f = ⟨w, ϕ(x,y)⟩ by
min_{w∈R^d} λ‖w‖² + Σ_{i=1}^n ℓ(y^i, f(x^i))   (regularization + error/loss on data)
Maximum-margin training uses the hinge loss:
ℓ(y^i, f(x^i)) := max(0, max_{y∈Y} [∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩])
with ∆(y, y') ≥ 0 and ∆(y, y) = 0 the loss function between outputs.
  • 118–120. Structured Output Support Vector Machine (S-SVM):
min_{w∈R^d} ½‖w‖² + (C/n) Σ_{i=1}^n max_{y∈Y} [∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩]
Conditional Random Fields (CRFs):
min_{w∈R^d} ‖w‖²/(2σ²) + Σ_{i=1}^n log Σ_{y∈Y} exp(⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩)
Sidenote: CRFs and S-SVMs are really not that different: both do regularized risk minimization, just with different loss terms. Don't believe people who say one is (in general) better than the other.
  • 121. (Figure: contour plots of the S-SVM objective function for w ∈ R², for C = 0.01, 0.10, 1.00, and C → ∞.)
  • 122. Structured Output SVM:
min_{w∈R^d} ½‖w‖² + (C/n) Σ_{i=1}^n max_{y∈Y} [∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩]
Unconstrained optimization, convex, non-differentiable objective.
  • 123. Structured Output SVM (equivalent formulation):
min_{w∈R^d, ξ∈R^n_+} ½‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n: max_{y∈Y} [∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩] ≤ ξ_i
n non-linear constraints, convex, differentiable objective.
  • 124. Structured Output SVM (also equivalent formulation):
min_{w∈R^d, ξ∈R^n_+} ½‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n: ∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩ ≤ ξ_i for all y ∈ Y
n·|Y| linear constraints, convex, differentiable objective.
  • 125–126. Example: a true multiclass SVM. Y = {1, 2, ..., K}, ∆(y, y') = 1 for y ≠ y' and 0 otherwise. ϕ(x,y) = (⟦y = 1⟧ Φ(x), ⟦y = 2⟧ Φ(x), ..., ⟦y = K⟧ Φ(x)) = Φ(x) e_y with e_y the y-th unit vector. Solve:
min_{w,ξ} ½‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n: ⟨w, ϕ(x^i, y^i)⟩ − ⟨w, ϕ(x^i, y)⟩ ≥ 1 − ξ_i for all y ∈ Y ∖ {y^i}.
Classification: MAP, f(x) = argmax_{y∈Y} ⟨w, ϕ(x,y)⟩. This is the Crammer-Singer multiclass SVM.
  • 127. Hierarchical Multiclass Classification. The loss function can reflect the class hierarchy (e.g. cat, dog, car, bus): ∆(y, y') := ½ (distance in tree), so ∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc. Solve:
min_{w,ξ} ½‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n: ⟨w, ϕ(x^i, y^i)⟩ − ⟨w, ϕ(x^i, y)⟩ ≥ ∆(y^i, y) − ξ_i for all y ∈ Y.
  • 128. Numeric solution: solving an S-SVM like a CRF. How to solve
min_{w∈R^d} ½‖w‖² + (C/n) Σ_{i=1}^n max_{y∈Y} [∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩] ?
It is unconstrained and convex, but non-differentiable → we can't use gradient descent directly. Idea: use sub-gradient descent.
  • 129–132. Sub-Gradients. Definition: let f : R^d → R be a convex function. A vector v ∈ R^d is called a sub-gradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w. (Figure: a convex function with several supporting hyperplanes f(w_0) + ⟨v, w − w_0⟩ at a kink w_0.) For differentiable f, the gradient v = ∇f(w_0) is the only subgradient.
  • 133–140. Computing a subgradient: min_w ½‖w‖² + (C/n) Σ_{i=1}^n ℓ^i(w) with ℓ^i(w) = max{0, max_y ℓ^i_y(w)} and ℓ^i_y(w) := ∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩. For each y ∈ Y, ℓ^i_y(w) is a linear function of w; ℓ^i(w) is the pointwise maximum over all y ∈ Y and 0. Subgradient of ℓ^i at w_0: find the maximal (active) y and use v = ∇ℓ^i_y(w_0). (Figure: the upper envelope of the linear functions ℓ_y(w), with the active line at w_0 giving the subgradient.)
  • 141. Subgradient Descent S-SVM Training.
input: training pairs {(x^1, y^1), ..., (x^n, y^n)} ⊂ X × Y, feature map ϕ(x,y), loss function ∆(y, y'), regularizer C, number of iterations T, stepsizes η_t for t = 1, ..., T
    1: w ← 0
    2: for t = 1, ..., T do
    3:   for i = 1, ..., n do
    4:     ŷ ← argmax_{y∈Y} ∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩ − ⟨w, ϕ(x^i, y^i)⟩
    5:     v^i ← ϕ(x^i, y^i) − ϕ(x^i, ŷ)
    6:   end for
    7:   w ← w − η_t (w − (C/n) Σ_i v^i)
    8: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, ϕ(x,y)⟩.
Observation: each update of w needs 1 argmax-prediction per example.
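A runnable sketch of this algorithm for the multiclass case of slide 126 (ϕ(x,y) = Φ(x) in slot y, ∆ = 0/1 loss); the toy data, T, and the choice η_t = 1/t are assumptions:

    import numpy as np

    rng = np.random.default_rng(6)
    n, d, K, C = 200, 5, 4, 1.0
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=(K, d))
    Y = np.argmax(X @ true_w.T, axis=1)        # separable toy labels

    w = np.zeros((K, d))                       # one weight vector per class
    for t in range(1, 201):
        V = np.zeros_like(w)
        for i in range(n):
            scores = X[i] @ w.T
            # loss-augmented argmax: Delta(y_i,y) + <w,phi(x_i,y)> - <w,phi(x_i,y_i)>
            aug = scores - scores[Y[i]] + (np.arange(K) != Y[i])
            y_hat = int(np.argmax(aug))
            V[Y[i]] += X[i]                    # v^i = phi(x^i,y^i) - phi(x^i,y_hat)
            V[y_hat] -= X[i]
        w -= (1.0 / t) * (w - (C / n) * V)     # step 7 of the algorithm
    print("train accuracy:", (np.argmax(X @ w.T, axis=1) == Y).mean())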
  • 142–143. Numeric solution: solving an S-SVM like a linear SVM. The other, equivalent, S-SVM formulation was
min_{w∈R^d, ξ∈R^n_+} ‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n: ⟨w, ϕ(x^i, y^i)⟩ − ⟨w, ϕ(x^i, y)⟩ ≥ ∆(y^i, y) − ξ_i for all y ∈ Y.
With feature vectors δϕ(x^i, y^i, y) := ϕ(x^i, y^i) − ϕ(x^i, y), this is
min_{w∈R^d, ξ∈R^n_+} ‖w‖² + (C/n) Σ_{i=1}^n ξ_i
subject to, for i = 1, ..., n, for all y ∈ Y: ⟨w, δϕ(x^i, y^i, y)⟩ ≥ ∆(y^i, y) − ξ_i.
  • 144–146. Solve (QP): min_{w∈R^d, ξ∈R^n_+} ‖w‖² + (C/n) Σ_{i=1}^n ξ_i subject to, for i = 1, ..., n, for all y ∈ Y: ⟨w, δϕ(x^i, y^i, y)⟩ ≥ ∆(y^i, y) − ξ_i. This has the same structure as an ordinary SVM: quadratic objective, linear constraints. Question: can't we use an ordinary QP solver to solve this? Answer: almost! We could, if there weren't n(|Y| − 1) constraints. E.g. 100 binary 16×16 images: ≈ 10^79 constraints.
  • 147–149. Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: start with working set S = ∅ (no constraints); repeat until convergence: solve the S-SVM training problem with constraints from S; check if the solution violates any constraint of the full set; if no: we found the optimal solution, terminate; if yes: add the most violated constraints to S, iterate. Good practical performance and theoretic guarantees: polynomial-time convergence to within ε of the global optimum. [Tsochantaridis et al. Large Margin Methods for Structured and Interdependent Output Variables, JMLR, 2005]
  • 150. Working Set S-SVM Training.
input: training pairs {(x^1, y^1), ..., (x^n, y^n)} ⊂ X × Y, feature map ϕ(x,y), loss function ∆(y, y'), regularizer C
    1: S ← ∅
    2: repeat
    3:   (w, ξ) ← solution to the QP only with constraints from S
    4:   for i = 1, ..., n do
    5:     ŷ ← argmax_{y∈Y} ∆(y^i, y) + ⟨w, ϕ(x^i, y)⟩
    6:     if ŷ ≠ y^i then
    7:       S ← S ∪ {(x^i, ŷ)}
    8:     end if
    9:   end for
   10: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, ϕ(x,y)⟩.
Observation: each update of w needs 1 argmax-prediction per example (but we solve globally for w, not locally).
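A sketch of working-set training for the same multiclass toy case, using the third-party cvxpy package as a generic QP solver (an assumption; the talk does not prescribe a solver). Constraints are added only while some example violates its margin by more than a tolerance eps:

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(7)
    n, d, K, C, eps = 50, 4, 3, 1.0, 1e-3
    X = rng.normal(size=(n, d))
    Y = np.argmax(X @ rng.normal(size=(K, d)).T, axis=1)

    w = cp.Variable((K, d))
    xi = cp.Variable(n, nonneg=True)
    S = []                                    # working set of (i, y) constraints
    while True:
        cons = [w[Y[i]] @ X[i] - w[y] @ X[i] >= 1.0 - xi[i] for (i, y) in S]
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + (C / n) * cp.sum(xi)),
                   cons).solve()
        W, Xi = w.value, xi.value
        added = 0
        for i in range(n):
            scores = X[i] @ W.T
            aug = scores - scores[Y[i]] + (np.arange(K) != Y[i])   # Delta + margin
            y_hat = int(np.argmax(aug))
            if y_hat != Y[i] and aug[y_hat] > Xi[i] + eps and (i, y_hat) not in S:
                S.append((i, y_hat)); added += 1                   # most violated
        if added == 0:
            break                              # no violated constraints left
    print(len(S), "constraints in the final working set")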
  • 151. One-Slack Structured Output SVM
    (equivalent to the ordinary S-SVM via $\xi = \tfrac{1}{n}\sum_i \xi_i$)
        $\min_{w \in \mathbb{R}^d,\, \xi \in \mathbb{R}_+}\; \tfrac{1}{2}\|w\|^2 + C\xi$
    subject to, for all $(\hat y^1,\dots,\hat y^n) \in \mathcal{Y} \times \dots \times \mathcal{Y}$,
        $\sum_{i=1}^{n} \Delta(y^i, \hat y^i) + \Big\langle w,\, \sum_{i=1}^{n} [\phi(x^i, \hat y^i) - \phi(x^i, y^i)] \Big\rangle \;\le\; n\xi$.
    $|\mathcal{Y}|^n$ linear constraints, convex, differentiable objective.
    We have blown up the constraint set even further. 100 binary $16 \times 16$ images: $10^{177}$ constraints (instead of $10^{79}$).
  • 152. Working Set One-Slack S-SVM Training
    input: training pairs $\{(x^1,y^1),\dots,(x^n,y^n)\} \subset \mathcal{X} \times \mathcal{Y}$,
           feature map $\phi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$
     1: $S \leftarrow \emptyset$
     2: repeat
     3:   $(w,\xi) \leftarrow$ solution to the QP with only the constraints in $S$
     4:   for $i = 1,\dots,n$ do
     5:     $\hat y^i \leftarrow \mathrm{argmax}_{y\in\mathcal{Y}}\; \Delta(y^i,y) + \langle w, \phi(x^i,y)\rangle$
     6:   end for
     7:   $S \leftarrow S \cup \{((x^1,\dots,x^n), (\hat y^1,\dots,\hat y^n))\}$
     8: until $S$ doesn't change anymore
    output: prediction function $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} \langle w, \phi(x,y)\rangle$
    Often faster convergence: we add one strong constraint per iteration instead of $n$ weak ones.
    [Joachims, Finley, Yu, "Cutting-Plane Training of Structural SVMs", Machine Learning, 2009]
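    In code, the only change to the n-slack sketch above is the constraint-generation step: the $n$ per-example argmaxes are merged into one aggregated cut. A hypothetical continuation, reusing `phi`, `delta`, `K`, and numpy from the earlier sketch:

        # One-slack constraint generation: all n loss-augmented argmaxes
        # are merged into a single constraint
        #   <w, sum_i dphi_i>  >=  sum_i delta_i  -  n * xi
        # with one shared slack xi instead of xi_1..xi_n.
        def one_slack_cut(w, X, Y):
            dphi = np.zeros_like(w)
            loss = 0.0
            for i in range(len(X)):
                y_hat = max(range(K),
                            key=lambda y: delta(Y[i], y) + w @ phi(X[i], y))
                dphi += phi(X[i], Y[i]) - phi(X[i], y_hat)
                loss += delta(Y[i], y_hat)
            return dphi, loss        # defines one strong cutting plane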
  • 153. Numeric Solution: Solving an S-SVM like a non-linear SVM.
    Dual S-SVM problem:
        $\max_{\alpha \in \mathbb{R}_+^{n|\mathcal{Y}|}}\; \sum_{i=1,\dots,n}\, \sum_{y\in\mathcal{Y}} \alpha_{iy}\, \Delta(y^i,y) \;-\; \tfrac{1}{2} \sum_{i,i'=1,\dots,n}\, \sum_{y,y'\in\mathcal{Y}} \alpha_{iy}\, \alpha_{i'y'}\, \langle \delta\phi(x^i,y^i,y),\, \delta\phi(x^{i'},y^{i'},y') \rangle$
    subject to, for $i = 1,\dots,n$:  $\sum_{y\in\mathcal{Y}} \alpha_{iy} \le \tfrac{C}{n}$.
    Like for ordinary SVMs, we can dualize the S-SVM optimization: min becomes max, the original (primal) variables $w, \xi$ disappear, and we get one dual variable $\alpha_{iy}$ for each primal constraint.
    $n$ linear constraints, convex, differentiable objective, $n|\mathcal{Y}|$ variables.
  • 154. Data appears only inside inner products: kernelize!
    Define a joint kernel function $k : (\mathcal{X}\times\mathcal{Y}) \times (\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$,
        $k((x,y),(x',y')) = \langle \phi(x,y), \phi(x',y') \rangle$.
    $k$ measures the similarity between two (input, output) pairs. We can express the optimization in terms of $k$:
        $\langle \delta\phi(x^i,y^i,y),\, \delta\phi(x^{i'},y^{i'},y') \rangle$
        $= \langle \phi(x^i,y^i) - \phi(x^i,y),\; \phi(x^{i'},y^{i'}) - \phi(x^{i'},y') \rangle$
        $= \langle \phi(x^i,y^i), \phi(x^{i'},y^{i'}) \rangle - \langle \phi(x^i,y^i), \phi(x^{i'},y') \rangle - \langle \phi(x^i,y), \phi(x^{i'},y^{i'}) \rangle + \langle \phi(x^i,y), \phi(x^{i'},y') \rangle$
        $= k((x^i,y^i),(x^{i'},y^{i'})) - k((x^i,y^i),(x^{i'},y')) - k((x^i,y),(x^{i'},y^{i'})) + k((x^i,y),(x^{i'},y'))$
        $=: K_{ii'yy'}$
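    The four-term expansion translates directly into code; `k_joint` below is a placeholder for any valid joint kernel (a sketch, not from the talk):

        # One entry K_{ii'yy'} of the dual's kernel tensor, computed from
        # four evaluations of the joint kernel k((x,y),(x',y')).
        def K_entry(k_joint, xi, yi, y, xj, yj, y2):
            return (k_joint(xi, yi, xj, yj)
                    - k_joint(xi, yi, xj, y2)
                    - k_joint(xi, y,  xj, yj)
                    + k_joint(xi, y,  xj, y2))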
  • 155. Kernelized S-SVM problem:
        $\max_{\alpha \in \mathbb{R}_+^{n|\mathcal{Y}|}}\; \sum_{i=1,\dots,n}\, \sum_{y\in\mathcal{Y}} \alpha_{iy}\, \Delta(y^i,y) - \tfrac{1}{2} \sum_{i,i'=1,\dots,n}\, \sum_{y,y'\in\mathcal{Y}} \alpha_{iy}\, \alpha_{i'y'}\, K_{ii'yy'}$
    subject to, for $i = 1,\dots,n$:  $\sum_{y\in\mathcal{Y}} \alpha_{iy} \le \tfrac{C}{n}$.
    Too many variables: train with a working set of the $\alpha_{iy}$.
    Kernelized prediction function:
        $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} \sum_{iy} \alpha_{iy}\, k((x^i, y^i), (x, y))$
    Be careful in kernel design: evaluating $f$ can easily become very slow or completely infeasible.
  • 156. What do joint kernel functions look like?
        $k((x,y),(x',y')) = \langle \phi(x,y), \phi(x',y') \rangle$
    Graphical model: $\phi$ decomposes into components per factor,
        $\phi(x,y) = \big(\phi_f(x_f, y_f)\big)_{f\in F}$,
    so the kernel $k$ decomposes into a sum over factors:
        $k((x,y),(x',y')) = \sum_{f\in F} \langle \phi_f(x_f,y_f),\, \phi_f(x'_f,y'_f) \rangle = \sum_{f\in F} k_f((x_f,y_f),(x'_f,y'_f))$.
    We can define kernels for each factor (e.g. nonlinear).
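    As a sketch of this decomposition (names illustrative, not from the talk):

        # A joint kernel that decomposes over the factors of a graphical
        # model. `restrict` extracts the restriction (x_f, y_f) of a
        # labeling to factor f; `factor_kernels` maps each factor to its
        # own (possibly nonlinear) kernel.
        def factorized_joint_kernel(x, y, x2, y2, factors, factor_kernels, restrict):
            total = 0.0
            for f in factors:
                xf, yf = restrict(x, y, f)
                xf2, yf2 = restrict(x2, y2, f)
                total += factor_kernels[f](xf, yf, xf2, yf2)
            return total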
  • 157. Example: figure-ground segmentation with grid structure. $(x,y)\,\hat=\,$(image, binary segmentation mask).
    Typical kernels: arbitrary in $x$, linear (or at least simple) w.r.t. $y$:
      - Unary factors: $k_p((x, y_p), (x', y'_p)) = k(x_{N(p)}, x'_{N(p)}) \cdot [\![y_p = y'_p]\!]$,
        with $k$ an image kernel on the local neighborhood $N(p)$, e.g. $\chi^2$ or histogram intersection.
      - Pairwise factors: $k_{pq}((y_p, y_q), (y'_p, y'_q)) = [\![y_p = y'_p]\!]\,[\![y_q = y'_q]\!]$.
    More powerful than all-linear, and MAP prediction is still possible.
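    A possible rendering of these two per-factor kernels (a sketch under the assumption that local appearance is summarized by histograms; names illustrative):

        import numpy as np

        # Unary factor: histogram-intersection image kernel on the local
        # neighborhood, gated by label agreement [y_p == y'_p].
        def unary_kernel(hist_p, y_p, hist_p2, y_p2):
            return float(np.minimum(hist_p, hist_p2).sum()) * (y_p == y_p2)

        # Pairwise factor: product of label-agreement indicators,
        # [y_p == y'_p] * [y_q == y'_q]; linear in the labels.
        def pairwise_kernel(y_p, y_q, y_p2, y_q2):
            return float((y_p == y_p2) and (y_q == y_q2))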
  • 158. Example: object localization. $(x,y)\,\hat=\,$(image, bounding box given by left/top/right/bottom).
    Only one factor, which includes all of $x$ and $y$:
        $k((x,y),(x',y')) = k_{\text{image}}(x|_y, x'|_{y'})$,
    with $k_{\text{image}}$ an image kernel and $x|_y$ the image region within box $y$.
    MAP prediction is just object localization with a $k_{\text{image}}$-SVM:
        $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} \sum_{iy} \alpha_{iy}\, k_{\text{image}}(x^i|_y, x|_y)$
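    In code this single-factor kernel is just a crop followed by an image kernel; `crop` and `k_image` below are placeholders (e.g. a $\chi^2$ kernel on bag-of-visual-words histograms), not part of the talk:

        # Localization joint kernel: compare the image content inside the
        # two bounding boxes with an image kernel.
        def localization_kernel(x, box, x2, box2, crop, k_image):
            return k_image(crop(x, box), crop(x2, box2))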
  • 159. Summary I – S-SVM Learning
    Task: parameter learning. Given a training set $\{(x^1,y^1),\dots,(x^n,y^n)\} \subset \mathcal{X}\times\mathcal{Y}$,
    learn parameters $w$ for the prediction function $F(x,y) = \langle w, \phi(x,y) \rangle$,
    and predict via $f : \mathcal{X} \to \mathcal{Y}$ with $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} F(x,y)$ (= MAP).
    Solution from the maximum-margin principle: enforce that the correct output beats every other output by a margin,
        $\langle w, \phi(x^n, y^n) \rangle \ge \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle$  for all $y \in \mathcal{Y}$.
    Convex optimization problem (similar to a multiclass SVM); many equivalent formulations → different training algorithms; training calls argmax repeatedly, no probabilistic inference.
  • 160. Extra I: Beyond Fully Supervised Learning
    So far, training was fully supervised: all variables were observed. In real life, some variables are unobserved even during training:
      - missing labels in the training data,
      - latent variables, e.g. part location,
      - latent variables, e.g. part occlusion,
      - latent variables, e.g. viewpoint.
  • 161. Three types of variables:
      - $x \in \mathcal{X}$: always observed,
      - $y \in \mathcal{Y}$: observed only during training,
      - $z \in \mathcal{Z}$: never observed (latent).
    Decision function: $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} \max_{z\in\mathcal{Z}} \langle w, \phi(x,y,z) \rangle$
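    For small finite label and latent sets, this decision function is a two-level brute-force maximization (a sketch, not from the talk; assumes `w` and `phi(x, y, z)` are numpy vectors):

        # Prediction with latent variables: maximize jointly over (y, z),
        # but report only the output y; the latent z is discarded.
        def predict_latent(w, x, Ys, Zs, phi):
            return max(Ys,
                       key=lambda y: max(w @ phi(x, y, z) for z in Zs))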
  • 162. Maximization over Latent Variables
    Solve:
        $\min_{w,\xi}\; \tfrac{1}{2}\|w\|^2 + \tfrac{C}{N}\sum_{n=1}^{N} \xi_n$
    subject to, for $n = 1,\dots,N$ and all $y \in \mathcal{Y}$,
        $\Delta(y^n, y) + \max_{z\in\mathcal{Z}} \langle w, \phi(x^n, y, z) \rangle - \max_{z\in\mathcal{Z}} \langle w, \phi(x^n, y^n, z) \rangle \;\le\; \xi_n$.
    Problem: this is not a convex problem → it can have local minima.
    [C.-N. Yu, T. Joachims, "Learning Structural SVMs with Latent Variables", ICML, 2009]
    Similar idea: [Felzenszwalb, McAllester, Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model", CVPR 2008]
  • 163. Summary – Learning for Structured Prediction
    Task: learn a parameter $w$ for a function $f : \mathcal{X} \to \mathcal{Y}$,
        $f(x) = \mathrm{argmax}_{y\in\mathcal{Y}} \langle w, \phi(x,y) \rangle$.
    Regularized risk minimization framework:
      - loss function: logistic loss → conditional random fields (CRF); hinge loss → structured support vector machine (S-SVM);
      - regularization: L2 norm of $w$ ↔ Gaussian prior.
    Neither guarantees better prediction accuracy than the other; see e.g. [Nowozin et al., ECCV 2010].
    The difference is in the numeric optimization: use CRFs if you can do probabilistic inference; use S-SVMs if you can do loss-augmented MAP prediction.
  • 164. Open Research in Structured Learning
      - Faster training: CRFs need many runs of probabilistic inference, S-SVMs need many runs of argmax-prediction.
      - Reduce the amount of training data necessary: semi-supervised learning? transfer learning?
      - Understand competing training methods: when to use probabilistic training, when maximum margin?
      - Understand S-SVMs with approximate inference: very often, exactly solving $\mathrm{argmax}_y F(x,y)$ is infeasible. Can we guarantee convergence even with approximate inference?
      - Latent-variable S-SVMs / CRFs.
  • 165. For what problems can we learn solutions? We need training data, and we need to be able to perform argmax-prediction.
    [Figure: example problems arranged along a difficulty axis – image thresholding (trivial/boring); stereo reconstruction, pose estimation (good manual solutions); object recognition, image segmentation, action recognition (interesting ∧ realistic); true artificial intelligence (horrendously difficult).]
  • 166. For what problems can we learn solutions? We need training data, and we need to be able to perform argmax-prediction.
    [Same difficulty-axis figure, annotated: exact MAP is doable for the simpler problems, but MAP is NP-hard for the interesting, realistic ones.]
    For many interesting problems, we cannot do exact MAP prediction.
  • 167. The Need for Approximate Inference
      - Image segmentation on a pixel grid: two labels and positive pairwise weights → exact MAP doable; more than two labels → MAP generally NP-hard.
      - Stereo reconstruction (grid), #labels = #depth levels: convex penalization of disparities → exact MAP doable; non-convex (truncated) penalization → NP-hard.
      - Human pose estimation: limbs independent given the torso → exact MAP doable; limbs interacting with each other → MAP NP-hard.
  • 168.–171. Approximate inference $\hat y \approx \mathrm{argmax}_{y\in\mathcal{Y}} F(x,y)$:
      - has been used for decades, e.g. for classical combinatorial problems (traveling salesman, knapsack) and for energy minimization in Markov/conditional random fields,
      - often yields good (= almost perfect) solutions $\hat y$,
      - can come with guarantees, e.g. $F(x, y^*) \le F(x, \hat y) \le (1+\epsilon)\, F(x, y^*)$,
      - is typically much faster than exact inference.
    Can't we use that for S-SVM training? Iterating it is problematic! In each S-SVM training iteration we
      - solve $y^* = \mathrm{argmax}_y\, [\Delta(y^i, y) + F(x^i, y)]$,
      - use $y^*$ to update $w$,
      - stop when there are no more violated constraints / updates.
    With $\hat y \approx \mathrm{argmax}_y\, [\Delta(y^i, y) + F(x^i, y)]$, errors can accumulate, and we might stop too early.
    Training S-SVMs with approximate MAP prediction is an active research question.
  • 172. Ad: PhD/PostDoc Positions at I.S.T. Austria, Vienna
    I.S.T. Graduate School: 1(2) + 3 yr PhD program, full scholarships.
      - Computer Vision / Machine Learning (me, Vladimir Kolmogorov)
      - Computer Graphics (C. Wojtan)
      - Computational Topology (H. Edelsbrunner)
      - Game Theory (K. Chatterjee)
      - Software Verification (T. Henzinger)
      - Cryptography (K. Pietrzak)
      - Computational Neuroscience (G. Tkacik)
    PostDoc positions in my group: computer vision (object recognition), machine learning (structured output learning).
    More info: www.ist.ac.at – Internships: ask me!
