
Christoph Lampert, Talk part 2


  1. 1. Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria Microsoft Computer Vision School 2011
  2. 2. Kernel Methods in Computer Vision – Overview... Monday: 9:00 – 10:00 Introduction to Kernel Classifiers, 10:00 – 10:10 — mini break —, 10:10 – 10:50 More About Kernel Methods. Tuesday: 9:00 – 9:50 Conditional Random Fields, 9:50 – 10:00 — mini break —, 10:00 – 10:50 Structured Support Vector Machines. Slides available on my home page: http://www.ist.ac.at/~chl
  3. 3. Extended versions of the lectures are available in book form: Foundations and Trends in Computer Graphics and Vision, now publishers, www.nowpublishers.com/ Available as PDFs on my homepage.
  4. 4. "Normal" Machine Learning: f : X → R. Structured Output Learning: f : X → Y.
  5. 5. "Normal" Machine Learning: f : X → R. inputs X can be any kind of objects, output y is a real number: classification, regression, density estimation, . . . Structured Output Learning: f : X → Y. inputs X can be any kind of objects, outputs y ∈ Y are complex (structured) objects: images, text, audio, folds of a protein
  6. 6. What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Text Molecules / Chemical Structures Documents/HyperText Images
  7. 7. What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression) Natural Language Processing: Automatic Translation (output: sentences) Sentence Parsing (output: parse trees) Bioinformatics: Secondary Structure Prediction (output: bipartite graphs) Enzyme Function Prediction (output: path in a tree) Speech Processing: Automatic Transcription (output: sentences) Text-to-Speech (output: audio signal) Robotics: Planning (output: sequence of actions) This lecture: Structured Prediction in Computer Vision
  8. 8. Computer Vision Example: Semantic Image Segmentation. input: images → output: segmentation masks. input space X = {images} = [0, 255]^{3·M·N}, output space Y = {segmentation masks} = {0, 1}^{M·N} (structured output). prediction function: f : X → Y, f(x) := argmin_{y∈Y} E(x, y), with energy function E(x, y) = Σ_i w_i ϕ_u(x_i, y_i) + Σ_{i,j} w_{ij} ϕ_p(y_i, y_j). Images: [M. Everingham et al. "The PASCAL Visual Object Classes (VOC) challenge", IJCV 2010]
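
To make the prediction rule concrete, here is a minimal sketch in plain Python (toy image, made-up potentials and weights, not the actual VOC pipeline): it evaluates an energy of the form E(x, y) = Σ_i w_i ϕ_u(x_i, y_i) + Σ_{i,j} w_{ij} ϕ_p(y_i, y_j) on a tiny grid and finds argmin_y by brute force. Real segmentation models use graph cuts or message passing instead of enumeration.

    import itertools

    # Tiny 2x3 "image": each pixel is a single intensity in [0, 255].
    H, W = 2, 3
    x = [[30, 40, 200],
         [35, 210, 220]]

    def unary(xi, yi):
        # phi_u: toy data term; bright pixels prefer label 1, dark pixels label 0.
        return abs(xi / 255.0 - yi)

    def pairwise(yi, yj):
        # phi_p: Potts-style smoothness term, penalizes differing neighbor labels.
        return 0.0 if yi == yj else 1.0

    w_unary, w_pair = 1.0, 0.3   # weights w_i, w_ij (shared here for simplicity)

    def energy(y):
        e = sum(w_unary * unary(x[i][j], y[i][j]) for i in range(H) for j in range(W))
        for i in range(H):           # 4-connected neighborhood: right and down neighbors
            for j in range(W):
                if j + 1 < W:
                    e += w_pair * pairwise(y[i][j], y[i][j + 1])
                if i + 1 < H:
                    e += w_pair * pairwise(y[i][j], y[i + 1][j])
        return e

    # Brute-force argmin over all 2^(H*W) masks; only feasible at toy sizes.
    masks = (tuple(tuple(labels[i * W:(i + 1) * W]) for i in range(H))
             for labels in itertools.product([0, 1], repeat=H * W))
    y_star = min(masks, key=energy)
    print(y_star, energy(y_star))
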
  9. 9. Computer Vision Example: Human Pose Estimation. input: image + body model, output: model fit. input space X = {images}, output space Y = {positions/angles of K body parts} = R^{4K}. prediction function: f : X → Y, f(x) := argmin_{y∈Y} E(x, y), with energy E(x, y) = Σ_i w_i ϕ_fit(x_i, y_i) + Σ_{i,j} w_{ij} ϕ_pose(y_i, y_j). Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008]
  10. 10. Computer Vision Example: Point Matching. input: image pairs, output: mapping y : x_i ↔ y(x_i). prediction function: f : X → Y, f(x) := argmax_{y∈Y} F(x, y), with scoring function F(x, y) = Σ_i w_i ϕ_sim(x_i, y(x_i)) + Σ_{i,j} w_{ij} ϕ_dist(x_i, x_j, y(x_i), y(x_j)) + Σ_{i,j,k} w_{ijk} ϕ_angle(x_i, x_j, x_k, y(x_i), y(x_j), y(x_k)). [J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
  11. 11. Computer Vision Example: Object Localization. input: image, output: object position (left, top, right, bottom). input space X = {images}, output space Y = R^4 (bounding box coordinates). prediction function: f : X → Y, f(x) := argmax_{y∈Y} F(x, y), with scoring function F(x, y) = ⟨w, ϕ(x, y)⟩, where ϕ(x, y) = h(x|_y) is a feature vector for the image region x|_y, e.g. bag-of-visual-words. [M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV, 2008]
  12. 12. Computer Vision Examples: Summary. Image Segmentation: y = argmin_{y∈{0,1}^N} E(x, y), E(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_{ij} ϕ(y_i, y_j). Pose Estimation: y = argmin_{y∈R^{4K}} E(x, y), E(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_{ij} ϕ(y_i, y_j). Point Matching: y = argmax_{y∈Π_n} F(x, y), F(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_{ij} ϕ(y_i, y_j) + Σ_{i,j,k} w_{ijk} ϕ(y_i, y_j, y_k). Object Localization: y = argmax_{y∈R^4} F(x, y), F(x, y) = ⟨w, ϕ(x, y)⟩.
  13. 13. Grand Unified View. Predict structured output by maximization, y = argmax_{y∈Y} F(x, y), of a compatibility function F(x, y) = ⟨w, ϕ(x, y)⟩ that is linear in a parameter vector w.
  14. 14. A generic structured prediction problem. X: arbitrary input domain. Y: structured output domain, decompose y = (y_1, . . . , y_k). Prediction function f : X → Y given by f(x) = argmax_{y∈Y} F(x, y). Compatibility function (or negative of "energy"): F(x, y) = ⟨w, ϕ(x, y)⟩ = Σ_{i=1}^k w_i ϕ_i(y_i, x) (unary terms) + Σ_{i,j=1}^k w_{ij} ϕ_{ij}(y_i, y_j, x) (binary terms) + . . . (higher order terms, sometimes)
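
As a toy illustration of this decomposition, the following sketch (hypothetical feature functions, chosen only for this example) builds a joint feature vector ϕ(x, y) with unary and pairwise blocks for a short chain and evaluates F(x, y) = ⟨w, ϕ(x, y)⟩; the point is that once ϕ is fixed, F is linear in w.

    # Chain of length 3 over labels {0, 1}; x is a list of real-valued observations.
    def phi_unary(x_i, y_i):
        # Hypothetical per-node features: one block per label, the active block holds the observation.
        block = [0.0, 0.0]
        block[y_i] = x_i
        return block

    def phi_pair(y_i, y_j):
        # Indicator features for each of the 2x2 label combinations of an edge.
        block = [0.0] * 4
        block[2 * y_i + y_j] = 1.0
        return block

    def joint_feature(x, y):
        # phi(x, y): all unary blocks followed by all pairwise blocks.
        feats = []
        for x_i, y_i in zip(x, y):
            feats += phi_unary(x_i, y_i)
        for y_i, y_j in zip(y, y[1:]):
            feats += phi_pair(y_i, y_j)
        return feats

    def F(w, x, y):
        # Compatibility F(x, y) = <w, phi(x, y)>, linear in w.
        return sum(w_k * f_k for w_k, f_k in zip(w, joint_feature(x, y)))

    x = [0.2, 1.5, -0.3]
    w = [0.1 * k for k in range(2 * 3 + 4 * 2)]   # 6 unary weights + 8 pairwise weights
    print(F(w, x, (0, 1, 0)), F(w, x, (1, 1, 1)))
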
  15. 15. Example: Sequence Prediction – Handwriting Recognition. X = {5-letter word images}, x = (x_1, . . . , x_5), x_j ∈ {0, 1}^{300×80}. Y = {ASCII translations}, y = (y_1, . . . , y_5), y_j ∈ {A, . . . , Z}. The feature function has only unary terms: ϕ(x, y) = (ϕ_1(x, y_1), . . . , ϕ_5(x, y_5)), so F(x, y) = ⟨w_1, ϕ_1(x, y_1)⟩ + · · · + ⟨w_5, ϕ_5(x, y_5)⟩. (Figure: input word image, predicted output Q V E S T.) Advantage: computing y* = argmax_y F(x, y) is easy; we can find each y_i* independently, checking 5 · 26 = 130 values. Problem: only local information, we can't correct errors.
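
Because F decomposes over positions, argmax_y F(x, y) splits into five independent per-letter maximizations. A minimal sketch (the per-letter scoring function below is a placeholder standing in for ⟨w_i, ϕ_i(x, y_i)⟩, and the x_i are stand-ins for image patches):

    import string

    def letter_score(x_i, letter):
        # Placeholder for <w_i, phi_i(x, y_i)>: any score of letter hypotheses for patch x_i.
        return -abs(ord(letter) - (sum(x_i) % 26 + ord("A")))

    # Stand-ins for the five image patches, chosen so the unary-only prediction is "QVEST".
    x = [[16, 0, 0], [21, 0, 0], [4, 0, 0], [18, 0, 0], [19, 0, 0]]

    # With only unary terms, each y_i* is found independently: 5 * 26 = 130 evaluations.
    y_star = [max(string.ascii_uppercase, key=lambda c: letter_score(x_i, c)) for x_i in x]
    print("".join(y_star))   # QVEST, the locally best but wrong word from the slide
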
  16. 16. Example: Sequence Prediction – Handwriting Recognition. X = {5-letter word images}, x = (x_1, . . . , x_5), x_j ∈ {0, 1}^{300×80}. Y = {ASCII translations}, y = (y_1, . . . , y_5), y_j ∈ {A, . . . , Z}. one global feature function: ϕ(x, y) = (0, . . . , 0, Φ(x), 0, . . . , 0) with Φ(x) placed at the position of y if y ∈ D (dictionary), and ϕ(x, y) = (0, . . . , 0) otherwise. (Figure: input word image, predicted output QUEST.) Advantage: access to global information, e.g. from the dictionary D. Problem: argmax_y ⟨w, ϕ(x, y)⟩ has to check 26^5 = 11,881,376 values, and we need separate training data for each word.
  17. 17. Example: Sequence Prediction – Handwriting Recognition. X = {5-letter word images}, x = (x_1, . . . , x_5), x_j ∈ {0, 1}^{300×80}. Y = {ASCII translations}, y = (y_1, . . . , y_5), y_j ∈ {A, . . . , Z}. feature function with unary and pairwise terms: ϕ(x, y) = (ϕ_1(y_1, x), ϕ_2(y_2, x), . . . , ϕ_5(y_5, x), ϕ_{1,2}(y_1, y_2), . . . , ϕ_{4,5}(y_4, y_5)). (Figure: input word image, predicted output QUEST.) Compromise: computing y* is still efficient (Viterbi best path), and neighbor information allows correction of local errors.
  18. 18. Example: Sequence Prediction – Handwriting Recognition. ϕ(x, y) = (ϕ_1(y_1, x), . . . , ϕ_5(y_5, x), ϕ_{1,2}(y_1, y_2), . . . , ϕ_{4,5}(y_4, y_5)), w = (w_1, . . . , w_5, w_{1,2}, . . . , w_{4,5}), F(x, y) = ⟨w, ϕ(x, y)⟩ = F_1(x_1, y_1) + · · · + F_5(x_5, y_5) + F_{1,2}(y_1, y_2) + · · · + F_{4,5}(y_4, y_5). Example scores for a trellis over the letters Q, U, V, E (factors F_1, F_2, F_3 and F_{1,2}, F_{2,3}):

         pairwise F_{i,i+1}           unary F_1    unary F_2    unary F_3
          Q    U    V    E
     Q   0.0  0.9  0.1  0.1           Q  1.8       Q  0.2       Q  0.3
     U   0.1  0.1  0.4  0.6           U  0.2       U  1.1       U  0.2
     V   0.0  0.1  0.2  0.5           V  0.1       V  1.2       V  0.1
     E   0.3  0.5  0.5  1.0           E  0.4       E  0.3       E  1.8
  19. 19. Every y ∈ Y corresponds to a path; argmax_{y∈Y} F is the best path.
  20. 20. Every y ∈ Y corresponds to a path; argmax_{y∈Y} F is the best path. F(VUQ, x) = F_1(V, x_1) + F_{1,2}(V, U) + F_2(U, x_2) + F_{2,3}(U, Q) + F_3(Q, x_3) = 0.1 + 0.1 + 1.1 + 0.1 + 0.3 = 1.7
  21. 21. Maximal per-letter scores.
  22. 22. Maximal per-letter scores. Total: 1.8 + 0.1 + 1.2 + 0.5 + 1.8 = 5.4
  23. 23. Best (=Viterbi) path. Total: 1.8 + 0.9 + 1.1 + 0.6 + 1.8 = 6.2
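
A minimal sketch of Viterbi decoding on the numbers from slide 18 (three positions over the letters Q, U, V, E); it recovers the best path Q-U-E with total score 6.2, as stated above. This is a generic dynamic-programming sketch, not code from the lecture.

    LETTERS = ["Q", "U", "V", "E"]

    # Unary scores F_1, F_2, F_3 from slide 18.
    unary = [
        {"Q": 1.8, "U": 0.2, "V": 0.1, "E": 0.4},   # F_1
        {"Q": 0.2, "U": 1.1, "V": 1.2, "E": 0.3},   # F_2
        {"Q": 0.3, "U": 0.2, "V": 0.1, "E": 1.8},   # F_3
    ]

    # Pairwise scores F_{i,i+1}(a, b), shared across positions (slide 18).
    pairwise = {
        "Q": {"Q": 0.0, "U": 0.9, "V": 0.1, "E": 0.1},
        "U": {"Q": 0.1, "U": 0.1, "V": 0.4, "E": 0.6},
        "V": {"Q": 0.0, "U": 0.1, "V": 0.2, "E": 0.5},
        "E": {"Q": 0.3, "U": 0.5, "V": 0.5, "E": 1.0},
    }

    def viterbi(unary, pairwise):
        # delta[b] = best score of any path ending in letter b; back[t][b] = its predecessor.
        delta, back = dict(unary[0]), []
        for t in range(1, len(unary)):
            new_delta, new_back = {}, {}
            for b in LETTERS:
                a_best = max(LETTERS, key=lambda a: delta[a] + pairwise[a][b])
                new_delta[b] = delta[a_best] + pairwise[a_best][b] + unary[t][b]
                new_back[b] = a_best
            delta, back = new_delta, back + [new_back]
        last = max(LETTERS, key=lambda b: delta[b])    # best final letter, then backtrack
        path = [last]
        for bp in reversed(back):
            path.append(bp[path[-1]])
        return list(reversed(path)), delta[last]

    path, score = viterbi(unary, pairwise)
    print(path, round(score, 1))   # ['Q', 'U', 'E'] 6.2
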
  24. 24. Example: Human Pose Estimation. Pictorial Structures for Articulated Pose Estimation. (Figure: tree-structured model with the image X connected to each part variable by unary factors such as F^(1)_top, and neighboring parts Y_top, Y_head, Y_torso, Y_rarm, Y_larm, Y_rhnd, Y_lhnd, Y_rleg, Y_lleg, Y_rfoot, Y_lfoot connected by pairwise factors such as F^(2)_{top,head}.) Many popular models have unary and pairwise terms. Certain higher order terms are also doable, see C. Rother's lectures. [Felzenszwalb and Huttenlocher, 2000], [Fischler and Elschlager, 1973]
  25. 25. Thursday/Friday's lecture: how to evaluate argmax_y F(x, y). Loop-free graphs (chain, tree): Viterbi decoding / dynamic programming. Loopy graphs (grid, arbitrary graph): approximate inference (e.g. loopy BP). Today: how to learn a good function F(x, y) from training data.
  26. 26. Parameter Learning in Structured Models. Given: parametric model (family): F(x, y) = ⟨w, ϕ(x, y)⟩. Given: prediction method: f(x) = argmax_{y∈Y} F(x, y). Not given: parameter vector w (high-dimensional).
  27. 27. Parameter Learning in Structured Models. Given: parametric model (family): F(x, y) = ⟨w, ϕ(x, y)⟩. Given: prediction method: f(x) = argmax_{y∈Y} F(x, y). Not given: parameter vector w (high-dimensional). Supervised Training: Given: example pairs {(x^1, y^1), . . . , (x^n, y^n)} ⊂ X × Y, typical inputs with "the right" outputs for them. Task: determine a "good" w.
  28. 28. Parameter Learning in Structured Models. Given: parametric model (family): F(x, y) = ⟨w, ϕ(x, y)⟩. Given: prediction method: f(x) = argmax_{y∈Y} F(x, y). Not given: parameter vector w (high-dimensional). Supervised Training: Given: example pairs {(x^1, y^1), . . . , (x^n, y^n)} ⊂ X × Y, typical inputs with "the right" outputs for them. Task: determine a "good" w. now–9:50: Probabilistic Training, inspired by probabilistic discriminative classifiers: Logistic Regression → Conditional Random Fields. 10:00–10:50: Maximum Margin Training, inspired by maximum-margin discriminative classifiers: Support Vector Machines → Structured Output SVMs.
  29. 29. Probabilistic Training of Structured Models (Conditional Random Fields)
  30. 30. Probabilistic Models for Structured Prediction. Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x)
  31. 31. Probabilistic Models for Structured Prediction. Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x) = maximum of the posterior probability
  32. 32. Probabilistic Models for Structured Prediction. Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x) = maximum of the posterior probability = Bayes optimal decision
  33. 33. Probabilistic Models for Structured Prediction. Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x) = maximum of the posterior probability = Bayes optimal decision. Gibbs Distribution: p(y|x) = (1/Z(x)) e^{F(x,y)} with Z(x) = Σ_{y∈Y} e^{F(x,y)}
  34. 34. Probabilistic Models for Structured Prediction. Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x) = maximum of the posterior probability = Bayes optimal decision. Gibbs Distribution: p(y|x) = (1/Z(x)) e^{F(x,y)} with Z(x) = Σ_{y∈Y} e^{F(x,y)}. Then argmax_{y∈Y} p(y|x) = argmax_{y∈Y} e^{F(x,y)}/Z(x) = argmax_{y∈Y} e^{F(x,y)} = argmax_{y∈Y} F(x, y)
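
A minimal numerical check of the Gibbs construction (toy scores; the three sequences and their scores are taken from slides 20–23 and treated as the whole output space only for illustration): normalizing e^{F(x,y)} gives a distribution, and its argmax coincides with the argmax of F(x, y) because the exponential is monotone.

    import math

    # Toy compatibility scores F(x, y) for a fixed input x.
    F = {"QUE": 6.2, "QVE": 5.4, "VUQ": 1.7}

    # Gibbs distribution: p(y|x) = exp(F(x, y)) / Z(x), Z(x) = sum_y exp(F(x, y)).
    Z = sum(math.exp(v) for v in F.values())
    p = {y: math.exp(v) / Z for y, v in F.items()}

    assert abs(sum(p.values()) - 1.0) < 1e-12       # p is a proper distribution
    assert max(p, key=p.get) == max(F, key=F.get)   # same maximizer as F
    print(max(p, key=p.get), round(p["QUE"], 3))
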
  35. 35. Probabilistic Models for Structured Prediction. For learning, we make the dependence on w explicit: F(x, y, w) = ⟨w, ϕ(x, y)⟩. Parametrized Gibbs Distribution – Loglinear Model: p(y|x, w) = (1/Z(x, w)) e^{F(x,y,w)} with Z(x, w) = Σ_{y∈Y} e^{F(x,y,w)}
  36. 36. Probabilistic Models for Structured Prediction. Supervised training: Given: i.i.d. example pairs D = {(x^1, y^1), . . . , (x^n, y^n)}. Task: determine w. Idea: Maximize the posterior probability p(w|D)
  37. 37. Probabilistic Models for Structured Prediction. Supervised training: Given: i.i.d. example pairs D = {(x^1, y^1), . . . , (x^n, y^n)}. Task: determine w. Idea: Maximize the posterior probability p(w|D). Use Bayes' rule to infer the posterior distribution for w: p(w|D) = p(D|w) p(w) / p(D) = p((x^1, y^1), . . . , (x^n, y^n)|w) p(w) / p(D)
  38. 38. Probabilistic Models for Structured Prediction. Supervised training: Given: i.i.d. example pairs D = {(x^1, y^1), . . . , (x^n, y^n)}. Task: determine w. Idea: Maximize the posterior probability p(w|D). Use Bayes' rule to infer the posterior distribution for w: p(w|D) = p(D|w) p(w) / p(D) = p((x^1, y^1), . . . , (x^n, y^n)|w) p(w) / p(D) = p(w) Π_{i=1}^n p(x^i, y^i|w) / p(x^i, y^i)  (i.i.d.)  = p(w) Π_{i=1}^n [ p(y^i|x^i, w) p(x^i|w) ] / [ p(y^i|x^i) p(x^i) ]
  39. 39. Probabilistic Models for Structured Prediction. Supervised training: Given: i.i.d. example pairs D = {(x^1, y^1), . . . , (x^n, y^n)}. Task: determine w. Idea: Maximize the posterior probability p(w|D). Use Bayes' rule to infer the posterior distribution for w: p(w|D) = p(D|w) p(w) / p(D) = p((x^1, y^1), . . . , (x^n, y^n)|w) p(w) / p(D) = p(w) Π_{i=1}^n p(x^i, y^i|w) / p(x^i, y^i)  (i.i.d.)  = p(w) Π_{i=1}^n [ p(y^i|x^i, w) p(x^i|w) ] / [ p(y^i|x^i) p(x^i) ]. Assuming x does not depend on w, i.e. p(x|w) ≡ p(x): = p(w) Π_{i=1}^n p(y^i|x^i, w) / p(y^i|x^i)
  40. 40. Posterior probability: p(w|D) = p(w) · Π_i p(y^i|x^i, w) / p(y^i|x^i), i.e. a prior times a (loglinear) data term. The prior p(w) expresses prior knowledge; we cannot infer it from data. Gaussian: p(w) := const. · e^{−‖w‖²/(2σ²)}. Gaussianity is often a good guess if you know nothing else, and it is easy to handle, analytically and numerically. Less common: Laplacian: p(w) := const. · e^{−‖w‖/σ}, if we expect sparsity in w, but the properties of the optimization problem change.
  41. 41. Learning w from D ≡ maximizing the posterior probability for w: w* = argmax_{w∈R^d} p(w|D) = argmin_{w∈R^d} −log p(w|D). −log p(w|D) = −log [ p(w) Π_i p(y^i|x^i, w) / p(y^i|x^i) ] = −log p(w) − Σ_{i=1}^n log p(y^i|x^i, w) + Σ_{i=1}^n log p(y^i|x^i), where the last sum is independent of w. Inserting p(y^i|x^i, w) = (1/Z(x^i, w)) e^{F(x^i,y^i,w)} and the Gaussian prior gives ‖w‖²/(2σ²) − Σ_{i=1}^n [ F(x^i, y^i, w) − log Z(x^i, w) ] + const. = L(w) + const., with L(w) = ‖w‖²/(2σ²) − Σ_{i=1}^n [ ⟨w, ϕ(x^i, y^i)⟩ − log Σ_{y∈Y} e^{⟨w, ϕ(x^i, y)⟩} ]
  42. 42. (Negative) (Regularized) Conditional Log-Likelihood (of D): L(w) = (1/(2σ²)) ‖w‖² − Σ_{i=1}^n [ ⟨w, ϕ(x^i, y^i)⟩ − log Σ_{y∈Y} e^{⟨w, ϕ(x^i, y)⟩} ]. Probabilistic parameter estimation or training means solving w* = argmin_{w∈R^d} L(w). This is the same optimization problem as for multi-class logistic regression, L_logreg(w) = (1/(2σ²)) ‖w‖² − Σ_{i=1}^n [ ⟨w_{y^i}, ϕ(x^i)⟩ − log Σ_{y∈Y} e^{⟨w_y, ϕ(x^i)⟩} ], with w = (w_1, . . . , w_K) for K classes.
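
For intuition, a minimal sketch (toy joint features, brute-force enumeration of Y; real CRF training replaces the explicit sum over Y with inference) that evaluates L(w) exactly as defined above for a small chain model:

    import math, itertools

    LABELS = [0, 1]
    CHAIN_LEN = 3

    def joint_feature(x, y):
        # Toy phi(x, y): one unary feature per position (the observation if the label is 1)
        # plus one pairwise feature counting equal neighboring labels.
        unary = [x[i] if y[i] == 1 else 0.0 for i in range(CHAIN_LEN)]
        agree = float(sum(y[i] == y[i + 1] for i in range(CHAIN_LEN - 1)))
        return unary + [agree]

    def score(w, x, y):
        # F(x, y, w) = <w, phi(x, y)>
        return sum(w_k * f_k for w_k, f_k in zip(w, joint_feature(x, y)))

    def L(w, data, sigma2=1.0):
        reg = sum(w_k * w_k for w_k in w) / (2.0 * sigma2)
        loss = 0.0
        for x, y in data:
            log_Z = math.log(sum(math.exp(score(w, x, y2))
                                 for y2 in itertools.product(LABELS, repeat=CHAIN_LEN)))
            loss -= score(w, x, y) - log_Z
        return reg + loss

    # Two training pairs (x^i, y^i); w holds 3 unary weights and 1 pairwise weight.
    data = [([0.5, -1.0, 2.0], (1, 0, 1)), ([1.5, 0.3, -0.2], (1, 1, 0))]
    print(L([0.0, 0.0, 0.0, 0.0], data))   # at w = 0 every y is equally likely: 2 * log(2^3)
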
  43. 43. Negative Conditional Log-Likelihood for w ∈ R². (Figure: contour plots of the negative conditional log-likelihood for σ² = 0.01, σ² = 0.10, σ² = 1.00, and σ² → ∞.)
  44. 44. (Figure: the contour plots of the negative conditional log-likelihood for σ² = 0.01, 0.10, 1.00, and σ² → ∞, repeated.) w* = argmin_{w∈R^d} L(w).
  45. 45. (Figure: the same contour plots.) w* = argmin_{w∈R^d} L(w). w ∈ R^d is continuous (not discrete).
  46. 46. (Figure: the same contour plots.) w* = argmin_{w∈R^d} L(w). w ∈ R^d is continuous (not discrete). w ∈ R^d is high-dimensional (often d ≫ 1000).
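
The contour plots above can be reproduced qualitatively with a few lines of code; the sketch below (hypothetical two-parameter loglinear model and made-up data, not the data behind the slides) evaluates L(w) on a grid over w ∈ R² for the four values of σ² and reports the grid minimizer, showing how a strong prior (small σ²) pulls w* toward 0 while σ² → ∞ leaves only the data term.

    import math

    # Toy 2-parameter loglinear model p(y|x, w) ∝ exp(w_y * x), y ∈ {0, 1}: the multi-class
    # logistic regression special case from slide 42, with hypothetical data.
    data = [(0.5, 1), (1.2, 1), (-0.7, 0), (-1.5, 0)]

    def L(w, sigma2):
        reg = (w[0] ** 2 + w[1] ** 2) / (2.0 * sigma2)
        data_term = 0.0
        for x, y in data:
            log_Z = math.log(math.exp(w[0] * x) + math.exp(w[1] * x))
            data_term -= w[y] * x - log_Z
        return reg + data_term

    # Coarse grid over w in [-3, 5]^2, mirroring the axes of the contour plots.
    grid = [(-3 + 0.5 * i, -3 + 0.5 * j) for i in range(17) for j in range(17)]
    for sigma2 in (0.01, 0.1, 1.0, 1e9):   # 1e9 stands in for sigma^2 -> infinity
        w_best = min(grid, key=lambda w: L(w, sigma2))
        print(sigma2, w_best, round(L(w_best, sigma2), 3))
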
