Upcoming SlideShare
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Standard text messaging rates apply

# 03 conditional random field

861

Published on

Published in: Technology
3 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total Views
861
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
29
0
Likes
3
Embeds 0
No embeds

No notes for slide

### Transcript

• 1. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39
• 2. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Problem (Probabilistic Learning) Let d(y|x) be the (unknown) true conditional distribution. Let D = {(x1 , y 1 ), . . . , (xN , y N )} be i.i.d. samples from d(x, y). Find a distribution p(y|x) that we can use as a proxy for d(y|x). or Given a parametrized family of distributions, p(y|x, w), ﬁnd the parameter w∗ making p(y|x, w) closest to d(y|x). Open questions: What do we mean by closest? What’s a good candidate for p(y|x, w)? How to actually ﬁnd w∗ ? conceptually, and numerically 2 / 39
• 3. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Principle of Parsimony (Parsimoney, aka Occam’s razor) “Pluralitas non est ponenda sine neccesitate.” William of Ockham “We are to admit no more causes of natural things than such as are both true and suﬃcient to explain their appearances.” Isaac Newton “Make everything as simple as possible, but not simpler.” Albert Einstein “Use the simplest explanation that covers all the facts.” what we’ll use 3 / 39
• 4. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields 1) Deﬁne what aspects we consider relevant facts about the data. 2) Pick the simplest distribution reﬂecting that. Deﬁnition (Simplicity ≡ Entropy) The simplicity of a distribution p is given by its entropy: H(p) = − p(z) log p(z) z∈Z Deﬁnition (Relevant Facts ≡ Feature Functions) By φi : Z → R for i = 1, . . . , D we denote a set of feature functions that express everything we want to be able to model about our data. For example: the grayvalue of a pixel, a bag-of-words histogram of an image, the time of day an image was taken, a ﬂag if a pixel is darker than half of its neighbors. 4 / 39
• 5. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Principle (Maximum Entropy Principle) Let z 1 , . . . , z N be samples from a distribution d(z). Let φ1 , . . . , φD be 1 feature functions, and denote by µi := N n φi (z n ) their means over the sample set. The maximum entropy distribution, p, is the solution to max H(p) subject to Ez∼p(z) {φi (z)} = µi . p is a prob.distr. be faithful to what we know be as simple as possible Theorem (Exponential Family Distribution) Under some very reasonable conditions, the maximum entropy distribution has the form 1 p(z) = exp wi φi (z) Z i for some parameter vector w = (w1 , . . . , wD ) and constant Z. 5 / 39
• 6. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Example: Let Z = R, φ1 (z) = z, φ2 (z) = z 2 . The exponential family distribution is 1 p(z) = exp( w1 z + w2 z 2 ) Z(w) b2 a 2 w1 = exp( a z − b ) for a = w2 , b = − . Z(a, b) w2 It’s a Gaussian! Given examples z 1 , . . . , z N , we can compute a and b, and derive w. 6 / 39
• 7. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Example: Let Z = {1, . . . , K}, φk (z) = z = k , for k = 1, . . . , K. The exponential family distribution is 1 p(z) = exp( wk φk (z) ) Z(w) k  exp(w1 )/Z for z = 1,    exp(w )/Z for z = 2, 2 = . . .    exp(w )/Z for z = K. K with Z = exp(w1 ) + · · · + exp(wK ). It’s a Multinomial! 7 / 39
• 8. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Example: Let Z = {0, 1}N ×M image grid, φi (y) := yi for each pixel i, φN M (y) = i∼j yi = yj (summing over all 4-neighbor pairs) The exponential family distribution is 1 p(z) = exp( w, φ(y) + w ˜ yi = yj ) Z(w) i,j It’s a (binary) Markov Random Field! 8 / 39
• 9. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random FieldsConditional Random Field Learning Assume: a set of i.i.d. samples D = {(xn , y n )}n=1,...,N , (xn , y n ) ∼ d(x, y) feature functions ( φ1 (x, y), . . . , φD (x, y) ) ≡: φ(x, y) 1 parametrized family p(y|x, w) = Z(x,w) exp( w, φ(x, y) ) Task: adjust w of p(y|x, w) based on D. Many possible technique to do so: Expectation Matching Maximum Likelihood Best Approximation MAP estimation of w Punchline: they all turn out to be (almost) the same! 9 / 39
• 10. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Maximum Likelihood Parameter Estimation Idea: maximize conditional likelihood of observing outputs y 1 , . . . , y N for inputs x1 , . . . , xN w∗ = argmax p(y 1 , . . . , y N |x1 , . . . , xN , w) w∈RD N i.i.d. = argmax p(y n |xn , w) w∈RD n=1 N − log(·) = argmin − log p(y n |xn , w) w∈RD n=1 negative conditional log-likelihood (of D) 10 / 39
• 11. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Best Approximation Idea: ﬁnd p(y|x, w) that is closest to d(y|x) Deﬁnition (Similarity between conditional distributions) For ﬁxed x ∈ X : KL-divergence measure similarity d(y|x) KLcond (p||d)(x) := d(y|x) log p(y|x, w) y∈Y For x ∼ d(x), compute expectation: KLtot (p||d) : = Ex∼d(x) KLcond (p||d)(x) d(y|x) = d(x, y) log p(y|x, w) x∈X y∈Y 11 / 39
• 12. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Best Approximation Idea: ﬁnd p(y|x, w) of minimal KLtot -distance to d(y|x) d(y|x) w∗ = argmin d(x, y) log w∈RD x∈X y∈Y p(y|x, w) drop const. = argmin − d(x, y) log p(y|x, w) w∈RD (x,y)∈X ×Y N (xn ,y n )∼d(x,y) ≈ argmin − log p(y n |xn , w) w∈RD n=1 negative conditional log-likelihood (of D) 12 / 39
• 13. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields MAP Estimation of w Idea: Treat w as random variable; maximize posterior probability p(w|D) N p(x1 , y 1 , . . . , xn , y n |w)p(w) i.i.d. Bayes p(y n |xn , w) p(w|D) = = p(w) p(D) p(y n |xn ) n=1 p(w): prior belief on w (cannot be estimated from data). w∗ = argmax p(w|D) = argmin − log p(w|D) w∈RD w∈RD N = argmin − log p(w) − log p(y n |xn , w) + log p(y n |xn ) w∈RD n=1 indep. of w N = argmin − log p(w) − log p(y n |xn , w) w∈RD n=1 13 / 39
• 14. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields N w∗ = argmin − log p(w) − log p(y n |xn , w) w∈RD n=1 Choices for p(w): p(w) :≡ const. (uniform; in RD not really a distribution) N w∗ = argmin − log p(y n |xn , w) + const. w∈RD n=1 negative conditional log-likelihood 1 2 p(w) := const. · e− 2σ2 w (Gaussian) N 1 w∗ = argmin − w 2 + log p(y n |xn , w) + const. w∈RD 2σ 2 n=1 regularized negative conditional log-likelihood 14 / 39
• 15. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Probabilistic Models for Structured Prediction - Summary Negative (Regularized) Conditional Log-Likelihood (of D) N 1 2 w,φ(xn ,y) L(w) = 2 w − w, φ(xn , y n ) − log e 2σ n=1 y∈Y (σ 2 → ∞ makes it unregularized) Probabilistic parameter estimation or training means solving w∗ = argmin L(w). w∈RD Same optimization problem as for multi-class logistic regression. 15 / 39
• 16. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Negative Conditional Log-Likelihood (Toy Example) negative log likelihood σ2 =0.01 negative log likelihood σ2 =0.10 3 3 128.0 2 2 00 .000 512.0 512 00 0 0 .00 .00 128 1 1 256 00 64.0 128.000 32.000 .0 00 16 0 0 0 0 2.0 32.000 8.000 64.000 1024.000 4.0 00 1 1 00 16.0 2 2 3 2 1 0 1 2 3 4 5 3 2 1 0 1 2 3 4 5 negative log likelihood σ2 =1.00 negative log likelihood σ2 → ∞ 3 2.5 0 .00 2.0 128 2 1.5 00 00 .0 00 .0 1.0 32 16 64.0 1 0 0 4.00 0 0.5 8.0 00 31 2 25 50 32.0 0.0.00.1 0.2 0000 00 0.0 0000 00 0001 02 00 00 04 8 16 0.5 .0 2.0 0.0.0 0.0 00 0.0.0 0.0 4.0 8.0 0.0.0 0.0 0.0 16.0 0 6 00 00 1 0 0 0 0 0 64.0 00 0.5 1.000 0.5 1 1.0 00 1.5 2.0 2 2.0 3 2 1 0 1 2 3 4 5 3 2 1 0 1 2 3 4 5 16 / 39
• 17. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Steepest Descent Minimization – minimize L(w) input tolerance > 0 1: wcur ← 0 2: repeat 3: v ← w L(wcur ) 4: η ← argminη∈R L(wcur − ηv) 5: wcur ← wcur − ηv 6: until v < output wcur Alternatives: L-BFGS (second-order descent without explicit Hessian) Conjugate Gradient We always need (at least) the gradient of L. 17 / 39
• 18. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields N 1 2 w,φ(xn ,y) L(w) = w − w, φ(xn , y n ) + log e 2σ 2 n=1 y∈Y N e w,φ(xn ,y) φ(xn , y) 1 n n y∈Y w L(w) = 2 w − φ(x , y ) − w,φ(xn ,¯) y σ y ∈Y ¯ e n=1 N 1 = w− φ(xn , y n ) − p(y|xn , w)φ(xn , y) σ2 n=1 y∈Y N 1 = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 n=1 N 1 ∆L(w) = IdD×D + [Ey∼p(y|xn ,w) φ(xn , y)][Ey∼p(y|xn ,w) φ(xn , y)] σ2 n=1 18 / 39
• 19. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields N 1 2 w,φ(xn ,y) L(w) = w − w, φ(xn , y n ) + log e 2σ 2 n=1 y∈Y C ∞ -diﬀerentiable on all RD . 19 / 39
• 20. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields N 1 w L(w) = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 n=1 For σ → ∞: Ey∼p(y|xn ,w) φ(xn , y) = φ(xn , y n ) ⇒ w L(w) = 0, criticial point of L (local minimum/maximum/saddle point). Interpretation: We aim for expectation matching: Ey∼p φ(x, y) = φ(x, y obs ) but discriminatively: only for x ∈ {x1 , . . . , xn }. 20 / 39
• 21. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields N 1 ∆L(w) = IdD×D + [Ey∼p(y|xn ,w) φ(xn , y)][Ey∼p(y|xn ,w) φ(xn , y)] σ2 n=1 positive deﬁnite Hessian matrix → L(w) is convex → w L(w) = 0 implies global minimum. 21 / 39
• 22. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Milestone I: Probabilistic Training (Conditional Random Fields) p(y|x, w) log-linear in w ∈ RD . Training: many probabilistic derivations lead to same optimization problem → minimize negative conditional log-likelihood, L(w) L(w) is diﬀerentiable and convex, → gradient descent will ﬁnd global optimum with w L(w) =0 Same structure as multi-class logistic regression. For logistic regression: this is where the textbook ends. we’re done. For conditional random ﬁelds: we’re not in safe waters, yet! 22 / 39
• 23. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Task: Compute v = w L(wcur ), evaluate L(wcur + ηv): N 1 2 w,φ(xn ,y) L(w) = w − w, φ(xn , y n ) + log e 2σ 2 n=1 y∈Y N 1 w L(w) = w− φ(xn , y n ) − p(y|xn , w)φ(xn , y) σ2 n=1 y∈Y Problem: Y typically is very (exponentially) large: binary image segmentation: |Y| = 2640×480 ≈ 1092475 ranking N images: |Y| = N !, e.g. N = 1000: |Y| ≈ 102568 . We must use the structure in Y, or we’re lost. 23 / 39
• 24. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Solving the Training Optimization Problem Numerically N 1 w L(w) = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 n=1 Computing the Gradient (naive): O(K M N D) N 1 2 L(w) = 2 w − w, φ(xn , y n ) + log Z(xn , w) 2σ n=1 Line Search (naive): O(K M N D) per evaluation of L N : number of samples D: dimension of feature space M : number of output nodes ≈ 100s to 1,000,000s K: number of possible labels of each output nodes ≈ 2 to 100s 24 / 39
• 25. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Probabilistic Inference to the Rescue Remember: in a graphical model with factors F, the features decompose: φ(x, y) = φF (x, yF ) F ∈F Ey∼p(y|x,w) φ(x, y) = Ey∼p(y|x,w) φF (x, yF ) F ∈F = EyF ∼p(yF |x,w) φF (x, yF ) F ∈F EyF ∼p(yF |x,w) φF (x, yF ) = p(yF |x, w) φF (x, yF ) yF ∈YF factor marginals K |F | terms Factor marginals µF = p(yF |x, w) are much smaller than complete joint distribution p(y|x, w), can be computed/approximated, e.g., with (loopy) belief propagation. 25 / 39
• 26. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Solving the Training Optimization Problem Numerically N 1 w L(w) = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 n=1 Computing the Gradient: X O(M K |Fmax | N D): O(K MX X XXX nd), N 1 2 w,φ(xn ,y) L(w) = w − w, φ(xn , y n ) + log e 2σ 2 n=1 y∈Y Line Search: XX O(M K |Fmax | N D) per evaluation of L O(K MX X X X nd), N : number of samples ≈ 10s to 1,000,000s D: dimension of feature space M : number of output nodes K: number of possible labels of each output nodes 26 / 39
• 27. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields What, if the training set D is too large (e.g. millions of examples)? Simplify Model Train simpler model (smaller factors) Less expressive power ⇒ results might get worse rather than better Subsampling Create random subset D ⊂ D. Train model using D Ignores all information in D D Parallelize Train multiple models in parallel. Merge the models. Follows the multi-core/GPU trend How to actually merge? (or ?) Doesn’t really save computation. 27 / 39
• 28. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Keep maximizing p(w|D). In each gradient descent step: Create random subset D ⊂ D, ← often just 1–3 elements! Follow approximate gradient ˜w L(w) = 1 w − φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 (xn ,y n )∈D Line search no longer possible. Extra parameter: stepsize η SGD converges to argminw L(w)! (if η chosen right) SGD needs more iterations, but each one is much faster more: see L. Bottou, O. Bousquet: ”The Tradeoﬀs of Large Scale Learning”, NIPS 2008. also: http://leon.bottou.org/research/largescale 28 / 39
• 29. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Solving the Training Optimization Problem Numerically N 1 w L(w) = w− φ(xn , y n ) − Ey∼p(y|xn ,w) φ(xn , y) σ2 n=1 Computing the Gradient: X O(M K 2 N D) (if BP is possible): O(K MX X XXX nd), N 1 2 w,φ(xn ,y) L(w) = w − w, φ(xn , y n ) + log e 2σ 2 n=1 y∈Y Line Search: XX O(M K 2 N D) per evaluation of L O(K MX X X X nd), N : number of samples D: dimension of feature space: ≈ φi,j 1–10s, φi : 100s to 10000s M : number of output nodes K: number of possible labels of each output nodes 29 / 39
• 30. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Typical feature functions in image segmentation: φi (yi , x) ∈ R≈1000 : local image features, e.g. bag-of-words → wi , φi (yi , x) : local classiﬁer (like logistic-regression) φi,j (yi , yj ) = yi = yj ∈ R1 : test for same label → wij , φij (yi , yj ) : penalizer for label changes (if wij 0) combined: argmaxy p(y|x) is smoothed version of local cues original local classiﬁcation local + smoothness 30 / 39
• 31. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Typical feature functions in handwriting recognition: φi (yi , x) ∈ R≈1000 : image representation (pixels, gradients) → wi , φi (yi , x) : local classiﬁer if xi is letter yi φi,j (yi , yj ) = eyi ⊗ eyj ∈ R26·26 : letter/letter indicator → wij , φij (yi , yj ) : encourage/suppress letter combinations combined: argmaxy p(y|x) is ”corrected” version of local cues local classiﬁcation local + ”correction” 31 / 39
• 32. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Typical feature functions in pose estimation: φi (yi , x) ∈ R≈1000 : local image representation, e.g. HoG → wi , φi (yi , x) : local conﬁdence map φi,j (yi , yj ) = good ﬁt(yi , yj ) ∈ R1 : test for geometric ﬁt → wij , φij (yi , yj ) : penalizer for unrealistic poses together: argmaxy p(y|x) is sanitized version of local cues original local classiﬁcation local + geometry [V. Ferrari, M. Marin-Jimenez, A. Zisserman: ”Progressive Search Space Reduction for Human Pose Estimation”, CVPR 2008.] 32 / 39
• 33. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Typical feature functions for CRFs in Computer Vision: φi (yi , x): local representation, high-dimensional → wi , φi (yi , x) : local classiﬁer φi,j (yi , yj ): prior knowledge, low-dimensional → wij , φij (yi , yj ) : penalize outliers learning adjusts parameters: unary wi : learn local classiﬁers and their importance binary wij : learn importance of smoothing/penalization argmaxy p(y|x) is cleaned up version of local prediction 33 / 39
• 34. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Solving the Training Optimization Problem Numerically Idea: split learning of unary potentials into two parts: local classiﬁers, their importance. Two-Stage Training pre-train fiy (x) = log p(yi |x) ˆ use φ˜i (yi , x) := f y (x) ∈ RK (low-dimensional) i keep φij (yi , yj ) are before ˜ perform CRF learning with φi and φij Advantage: lower dimensional feature space during inference → faster fiy (x) can be stronger classiﬁers, e.g. non-linear SVMs Disadvantage: if local classiﬁers are bad, CRF training cannot ﬁx that. 34 / 39
• 35. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Solving the Training Optimization Problem Numerically CRF training methods is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: N ˜w L(w) = w − φ(xn , y n ) − p(y|xn , w) φ(xn , y) ∈ RD σ2 n=1 y∈Y A lot of research on accelerating CRF training: problem ”solution” method(s) |Y| too large exploit structure (loopy) belief propagation smart sampling contrastive divergence use approximate L e.g. pseudo-likelihood N too large mini-batches stochastic gradient descent D too large trained φunary two-stage training 35 / 39
• 36. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Summary – CRF Learning Given: i.i.d. training set {(x1 , y 1 ), . . . , (xn , y n )} ⊂ X × Y, (xn , y n ) ∼ d(x, y) feature function φ : X × RD . 1 Task: ﬁnd parameter vector w such that Z exp( w, φ(x, y) ) ≈ d(y|x). CRF solution derived by minimizing negative conditional log-likelihood: N ∗ 1 2 w,φ(xn ,y) w = argmin w − w, φ(xn , y n ) − log e w 2σ 2 n=1 y∈Y convex optimization problem → gradient descent works training needs repeated runs of probabilistic inference 36 / 39
• 37. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Extra I: Beyond Fully Supervised Learning So far, training was fully supervised, all variables were observed. In real life, some variables can be unobserved even during training. missing labels in training data latent variables, e.g. part location latent variables, e.g. part occlusion latent variables, e.g. viewpoint 37 / 39
• 38. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Graphical model: three types of variables x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Marginalization over Latent Variables Construct conditional likelihood as usual: 1 p(y, z|x, w) = exp( w, φ(x, y, z) ) Z(x, w) Derive p(y|x, w) by marginalizing over z: 1 p(y|x, w) = p(y, z|x, w) Z(x, w) z∈Z 38 / 39
• 39. Sebastian Nowozin and Christoph Lampert – Structured Models in Computer Vision – Part 4. Conditional Random Fields Negative regularized conditional log-likelihood: N 2 L(w) = λ w − log p(y n |xn , w) n=1 N 2 =λ w − log p(y n , z|xn , w) n=1 z∈Z N 2 =λ w − log exp( w, φ(xn , y n , z) ) n=1 z∈Z N + log exp( w, φ(xn , y, z) ) n=1 z∈Z y∈Y L is not convex in w → can have local minima no agreed on best way for treating latent variables 39 / 39