
Logistic Regression Demystified (Hopefully)

In this talk, I am going to discuss logistic regression, a technique that has been (and still is) heavily used as a solution to many supervised learning problems, in several different domains.
I will be focusing on three fundamental and general aspects of any supervised learning problem: i) the model, ii) the error measure (or cost function), and iii) the learning algorithm. Then, I will cast each of those aspects into the specific case of logistic regression.

  1. Logistic Regression Demystified (Hopefully). Gabriele Tolomei, Yahoo Labs, London, UK. 19th November 2015.
  2. Introduction
     • 3 components need to be defined:
       – Model: describes the set of hypotheses (hypothesis space) that can be represented;
       – Error Measure (Cost Function): measures the price that must be paid when a misclassification error occurs;
       – Learning Algorithm: is responsible for picking the best hypothesis (according to the error measure) by searching through the hypothesis space.
  3. The Model
  4. Linear Signal
     • Logistic Regression is an example of a linear model.
     • Given a (d+1)-dimensional input x, with xᵀ = (x₀, x₁, …, x_d) and x₀ = 1.
     • We define the family of real-valued functions F having d+1 parameters θ, with θᵀ = (θ₀, θ₁, …, θ_d).
     • Each function fθ in F outputs a real scalar obtained as a linear combination of the input x with the parameters θ: fθ(x) = θᵀx. Here fθ(x) means "the application of f parametrized by θ to x" and is referred to as the signal.
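As a minimal sketch (my own illustration, not from the slides; NumPy and the names signal, theta, x are assumptions), the signal is just a dot product once the constant coordinate x₀ = 1 is prepended to the raw features:

```python
import numpy as np

def signal(theta, x):
    """Linear signal f_theta(x) = theta^T x."""
    return np.dot(theta, x)

theta = np.array([0.5, -1.0, 2.0])  # (theta_0, theta_1, theta_2), i.e. d = 2
x = np.array([1.0, 3.0, 0.25])      # x_0 = 1 prepended to the raw features
print(signal(theta, x))             # 0.5 - 3.0 + 0.5 = -2.0
```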
  5. Hypothesis Space
     • The signal alone is not enough to define the hypothesis space H.
     • Usually the signal is passed through a "filter", i.e. another real-valued function g.
     • hθ(x) = g(fθ(x)) defines the hypothesis space: the set of possible hypotheses H changes depending on the parametric model (fθ) and on the thresholding function (g).
  6. Thresholding
     [Figure: the signal fθ(x) passed through three filters g: g = sign (hard threshold to ±1), g = identity, and g = logistic (soft threshold to [0, 1]).]
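A sketch of the three filters from the figure in code (my own illustration; the slides give no code):

```python
import numpy as np

def g_sign(z):      # hard threshold: -1 or +1 (0 exactly at z = 0)
    return np.sign(z)

def g_identity(z):  # identity: h_theta(x) = f_theta(x), as in linear regression
    return z

def g_logistic(z):  # soft threshold: squashes R into [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])  # example signal values
print(g_sign(z))                # [-1.  0.  1.]
print(g_identity(z))            # [-2.  0.  2.]
print(g_logistic(z))            # approximately [0.1192  0.5  0.8808]
```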
  7. The Logistic Function
     • Domain is R, codomain is [0, 1]: σ(z) = 1 / (1 + e⁻ᶻ).
     • Also known as the sigmoid function due to its "S" shape, or soft threshold (compared to the hard threshold imposed by sign).
     • When z = θᵀx we are applying a non-linear transformation to our linear signal.
     • Output can be genuinely interpreted as a probability value.
  8. Probabilistic Interpretation
     • Describing the set of hypotheses using the logistic function is not enough to state that the output can be interpreted as a probability.
       – All we know is that the logistic function always produces a real value between 0 and 1.
       – Other functions may be defined having the same property, e.g., (1/π) arctan(x) + 1/2.
     • The key points here are:
       – the output of the logistic function can be interpreted as a probability even during learning;
       – the logistic function is mathematically convenient!
  9. Probabilistic Interpretation: Odds Ratio
     • Let p (resp., q = 1 - p) be the probability of success (resp., failure) of an event.
     • odds(success) = p/q = p/(1-p)
     • odds(failure) = q/p = 1/(p/q) = 1/odds(success)
     • logit(p) = ln(odds(success)) = ln(p/q) = ln(p/(1-p))
     • Logistic Regression is in fact an ordinary linear regression where the logit is the response variable!
     • The coefficients of logistic regression are expressed in terms of the natural logarithm of odds.
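A worked numeric example of these relations (the value p = 0.8 is purely illustrative):

```python
import math

p = 0.8                          # probability of success
odds_success = p / (1 - p)       # 0.8 / 0.2 = 4.0
odds_failure = (1 - p) / p       # 0.25, i.e. 1 / odds_success
logit = math.log(odds_success)   # ln(4) ≈ 1.386
print(odds_success, odds_failure, logit)
```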
  10. Probabilistic Interpretation: Odds Ratio (cont.)
  11. Probabilistically-generated Data
      As for any other supervised learning problem, we can only deal with a finite set D of m labelled examples which we can try to learn from, where each yᵢ is a binary variable taking on two values {-1, +1}. That means we do not have access to the individual probability associated with each training sample! Still, we can assume that the data we observe from D, i.e. positive (+1) and negative (-1) samples, are actually generated by an underlying and unknown probability function (noisy target) which we want to estimate.
  12. Estimating the Noisy Target
      More formally, given the generic training example (x, y), we claim there exists a conditional probability P(y|x), which is defined as P(y = +1 | x) = φ(x) and P(y = -1 | x) = 1 - φ(x), where φ is the noisy target function.
      • Deterministic target function: given x as input it always outputs either y = +1 or y = -1 (mutually exclusive).
      • Noisy target function: given x as input it always outputs both y = +1 and y = -1, each with a "degree of certainty" associated.
      Goal: if we assume φ: R^(d+1) → [0, 1] is the underlying and unknown noisy target which generates our examples, our aim is to find an estimate φ* which best approximates φ.
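To make the noisy target concrete, here is a sketch that samples labels from an assumed φ (I use a logistic function of a hidden theta_true purely for illustration; the slides do not specify a particular φ):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([0.5, -1.0, 2.0])        # hidden "true" parameters (illustrative)

m = 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])  # each row starts with x_0 = 1
p = 1.0 / (1.0 + np.exp(-X @ theta_true))      # phi(x_i) = P(y_i = +1 | x_i)
y = np.where(rng.random(m) < p, 1, -1)         # we observe only the labels, never p
print(p, y)
```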
  13. Hypothesized Noisy Target
      We claim that the best estimate φ* of φ is h*θ(x), which in turn is picked from the set of hypotheses defined by the logistic function. But how do we select h*θ(x)? 2 elements are needed:
      – Training set D
      – Error Measure (Cost Function) to minimize
  14. The Error Measure
  15. The Best Hypothesis
      If the hypothesis space H is made of a family of parametric models, h*θ(x) can be picked as the hypothesis that maximises the posterior probability P(hθ | D). That is, we want to maximise the probability of the chosen hypothesis given the data D we observed.
  16. Flipping the Coin: (Data) Likelihood
      We measure the error we are making by assuming that h*θ(x) approximates the true noisy target φ. How likely is it that the observed data D have been generated by our selected hypothesis h*θ(x)? Find the hypothesis which maximises the probability of the observed data D given a particular hypothesis, i.e. the likelihood P(D | hθ).
  17. The Likelihood Function
      Given a generic training example (x, y), and assuming it has been generated by a hypothesis hθ(x), the likelihood function is P(y | x) = hθ(x) if y = +1, and 1 - hθ(x) if y = -1, where φ has been replaced with our hypothesis. If we assume the hypothesis is the logistic function, hθ(x) = l(θᵀx), and by noticing that the logistic function is symmetric, i.e. l(-z) = 1 - l(z), the likelihood for a single example is P(y | x) = l(y θᵀx).
  18. The Likelihood Function
      Having access to a full set of m i.i.d. training examples D = {(x₁, y₁), …, (x_m, y_m)}, the overall likelihood function is computed as the product of the per-example likelihoods: L(θ) = ∏ᵢ l(yᵢ θᵀxᵢ).
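A sketch of the per-example and overall likelihood in code (my own; it uses l(z) = sigma(z) and the symmetry sigma(-z) = 1 - sigma(z) from the previous slide):

```python
import numpy as np

def sigma(z):
    """Logistic function, sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(theta, X, y):
    """Product over the m examples of sigma(y_i * theta^T x_i)."""
    return np.prod(sigma(y * (X @ theta)))

def log_likelihood(theta, X, y):
    """Sum of log-likelihoods; numerically safer than the raw product."""
    return np.sum(np.log(sigma(y * (X @ theta))))
```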
  19. Why Does Likelihood Make Sense?
      How does the likelihood l(yᵢθᵀxᵢ) change w.r.t. the signs of yᵢ and θᵀxᵢ?

                  θᵀxᵢ > 0    θᵀxᵢ < 0
      yᵢ > 0        ≈ 1         ≈ 0
      yᵢ < 0        ≈ 0         ≈ 1

      If the label is concordant with the signal (either positively or negatively), then l(yᵢθᵀxᵢ) approaches 1: our prediction agrees with the true label. Conversely, if the label is discordant with the signal, then l(yᵢθᵀxᵢ) approaches 0: our prediction disagrees with the true label.
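A quick numerical check of the table, reusing sigma from the sketch above and taking |θᵀx| = 5 as a "large" signal:

```python
print(sigma(+1 * +5.0))  # ~0.993: positive label, positive signal (agree)
print(sigma(+1 * -5.0))  # ~0.007: positive label, negative signal (disagree)
print(sigma(-1 * +5.0))  # ~0.007: negative label, positive signal (disagree)
print(sigma(-1 * -5.0))  # ~0.993: negative label, negative signal (agree)
```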
  20. Maximum Likelihood Estimate
      Find the vector of parameters θ such that the likelihood function is maximum: θ* = argmax_θ ∏ᵢ l(yᵢ θᵀxᵢ).
  21. From MLE to In-Sample Error
      Generally speaking, given a hypothesis hθ and a training set D of m labelled samples, we are interested in measuring the "in-sample" (i.e. training) error E_in(θ) = (1/m) ∑ᵢ e(hθ(xᵢ), yᵢ), where e() measures how "far" the chosen hypothesis is from the true observed value. How can we "transform" the MLE into an expression similar to the "in-sample" error above?
  22. From MLE to In-Sample Error
      Maximising the likelihood is equivalent to maximising its logarithm, and hence to minimising the negative average log-likelihood: argmax_θ ∏ᵢ l(yᵢθᵀxᵢ) = argmin_θ (1/m) ∑ᵢ ln(1 / l(yᵢθᵀxᵢ)).
  23. From MLE to In-Sample Error
      By noticing that the logistic function can be rewritten as l(z) = 1 / (1 + e⁻ᶻ), so that ln(1 / l(z)) = ln(1 + e⁻ᶻ), we can finally write the "in-sample" error to be minimised: E_in(θ) = (1/m) ∑ᵢ ln(1 + e^(-yᵢθᵀxᵢ)). This is the Cross-Entropy Error.
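A direct translation of the cross-entropy error into code (a sketch; np.log1p(x) computes ln(1 + x)):

```python
import numpy as np

def cross_entropy_error(theta, X, y):
    """E_in(theta) = (1/m) * sum_i ln(1 + exp(-y_i * theta^T x_i))."""
    m = X.shape[0]
    return np.sum(np.log1p(np.exp(-y * (X @ theta)))) / m
```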
  24. The Learning Algorithm
  25. Picking the Best Hypothesis
      So far we have defined:
      – The model
      – The error measure (cross-entropy)
      To actually select the best hypothesis, we have to pick the vector of parameters so that the error measure is minimised. The usual way of achieving this is to compute the gradient with respect to θ (i.e. the vector of partial derivatives), set it to 0, and solve for θ.
  26. Mean Squared Error vs. Cross-Entropy
      In the case of linear regression we have a similar expression for the error measure, i.e. the Mean Squared Error (MSE). Minimising the MSE through Ordinary Least Squares (OLS) leads to a closed-form solution, often referred to as the OLS estimator for θ. The problem is that, using Cross-Entropy as the error measure, we cannot find a closed-form solution to the minimisation problem: we need an iterative solution.
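For contrast, the OLS estimator mentioned above does have a closed form, θ = (XᵀX)⁻¹Xᵀy. A sketch (assuming XᵀX is invertible):

```python
import numpy as np

def ols_estimator(X, y):
    """Closed-form OLS solution; solve() is preferred over an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

No analogous formula exists for the cross-entropy error, which is why we resort to an iterative method.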
  27. (Batch) Gradient Descent
      A general iterative method for any nonlinear optimisation. Under specific assumptions on the function to be minimised and on the learning rate parameter at each iteration, the method guarantees convergence to a local minimum. If the function is convex, like the cross-entropy error for logistic regression, then the local minimum is also the global minimum.
  28. Gradient Descent: The Idea
      1. At t = 0, initialize the (guessed) vector of parameters θ to θ(0).
      2. Repeat until convergence:
         a. Update the current vector of parameters θ(t) by taking a "step" along the "steepest" slope: θ(t+1) = θ(t) + ηv, where v is a unit vector representing the direction of the steepest slope.
         b. Return to step 2.
      Question: how do we compute the direction v? Depending on how we solve this, we may get different methods (Gradient Descent, Conjugate Gradient, etc.).
  29. Gradient Descent: The Direction v
      We already intuitively said that the direction v should be that of the "steepest" slope. Concretely, this means moving along the direction which most reduces the in-sample error function. We want ΔE_in to be as negative as possible, which means that we are actually reducing the error w.r.t. the previous iteration t-1.
  30. Gradient Descent: The Direction v
      Let's first assume we are in the univariate case, i.e. θ = θ ∈ R.
  31. Gradient Descent: The Direction v
      Using a first-order Taylor approximation with a second-order error term: ΔE_in = E_in(θ(t) + ηv) - E_in(θ(t)) = η E_in′(θ(t)) v + O(η²). To summarise and generalise to the multivariate case of θ: ΔE_in = η ∇E_in(θ(t))ᵀ v + O(η²). The Greek letter nabla (∇) indicates the gradient.
  32. Gradient Descent: The Direction v
      The unit vector v only contributes to the direction, not to the magnitude, of the iterative step. Therefore:
      – the maximum (i.e. most positive) step happens when the gradient of the error and the direction vector point in the same direction;
      – the minimum (i.e. most negative) step happens when the two vectors point in opposite directions.
  33. Gradient Descent: The Direction v
      At each iteration t, we want the unit vector which makes exactly the most negative step. Therefore: v = -∇E_in(θ(t)) / ||∇E_in(θ(t))||.
  34. Gradient Descent: The Step η
      How does the step magnitude η affect convergence? [Figure: three panels contrasting η too small (slow progress), η too large (oscillation/divergence), and η variable.] Rule of thumb: dynamically change η proportionally to the gradient!
  35. Gradient Descent: The Step η
      Remember that at each iteration the update strategy is θ(t+1) = θ(t) + ηv, where v = -∇E_in(θ(t)) / ||∇E_in(θ(t))||. At each iteration t, the step η is fixed.
  36. Gradient Descent: The Step ηt
      Instead of having a fixed η at each iteration, use a variable ηt as a function of η. If we take ηt = η ||∇E_in(θ(t))||, the update becomes θ(t+1) = θ(t) - η ∇E_in(θ(t)).
  37. Gradient Descent: The Algorithm
      1. At t = 0, initialize the (guessed) vector of parameters θ to θ(0).
      2. For t = 0, 1, 2, … until stop:
         a. Compute the gradient of the cross-entropy error, ∇E_in(θ(t)) (i.e. the vector of partial derivatives).
         b. Update the vector of parameters: θ(t+1) = θ(t) - η ∇E_in(θ(t)).
         c. Return to step 2.
      3. Return the final vector of parameters θ(∞).
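Putting the pieces together, a minimal sketch of batch gradient descent for logistic regression (my own implementation; the gradient of the cross-entropy error is ∇E_in(θ) = -(1/m) ∑ᵢ yᵢ xᵢ σ(-yᵢθᵀxᵢ), and the step size, tolerance, and iteration limit below are illustrative choices):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of the cross-entropy error E_in at theta."""
    m = X.shape[0]
    return -(X * (y * sigma(-y * (X @ theta)))[:, None]).sum(axis=0) / m

def batch_gradient_descent(X, y, eta=0.1, max_iters=10_000, tol=1e-6):
    theta = np.zeros(X.shape[1])      # theta(0); see the initialization discussion below
    for _ in range(max_iters):        # stop after T iterations at the latest
        g = gradient(theta, X, y)
        if np.linalg.norm(g) < tol:   # gradient (nearly) zero: stop
            break
        theta = theta - eta * g       # theta(t+1) = theta(t) - eta * grad E_in(theta(t))
    return theta
```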
  38. Discussion: Initialization
      • How do we choose the initial value of the parameters θ(0)?
      • If the function is convex, we are guaranteed to reach the global minimum no matter what the initial value of θ(0) is.
      • In general, we may get to the local minimum nearest to θ(0).
        – Problem: we may miss "better" local minima (or even the global minimum, if it exists).
        – Solution (heuristic): repeating GD 100 to 1,000 times, each time with a different θ(0), may give a sense of what the global minimum eventually is (no guarantees).
  39. Discussion: Termination
      • When does the algorithm stop?
      • Intuitively, when θ(t+1) = θ(t), i.e. -η ∇E_in(θ(t)) = 0, i.e. ∇E_in(θ(t)) = 0.
      • If the function is convex, we are guaranteed to reach the global minimum when ∇E_in(θ(t)) = 0.
        – i.e. there exists a unique local minimum which also happens to be the global minimum.
      • In general we don't know if eventually ∇E_in(θ(t)) = 0, therefore we can use several termination criteria, e.g.:
        – stop whenever the difference between two iterations is "small enough" → may converge "prematurely";
        – stop when the error equals ε → may not converge if the target error is not achievable;
        – stop after T iterations;
        – combinations of the above work in practice.
  40. Advanced Topics
      • Gradient Descent using a second-order approximation:
        – better local approximation than first-order, but each step requires computing the second derivatives (Hessian matrix);
        – Conjugate Gradient makes the second-order approximation "faster", as it doesn't require explicitly computing the full Hessian matrix.
      • Stochastic Gradient Descent (SGD):
        – at each step, only one sample is considered for computing the gradient of the error, instead of the full training set (see the sketch below).
      • L1 and L2 regularization to penalize extreme parameter values and deal with overfitting:
        – include the L1 or L2 norm of the vector of parameters θ in the cross-entropy error function to be minimised during learning.
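As an illustration of the last two points, a sketch of SGD with an L2 penalty λ||θ||² added to the cross-entropy error (λ, η, and the epoch count are illustrative choices, not values from the talk):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.05, lam=1e-3, epochs=10, seed=0):
    """SGD: one example per update; 2 * lam * theta is the L2 penalty gradient."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):             # visit examples in random order
            z = y[i] * (X[i] @ theta)
            grad_i = -y[i] * X[i] / (1.0 + np.exp(z)) + 2.0 * lam * theta
            theta -= eta * grad_i
    return theta
```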
