# Logistic Regression (SGD)

### Transcript

• 1. **Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression**

Bob Carpenter, Alias-i, Inc., carp@alias-i.com

**Abstract.** Stochastic gradient descent efficiently estimates maximum likelihood logistic regression coefficients from sparse input data. Regularization with respect to a prior coefficient distribution destroys the sparsity of the gradient evaluated at a single example. Sparsity is restored by lazily shrinking a coefficient along the cumulative gradient of the prior just before the coefficient is needed.

**1 Multinomial Logistic Model.** A multinomial logistic model classifies $d$-dimensional real-valued input vectors $x \in \mathbb{R}^d$ into one of $k$ outcomes $c \in \{0, \ldots, k-1\}$ using $k-1$ parameter vectors $\beta_0, \ldots, \beta_{k-2} \in \mathbb{R}^d$:

$$p(c \mid x, \beta) = \begin{cases} \exp(\beta_c \cdot x) / Z_x & \text{if } c < k-1 \\ 1 / Z_x & \text{if } c = k-1 \end{cases} \tag{1}$$

where the linear predictor is the inner product

$$\beta_c \cdot x = \sum_{i < d} \beta_{c,i} \, x_i \tag{2}$$

and the normalizing factor in the denominator is the partition function:

$$Z_x = 1 + \sum_{c < k-1} \exp(\beta_c \cdot x) \tag{3}$$

**2 Corpus Log Likelihood.** Given a sequence of $n$ data points $D = \langle x_j, c_j \rangle_{j < n}$, with $x_j \in \mathbb{R}^d$ and $c_j \in \{0, \ldots, k-1\}$, the log likelihood of the data in a model with parameter matrix $\beta$ is:

$$\log p(D \mid \beta) = \log \prod_{j < n} p(c_j \mid x_j, \beta) = \sum_{j < n} \log p(c_j \mid x_j, \beta) \tag{4}$$

**3 Maximum Likelihood Estimate.** The maximum likelihood (ML) estimate $\hat{\beta}$ is the value of $\beta$ maximizing the likelihood of the data $D$:

$$\hat{\beta} = \arg\max_\beta \, p(D \mid \beta) = \arg\max_\beta \, \log p(D \mid \beta) \tag{5}$$
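As a minimal sketch of equations (1)–(4), the model probability and corpus log likelihood can be computed directly. The function names and plain-list representation below are our own, not the paper's:

```python
import math

def predict(beta, x):
    """p(c | x, beta), eq. (1): k - 1 parameter vectors, with the last
    category's linear predictor implicitly fixed at zero."""
    # Linear predictors beta_c . x for c < k - 1, eq. (2).
    scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in beta]
    # Partition function Z_x, eq. (3).
    z = 1.0 + sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores] + [1.0 / z]

def log_likelihood(beta, data):
    """Corpus log likelihood, eq. (4), over (x_j, c_j) pairs."""
    return sum(math.log(predict(beta, x)[c]) for x, c in data)
```

The returned probabilities always sum to one, since the partition function normalizes all $k$ unnormalized scores.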
• 2. **4 Maximum a Posteriori Estimate.** Given a prior probability density $p(\beta)$ over a parameter matrix $\beta$, the maximum a posteriori (MAP) estimate $\hat{\beta}$ is:

$$\hat{\beta} = \arg\max_\beta \, p(\beta \mid D) = \arg\max_\beta \frac{p(D \mid \beta)\, p(\beta)}{p(D)} = \arg\max_\beta \, p(D \mid \beta)\, p(\beta) \tag{6}$$

**4.1 Gaussian Prior.** A Gaussian prior for parameter vectors with zero means and diagonal covariance is specified by a variance parameter $\sigma_i^2$ for each dimension $i < d$. The prior density for the full parameter matrix is:

$$p(\beta) = \prod_{c < k-1} \prod_{i < d} \mathrm{Norm}(0, \sigma_i^2)(\beta_{c,i}) \tag{7}$$

where the normal density function with mean 0 and variance $\sigma_i^2$ is:

$$\mathrm{Norm}(0, \sigma_i^2)(\beta_{c,i}) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{\beta_{c,i}^2}{2\sigma_i^2}\right) \tag{8}$$

**4.2 Laplace Prior.** A Laplace, or double-exponential, prior with zero means and diagonal covariance is specified by a variance parameter $\sigma_i^2$ for each dimension $i < d$. The prior density for the full parameter matrix is:

$$p(\beta) = \prod_{c < k-1} \prod_{i < d} \mathrm{Laplace}(0, \sigma_i^2)(\beta_{c,i}) \tag{9}$$

where the Laplace density function with mean 0 and variance $\sigma_i^2$ is:

$$\mathrm{Laplace}(0, \sigma_i^2)(\beta_{c,i}) = \frac{\sqrt{2}}{2\sigma_i} \exp\left(-\frac{\sqrt{2}\,|\beta_{c,i}|}{\sigma_i}\right) \tag{10}$$

The Laplace prior is more sharply peaked at its zero mean than a Gaussian of the same variance, and thus concentrates posteriors closer to zero.

**4.3 Cauchy Prior.** A Cauchy, or Lorentz, distribution is a Student-t distribution with one degree of freedom. A Cauchy prior centered at zero with scale $\lambda > 0$ has density:

$$\mathrm{Cauchy}(0, \lambda)(\beta_{c,i}) = \frac{1}{\pi} \cdot \frac{\lambda}{\beta_{c,i}^2 + \lambda^2} = \frac{1}{\pi \lambda \left(1 + (\beta_{c,i}/\lambda)^2\right)} \tag{11}$$

The Cauchy prior does not have a mean or variance; the integrals diverge. The distribution is symmetric, and centered around its median, zero. Larger scale values, like larger variances, stretch the distribution out to have fatter tails. The Cauchy prior has fatter tails than the Gaussian prior and thus has less of a tendency to concentrate posterior probability near the prior median.

**4.4 Uniform Prior.** The uniform, or uninformative, prior has a constant density:

$$p(\beta) \propto 1 \tag{12}$$
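The three proper priors can be compared numerically through their log densities, equations (8), (10) and (11). This is a sketch under the paper's variance/scale parameterizations, with hypothetical function names:

```python
import math

def gaussian_log_density(b, var):
    """log Norm(0, var)(b), eq. (8)."""
    return -0.5 * math.log(2.0 * math.pi * var) - b * b / (2.0 * var)

def laplace_log_density(b, var):
    """log Laplace(0, var)(b), eq. (10): parameterized by variance."""
    sigma = math.sqrt(var)
    return math.log(math.sqrt(2.0) / (2.0 * sigma)) - math.sqrt(2.0) * abs(b) / sigma

def cauchy_log_density(b, scale):
    """log Cauchy(0, scale)(b), eq. (11)."""
    return math.log(scale) - math.log(math.pi) - math.log(b * b + scale * scale)
```

Evaluating the three far from zero (say at $b = 10$ with unit variance/scale) exhibits the tail ordering the text describes: Gaussian thinnest, then Laplace, then Cauchy.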
• 3. The uniform prior is improper in the sense that the integral of the density diverges:

$$\int_{\mathbb{R}^d} 1 \, dx = \infty \tag{13}$$

The maximum likelihood estimate is the MAP estimate under an (improper) uninformative uniform prior:

$$\hat{\beta} = \arg\max_\beta \, p(D \mid \beta)\, p(\beta) = \arg\max_\beta \, p(D \mid \beta) \tag{14}$$

An equivalent formulation is as a Gaussian or Laplace prior with infinite variance. As $\sigma_i^2 \to \infty$, the distributions $\mathrm{Norm}(0, \sigma_i^2)$ and $\mathrm{Laplace}(0, \sigma_i^2)$ flatten toward uniform, and the gradient with respect to parameters $\beta_{c,i}$ (see section 6) approaches zero.

**5 Error Function.** The error function $\mathrm{Err}_R$ is indexed by the type of prior $R$, which must be one of maximum likelihood (ML), Laplace (L1), Gaussian (L2) or Cauchy (t1). The error is the sum of the likelihood error $\mathrm{Err}_{\mathrm{ML}}$ and the prior error $\mathrm{Err}_R$ for prior type $R$:

$$\begin{aligned} \mathrm{Err}_R(D, \beta, \sigma^2) &= \mathrm{Err}_{\mathrm{ML}}(D, \beta) + \mathrm{Err}_R(\beta, \sigma^2) \\ &= -\log p(D \mid \beta) - \log p(\beta \mid \sigma^2) \\ &= \sum_{j<n} -\log p(c_j \mid x_j, \beta) + \sum_{i<d} -\log p(\beta_i \mid \sigma_i^2) \\ &= \sum_{j<n} \mathrm{Err}_{\mathrm{ML}}(c_j, x_j, \beta) + \sum_{i<d} \mathrm{Err}_R(\beta_i, \sigma_i^2) \end{aligned} \tag{15}$$

The last line introduces notation for the contribution to the error due to the likelihood of a single case $\langle x_j, c_j \rangle$ and a single parameter $\beta_i$.

**6 Error Gradient.** The gradient of the error function is the collection of partial derivatives of the error function with respect to the parameters $\beta_{c,i}$:

$$\nabla_{c,i}\, \mathrm{Err}_R(D, \beta, \sigma^2) = \frac{\partial}{\partial \beta_{c,i}}\, \mathrm{Err}_R(D, \beta, \sigma^2) \tag{16}$$

$$= \frac{\partial}{\partial \beta_{c,i}}\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) + \frac{\partial}{\partial \beta_{c,i}}\, \mathrm{Err}_R(\beta, \sigma^2) = -\frac{\partial}{\partial \beta_{c,i}} \log p(D \mid \beta) - \frac{\partial}{\partial \beta_{c,i}} \log p(\beta \mid \sigma^2) \tag{17}$$

The derivatives distribute through the cases for the data log likelihood:

$$\frac{\partial}{\partial \beta_{c,i}} \log p(D \mid \beta) = \sum_{j<n} \frac{\partial}{\partial \beta_{c,i}} \log p(c_j \mid x_j, \beta) \tag{18}$$
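Equation (15) just adds a prior penalty to the negative log likelihood. A minimal sketch with a Gaussian (L2) penalty, dropping the constant term of the log prior; the function names are ours, not the paper's:

```python
import math

def neg_log_likelihood(beta, data):
    """Likelihood error Err_ML(D, beta) = -log p(D | beta)."""
    nll = 0.0
    for x, c in data:
        scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in beta]
        z = 1.0 + sum(math.exp(s) for s in scores)
        # Last category (c == k - 1) has linear predictor zero.
        log_p = (scores[c] - math.log(z)) if c < len(beta) else -math.log(z)
        nll -= log_p
    return nll

def l2_penalty(beta, var):
    """Gaussian prior error -log p(beta | var), up to an additive constant."""
    return sum(b_i * b_i / (2.0 * var) for b in beta for b_i in b)

def error(beta, data, var=None):
    """Eq. (15): Err_R = Err_ML + prior error; var=None is the uniform prior."""
    return neg_log_likelihood(beta, data) + (0.0 if var is None else l2_penalty(beta, var))
```

With `var=None` the error reduces to the maximum likelihood error, matching the uniform-prior observation above.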
• 4. A single case $\langle x_j, c_j \rangle$ contributes the following to the gradient of the error for output category $c$ at input dimension $i$:

$$\begin{aligned} \frac{\partial}{\partial \beta_{c,i}} \log p(c_j \mid x_j, \beta) &= \frac{\partial}{\partial \beta_{c,i}} \log \frac{\exp(\beta_{c_j} \cdot x_j)}{1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)} \\ &= \frac{\partial}{\partial \beta_{c,i}} (\beta_{c_j} \cdot x_j) - \frac{\partial}{\partial \beta_{c,i}} \log\Bigl(1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)\Bigr) \\ &= x_{j,i}\, \mathrm{I}(c = c_j) - \frac{1}{1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)} \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)\, \frac{\partial}{\partial \beta_{c,i}} (\beta_{c'} \cdot x_j) \\ &= x_{j,i}\, \mathrm{I}(c = c_j) - \frac{\exp(\beta_c \cdot x_j)}{1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)}\, x_{j,i} \\ &= x_{j,i} \left( \mathrm{I}(c = c_j) - p(c \mid x_j, \beta) \right) \end{aligned}$$

where the indicator function is:

$$\mathrm{I}(c_j = c) = \begin{cases} 1 & \text{if } c_j = c \\ 0 & \text{if } c_j \neq c \end{cases} \tag{19}$$

The residual for training example $j$ for outcome category $c$ is the difference between the outcome and the model prediction:

$$R_{c,j} = \mathrm{I}(c_j = c) - p(c \mid x_j, \beta) \tag{20}$$

The derivatives also distribute through parameters for the prior, and only the $(c,i)$ term depends on $\beta_{c,i}$:

$$\frac{\partial}{\partial \beta_{c,i}} \log p(\beta \mid \sigma^2) = \sum_{c' < k-1} \sum_{i' < d} \frac{\partial}{\partial \beta_{c,i}} \log p(\beta_{c',i'} \mid \sigma_{i'}^2) = \frac{\partial}{\partial \beta_{c,i}} \log p(\beta_{c,i} \mid \sigma_i^2) \tag{21}$$

For the Gaussian prior:

$$\frac{\partial}{\partial \beta_{c,i}} \log p(\beta_{c,i} \mid \sigma_i^2) = \frac{\partial}{\partial \beta_{c,i}} \left( \log \frac{1}{\sigma_i \sqrt{2\pi}} - \frac{\beta_{c,i}^2}{2\sigma_i^2} \right) = -\frac{\beta_{c,i}}{\sigma_i^2} \tag{22}$$
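The residual form of the single-case gradient, eq. (20), is what makes sparse SGD cheap: only dimensions with nonzero $x_{j,i}$ contribute. A sketch with our own names, checkable against finite differences:

```python
import math

def case_gradient(beta, x, c_obs):
    """Gradient of the single-case error -log p(c_obs | x, beta):
    grad[c][i] = -x_i * (I(c == c_obs) - p(c | x, beta)),
    i.e. minus the input times the residual of eq. (20)."""
    scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in beta]
    z = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    return [[-x_i * ((1.0 if c == c_obs else 0.0) - probs[c]) for x_i in x]
            for c in range(len(beta))]
```

If `x` is sparse, only its nonzero positions need to be visited, since every entry of the gradient is a multiple of some $x_{j,i}$.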
• 5. For the Laplace prior:

$$\frac{\partial}{\partial \beta_{c,i}} \log p(\beta_{c,i} \mid \sigma_i^2) = \frac{\partial}{\partial \beta_{c,i}} \left( \log \frac{\sqrt{2}}{2\sigma_i} - \frac{\sqrt{2}\, |\beta_{c,i}|}{\sigma_i} \right) = \begin{cases} -\sqrt{2}/\sigma_i & \text{if } \beta_{c,i} > 0 \\ \phantom{-}\sqrt{2}/\sigma_i & \text{if } \beta_{c,i} < 0 \end{cases} \tag{23}$$

For the Cauchy prior:

$$\begin{aligned} \frac{\partial}{\partial \beta_{c,i}} \log p(\beta_{c,i} \mid \lambda_i) &= \frac{\partial}{\partial \beta_{c,i}} \log \frac{1}{\pi} \cdot \frac{\lambda_i}{\beta_{c,i}^2 + \lambda_i^2} \\ &= \frac{\partial}{\partial \beta_{c,i}} \left( -\log \pi + \log \lambda_i - \log(\beta_{c,i}^2 + \lambda_i^2) \right) \\ &= -\frac{2\beta_{c,i}}{\beta_{c,i}^2 + \lambda_i^2} \end{aligned} \tag{24}$$

**7 Error Hessian.** The Hessian of the error function is the symmetric matrix of second derivatives with respect to pairs of parameters. In the multinomial setting with $k-1$ parameter vectors of dimensionality $d$, the Hessian matrix itself is of dimensionality $((k-1)d) \times ((k-1)d)$, with rows and columns indexed by pairs of categories and dimensions $(c, i)$, with $c \in \{0, \ldots, k-2\}$ and $i < d$.

The value of the Hessian is the sum of the Hessian of the likelihood error and the Hessian of the prior error. The Hessian of the likelihood error at row $(c, i)$ and column $(b, h)$ is derived by differentiating with respect to parameters $\beta_{c,i}$ and $\beta_{b,h}$:

$$\begin{aligned} \nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) &= \frac{\partial^2}{\partial \beta_{c,i}\, \partial \beta_{b,h}}\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) \\ &= \sum_{j<n} \frac{\partial}{\partial \beta_{b,h}} \left( -x_{j,i} \left( \mathrm{I}(c = c_j) - p(c \mid x_j, \beta) \right) \right) \\ &= \sum_{j<n} x_{j,i}\, \frac{\partial}{\partial \beta_{b,h}}\, p(c \mid x_j, \beta) \end{aligned} \tag{25}$$

where, differentiating the model probability,

$$\frac{\partial}{\partial \beta_{b,h}}\, p(c \mid x_j, \beta) = \frac{\partial}{\partial \beta_{b,h}} \frac{\exp(\beta_c \cdot x_j)}{1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)}$$
• 6. Carrying the quotient rule through:

$$\begin{aligned} \frac{\partial}{\partial \beta_{b,h}}\, p(c \mid x_j, \beta) &= \frac{\mathrm{I}(c = b)\, x_{j,h}\, \exp(\beta_c \cdot x_j)}{1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)} - \frac{\exp(\beta_c \cdot x_j)\, \exp(\beta_b \cdot x_j)\, x_{j,h}}{\left(1 + \sum_{c' < k-1} \exp(\beta_{c'} \cdot x_j)\right)^2} \\ &= p(c \mid x_j, \beta) \left( \mathrm{I}(c = b) - p(b \mid x_j, \beta) \right) x_{j,h} \end{aligned}$$

so that the likelihood contribution to the Hessian is:

$$\nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) = \sum_{j<n} x_{j,i}\, x_{j,h}\, p(c \mid x_j, \beta) \left( \mathrm{I}(c = b) - p(b \mid x_j, \beta) \right)$$

The priors contribute their own terms to the Hessian, but the only non-zero contributions are on the diagonal, for the Cauchy and Gaussian priors. The maximum likelihood (uniform) prior is constant, and thus has a zero gradient and zero Hessian. The Laplace prior's gradient is piecewise constant, so its Hessian is zero wherever it is defined:

$$\nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{L1}(\beta \mid \sigma^2) = \frac{\partial}{\partial \beta_{b,h}} \frac{\sqrt{2}\, \mathrm{signum}(\beta_{c,i})}{\sigma_i} = 0 \tag{26}$$

For the Gaussian prior:

$$\nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{L2}(\beta \mid \sigma^2) = \frac{\partial}{\partial \beta_{b,h}} \frac{\beta_{c,i}}{\sigma_i^2} = \mathrm{I}((c,i) = (b,h))\, \frac{1}{\sigma_i^2} \tag{27}$$

For the Cauchy prior:

$$\nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{t1}(\beta \mid \lambda) = \frac{\partial}{\partial \beta_{b,h}} \frac{2\beta_{c,i}}{\beta_{c,i}^2 + \lambda_i^2} = \mathrm{I}((c,i) = (b,h))\, \frac{2(\lambda_i^2 - \beta_{c,i}^2)}{(\beta_{c,i}^2 + \lambda_i^2)^2} \tag{28}$$
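The single-case likelihood Hessian can be assembled directly from the final expression above. This is a sketch with our own flattened `(c * d + i)` indexing, not anything prescribed by the paper:

```python
import math

def case_hessian(beta, x):
    """Hessian of the single-case likelihood error: entry at row (c, i),
    column (b, h) is x_i * x_h * p(c | x) * (I(c == b) - p(b | x))."""
    k1, d = len(beta), len(x)
    scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in beta]
    z = 1.0 + sum(math.exp(s) for s in scores)
    p = [math.exp(s) / z for s in scores]
    hess = [[0.0] * (k1 * d) for _ in range(k1 * d)]
    for c in range(k1):
        for i in range(d):
            for b in range(k1):
                for h in range(d):
                    hess[c * d + i][b * d + h] = (
                        x[i] * x[h] * p[c] * ((1.0 if c == b else 0.0) - p[b]))
    return hess
```

Numerically, $z^\top H z \ge 0$ for any vector $z$, which is the positive semi-definiteness used in section 8.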
• 7. **8 Concavity and Existence of Solution.** For any of the priors, uninformative, Cauchy, Gaussian or Laplace, the log likelihood is a concave function of the parameters, so the error function is convex: for all $\beta, \beta' \in \mathbb{R}^{(k-1) \times d}$ and all $\lambda$ such that $0 \le \lambda \le 1$:

$$\mathrm{Err}_R(D, \lambda\beta + (1-\lambda)\beta', \sigma^2) \le \lambda\, \mathrm{Err}_R(D, \beta, \sigma^2) + (1-\lambda)\, \mathrm{Err}_R(D, \beta', \sigma^2) \tag{29}$$

To show that a function is convex, it suffices to show that its Hessian is positive semi-definite; this generalizes the non-negative second derivative condition required for a one-dimensional function to be convex. A square matrix $M$ is positive semi-definite if and only if for every (column) vector $x$:

$$x^\top M x \ge 0 \tag{30}$$

which, evaluating the product, amounts to:

$$\sum_i \sum_j x_i\, x_j\, M_{i,j} \ge 0 \tag{31}$$

For the maximum likelihood error, this condition amounts to requiring, for all vectors $z$ of dimensionality $(k-1)d$:

$$z^\top \left( \nabla^2\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) \right) z \ge 0 \tag{32}$$

which is equivalent to:

$$\sum_{(c,i)} \sum_{(b,h)} z_{c,i}\, z_{b,h}\, \nabla^2_{(c,i),(b,h)}\, \mathrm{Err}_{\mathrm{ML}}(D, \beta) \ge 0 \tag{33}$$

which, plugging in the Hessian's value, yields:

$$\sum_{(c,i)} \sum_{(b,h)} \sum_{j<n} z_{c,i}\, z_{b,h}\, x_{j,i}\, x_{j,h}\, p(c \mid x_j, \beta) \left( \mathrm{I}(c = b) - p(b \mid x_j, \beta) \right) \ge 0 \tag{34}$$

For the Gaussian prior with finite positive variance (and for the Cauchy prior where the coefficients satisfy $|\beta_{c,i}| < \lambda_i$), the error function is strictly convex, which strengthens this requirement, for any pair of unequal vectors $\beta \ne \beta'$ and any $\lambda$ such that $0 < \lambda < 1$, to:

$$\mathrm{Err}_R(D, \lambda\beta + (1-\lambda)\beta', \sigma^2) < \lambda\, \mathrm{Err}_R(D, \beta, \sigma^2) + (1-\lambda)\, \mathrm{Err}_R(D, \beta', \sigma^2) \tag{35}$$

To show that a function is strictly convex, it suffices to show that its Hessian is positive definite. A square matrix $M$ is positive definite if and only if for every non-zero vector $x$:

$$x^\top M x > 0 \tag{36}$$

With a Gaussian prior, the prior's Hessian is positive definite, and hence the Hessian of the combined error function is positive definite.

With a convex error function, there exists a MAP parameter estimate $\hat{\beta}$ where the gradient evaluated at $\hat{\beta}$ is zero:

$$\nabla\, \mathrm{Err}_R(D, \beta, \sigma^2)(\hat{\beta}) = \frac{\partial}{\partial \beta}\, \mathrm{Err}_R(D, \beta, \sigma^2)(\hat{\beta}) = 0 \tag{37}$$

In the maximum likelihood setting, or with uninformative priors for some dimensions in other cases, there may not be a unique solution.
With a strictly convex error function, as arises with finite-variance priors, there will be a unique MAP estimate.

**9 MAP Estimate Expectation Constraints.** In maximum likelihood estimation, the estimated parameters $\hat{\beta}$ are a zero of the gradient at every category $c < k-1$ and dimension $i < d$:

$$\nabla_{c,i}\, \mathrm{Err}_{\mathrm{ML}}(D, \hat{\beta}) = -\sum_{j<n} x_{j,i} \left( \mathrm{I}(c_j = c) - p(c \mid x_j, \hat{\beta}) \right) = 0 \tag{38}$$
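Equation (38) says that at the ML estimate, the feature-weighted residuals sum to zero in every dimension. The sketch below fits a tiny non-separable binary problem with plain batch gradient descent (a hypothetical stand-in for the paper's SGD, with our own names) and lets the constraint be checked numerically:

```python
import math

def fit_ml(data, k, d, lr=0.5, iters=2000):
    """Batch gradient descent on the maximum likelihood error for a
    multinomial logistic model with k - 1 parameter vectors."""
    beta = [[0.0] * d for _ in range(k - 1)]
    for _ in range(iters):
        grad = [[0.0] * d for _ in range(k - 1)]
        for x, c_obs in data:
            scores = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in beta]
            z = 1.0 + sum(math.exp(s) for s in scores)
            p = [math.exp(s) / z for s in scores]
            for c in range(k - 1):
                resid = (1.0 if c == c_obs else 0.0) - p[c]  # eq. (20)
                for i in range(d):
                    grad[c][i] -= x[i] * resid  # eq. (38) summand, negated
        for c in range(k - 1):
            for i in range(d):
                beta[c][i] -= lr * grad[c][i]
    return beta
```

On a dataset where the same input appears with different labels, the ML estimate is finite, and after convergence the sums $\sum_j x_{j,i} (\mathrm{I}(c_j = c) - p(c \mid x_j, \hat{\beta}))$ vanish to numerical precision.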