Logistic Regression

Introduces logistic regression, inference by maximum likelihood with gradient descent, a comparison of L1 and L2 regularization, and generalized linear models.

Speaker notes
  • Unsupervised learning (clustering, dimensionality reduction, topic models): learn structure from unlabeled data; closely related to density estimation; summarizes the data. Semi-supervised learning: use both labeled and unlabeled samples for training; collecting many labels is sometimes costly, so use both.
  • Besides these criteria, also consider how familiar you are with each model.
  • Expected risk: define a loss function, choose a prediction function, and integrate the loss over the joint distribution of the input variables and the response value; minimizing this expected risk yields the optimal prediction function. In practice we do not know the joint distribution; we only have a finite sample drawn (with or without bias) from it, possibly containing noisy points. So instead we look for the prediction function by minimizing the loss on this finite sample, i.e. the empirical risk. On top of that, we usually restrict the objective to a chosen function family, which may well not contain the optimal (or even a near-optimal) prediction function. The error therefore has two parts: first, how close the best prediction function in the family F is to the truly optimal prediction function; second, the gap introduced by optimizing the empirical rather than the expected risk (a sketch of these definitions follows after these notes).
  • Logistic regression is one of the most popular classifiers. Advantages: 1. easy to understand and implement; 2. decent performance; 3. lightweight, with low training and prediction cost (it can handle large datasets); 4. easy to parallelize. Value to attendees: know what logistic regression is, its advantages and disadvantages, and what kinds of problems it suits; L1 and L2 regularization; how to do inference by maximizing the likelihood with gradient descent, and how to implement it.
  • For a generalized linear model, if the response variable follows a binomial or multinomial distribution and the logit function is chosen as the link function, the model is logistic regression. The logistic function is the inverse of the logit function (see the formula sketch after these notes).
  • Binary (binomial) logistic regression.
  • The negative gradient direction is the direction in which the function value decreases fastest (the likelihood, its gradient, and the SGD update are sketched after these notes).
  • Before re-deriving the likelihood, let's look at the theory behind these two regularizations.
  • Assume all weights follow the same distribution.
  • The first derivative of the Laplace density is discontinuous at zero. Assume all weights follow the same distribution (mean 0 and the same Laplace scale parameter). W_k is not allowed to change sign within a single update (the MAP objectives and update steps for both priors are sketched after these notes).
  • Weights fitted with L1 are usually sparse, which brings two benefits: it performs feature selection for us, and it is more convenient from an engineering point of view.
  • Adding a decay ratio improved AUC slightly (0.845 -> 0.847). The suitable decay ratio differs across step sizes. The iteration count depends on the size of the training set.
  • Example: today is the first day of the college entrance exam; when choosing a major, each student has several candidates but can pick only one (computer science, finance, chemistry, mathematics, physics, biology). Difference from binomial applications: a multi-class problem can be decomposed into several binary problems. If the task is "find the top 10% of students in each course", binary logistic regression works; if the task is "for each student, find the course (or few courses) they are best at", binary models are a poor fit, because the predicted probabilities across classes do not sum to 1 and cannot be compared. Multinomial LR suits a categorical response value and is less suitable for an ordinal one. I have implemented it (the multinomial prediction function is sketched after these notes).
  • Link function: (1) a key component of a generalized linear model, extending linear regression to the generalized linear model; (2) the inverse of the link function maps (-inf, +inf) into the range of the response; if y follows a binomial distribution, the response lies in [0, 1]. The inverse of any continuous cumulative distribution function (CDF) can be used as the link, since a CDF's range is [0, 1].
  • A generalized linear model is a linear model in the broad sense: every GLM has a basic linear unit W*X (as in linear regression), and a link function connects this linear unit to a response variable from some distribution. The family includes linear regression (normal distribution), logistic regression (binomial/multinomial distribution), and Poisson regression (Poisson distribution). For the binomial/multinomial distribution we can also choose link functions other than the logit, giving a generalized form of logistic regression (see the link-function sketch after these notes).
  • Logistic Regression
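The expected/empirical risk decomposition in the notes above can be written out explicitly. The slide equations are not reproduced in this transcript, so the notation here (loss $L$, joint distribution $P(x,y)$, sample size $n$, function family $\mathcal{F}$) is an illustrative assumption:

\[ R(f) = \int L\bigl(y, f(x)\bigr)\, dP(x, y) \quad \text{(expected risk)} \]
\[ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) \quad \text{(empirical risk)} \]
\[ R(\hat{f}_n) - R(f^{*}) = \underbrace{R(f_{\mathcal{F}}) - R(f^{*})}_{\text{approximation error of } \mathcal{F}} + \underbrace{R(\hat{f}_n) - R(f_{\mathcal{F}})}_{\text{estimation error from minimizing } \hat{R}_n} \]

Here $f^{*}$ is the overall optimal prediction function, $f_{\mathcal{F}}$ the best function inside the chosen family, and $\hat{f}_n$ the empirical-risk minimizer.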
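A sketch of the logit and logistic functions and the binary prediction function (slides 7 and 8); the weights $w$ and features $x$ are assumed notation:

\[ \operatorname{logit}(p) = \log\frac{p}{1-p}, \qquad \sigma(z) = \operatorname{logit}^{-1}(z) = \frac{1}{1 + e^{-z}} \]
\[ P(y = 1 \mid x; w) = \sigma(w^{\top} x) = \frac{1}{1 + e^{-w^{\top} x}} \]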
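A sketch of the likelihood and the gradient updates behind slides 9 and 10, assuming labels $y_i \in \{0, 1\}$, predictions $p_i = \sigma(w^{\top} x_i)$, and learning rate $\eta$:

\[ \ell(w) = \sum_{i=1}^{n} \bigl[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \bigr], \qquad \nabla_w \ell = \sum_{i=1}^{n} (y_i - p_i)\, x_i \]
\[ \text{SGD step on one sample: } w \leftarrow w + \eta\, (y_i - p_i)\, x_i \]

Maximizing $\ell$ by moving along its gradient is the same as gradient descent on the negative log-likelihood, which is where the steepest-descent remark above comes in.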
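A sketch of the MAP view behind slides 12 to 14, with an assumed Gaussian prior variance $\sigma_0^2$, Laplace scale $b$, and regularization strength $\lambda$:

\[ \hat{w}_{\text{MAP}} = \arg\max_{w}\ \bigl[ \ell(w) + \log p(w) \bigr] \]
\[ \text{Gaussian prior (L2): } p(w_k) \propto e^{-w_k^2 / 2\sigma_0^2} \;\Rightarrow\; w_k \leftarrow w_k + \eta \bigl[ (y_i - p_i)\, x_{ik} - \lambda w_k \bigr] \]
\[ \text{Laplace prior (L1): } p(w_k) \propto e^{-|w_k| / b} \;\Rightarrow\; w_k \leftarrow w_k + \eta \bigl[ (y_i - p_i)\, x_{ik} - \lambda \operatorname{sign}(w_k) \bigr] \]

In the L1 case the weight is clipped to zero whenever the update would flip its sign, which matches the implementation on slide 15.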
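A sketch of the multinomial (softmax) prediction function on slide 19, with classes $c \in \{1, \dots, K\}$ and one assumed weight vector $w_c$ per class:

\[ P(y = c \mid x; W) = \frac{e^{w_c^{\top} x}}{\sum_{c'=1}^{K} e^{w_{c'}^{\top} x}}, \qquad \sum_{c=1}^{K} P(y = c \mid x; W) = 1 \]

The class probabilities sum to 1, which is exactly the property a set of independent binary models lacks.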
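A sketch of the link functions listed on slide 20, written so that each inverse maps $(-\infty, +\infty)$ into $[0, 1]$; $\Phi$ denotes the standard normal CDF:

\[ \text{logit: } g(p) = \log\frac{p}{1-p}, \qquad \text{probit: } g(p) = \Phi^{-1}(p), \qquad \text{log-log: } g(p) = -\log\bigl(-\log p\bigr) \]

Each $g^{-1}(w^{\top} x)$ is a valid probability, so any of these links can connect the linear unit W*X to a binomial response.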

Slide transcript

    1. Machine learning workshop, guodong@hulu.com. Topics: machine learning introduction, logistic regression, feature selection, boosting and tree boosting. See more machine learning posts: http://dongguo.me
    2. Overview of machine learning: Machine Learning divides into Unsupervised Learning, Supervised Learning (Classification, e.g. logistic regression; Regression), and Semi-supervised Learning.
    3. How to choose a suitable model?

       Characteristic                         | Naïve Bayes | Trees | K Nearest neighbor | Logistic regression | Neural Networks | SVM
       Computational scalability              | 3           | 3     | 1                  | 3                   | 1               | 1
       Interpretability                       | 2           | 2     | 1                  | 2                   | 1               | 1
       Predictive power                       | 1           | 1     | 3                  | 2                   | 3               | 3
       Natural handling of "mixed"-type data  | 1           | 3     | 1                  | 1                   | 1               | 1
       Robustness to outliers in input space  | 3           | 3     | 3                  | 3                   | 1               | 1

       <Elements of Statistical Learning>, 2nd edition, p. 351
    4. Why a model can't perform perfectly on unseen data: expected risk; empirical risk; choosing a function family for prediction functions; error.
    5. Logistic regression
    6. Outline: Introduction; Inference; Regularization; Experiments; More (Multi-nominal LR, Generalized linear model); Application.
    7. Logit function and logistic function: the logit function; the logistic function (the inverse of the logit).
    8. Logistic regression: prediction function.
    9. Inference with maximum likelihood (1): likelihood; inference.
    10. Inference with maximum likelihood (2): inference (cont.); use gradient descent; stochastic gradient descent.
    11. Regularization: penalize large weights to avoid overfitting; L2 regularization; L1 regularization.
    12. Regularization: maximum a posteriori (MAP).
    13. L2 regularization: Gaussian prior; MAP; gradient descent step.
    14. L1 regularization: Laplace prior; MAP; gradient descent step.
    15. Implementation

        // L2 LR: one SGD update for feature 'fea'; error = (label - predicted probability)
        _weightOfFeatures[fea] += step * (feaValue * error - reguParam * _weightOfFeatures[fea]);

        // L1 LR: apply the L1 penalty in the direction opposite to the current weight's sign,
        // and clip the weight to zero if the update would flip its sign.
        if (_weightOfFeatures[fea] > 0) {
            _weightOfFeatures[fea] += step * (feaValue * error) - step * reguParam;
            if (_weightOfFeatures[fea] < 0) _weightOfFeatures[fea] = 0;
        } else if (_weightOfFeatures[fea] < 0) {
            _weightOfFeatures[fea] += step * (feaValue * error) + step * reguParam;
            if (_weightOfFeatures[fea] > 0) _weightOfFeatures[fea] = 0;
        } else {
            _weightOfFeatures[fea] += step * (feaValue * error);
        }

        (A fuller training-loop sketch follows the transcript.)
    16. L2 vs. L1. L2 regularization: almost all weights are non-zero; not suitable when training samples are scarce. L1 regularization: produces sparse parameter vectors; more suitable when most features are irrelevant; handles scarce training samples better.
    17. Experiments. Dataset: goal is gender prediction; training samples (431k), test samples (167k). Comparison algorithms: A: gradient descent with L1 regularization; B: gradient descent with L2 regularization; C: OWL-QN (L-BFGS-based optimization with L1 regularization). Parameter choices: regularization value; step (learning rate); decay ratio; stopping condition: max iteration count (50) or AUC change <= 0.0005.
    18. Experiments (cont.). Results:

        Parameters and metrics         | Gradient descent with L1 | Gradient descent with L2 | OWL-QN
        'Best' regularization term     | 0.001~0.005              | 0.0002~0.001             | 1
        Best step                      | 0.05                     | 0.02~0.05                | -
        Best decay ratio               | 0.85                     | 0.85                     | -
        Iteration times                | 26                       | 20~26                    | 48
        Non-zero features / all        | 10492/10938              | 10938/10938              | 6629/10938
        AUC                            | 0.8470                   | 0.8463                   | 0.8467
    19. Multi-nominal logistic regression: prediction function; inference with maximum likelihood; gradient descent step (L2).
    20. More link functions: inference with maximum likelihood; link function; link functions for the binomial distribution: logit function, probit function, log-log function.
    21. Generalized linear model. What is a GLM: a generalization of linear regression; connects the linear model to the response variable through a link function; allows more distributions for the response variable. Typical GLMs (overview): linear regression, logistic regression, Poisson regression.
    22. Application. Yahoo: 'Personalized Click Prediction in Sponsored Search', WSDM'10. Microsoft: 'Scalable Training of L1-Regularized Log-Linear Models', ICML'07. Baidu: contextual ads CTR prediction (http://www.docin.com/p-376254439.html). Hulu: demographic targeting; other ad-targeting projects; customer churn prediction; more…
    23. References: 'Scalable Training of L1-Regularized Log-Linear Models', ICML'07 (http://www.docin.com/p-376254439.html#); 'Generative and discriminative classifiers: Naïve Bayes and logistic regression', by Mitchell; 'Feature selection, L1 vs. L2 regularization, and rotational invariance', ICML'04.
    24. Recommended resources: Machine Learning open class, by Andrew Ng (//10.20.0.130/TempShare/Machine-Learning Open Class); http://www.cnblogs.com/vivounicorn/archive/2012/02/24/2365328.html; logistic regression implementation [link] (//10.20.0.130/TempShare/guodong/Logistic regression Implementation/), supporting binomial and multinominal LR with L1 and L2 regularization; OWL-QN (//10.20.0.130/TempShare/guodong/OWL-QN/).
    25. Thanks
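The per-feature update rules on slide 15 run inside a stochastic gradient descent loop over the training samples. Below is a minimal, self-contained C++ sketch of such a loop for the L2 case, assuming a sparse feature representation; the names Sample, sigmoid, and trainL2, and the toy data in main, are illustrative assumptions and not part of the original Hulu implementation.

    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // One training sample: label in {0, 1} plus a sparse feature -> value map.
    struct Sample {
        int label;
        std::map<std::string, double> features;
    };

    static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    // Minimal SGD loop for L2-regularized logistic regression.
    // 'step' is the learning rate, 'reguParam' the L2 strength, and
    // 'decayRatio' shrinks the step after every pass over the data.
    std::map<std::string, double> trainL2(const std::vector<Sample>& samples,
                                          double step, double reguParam,
                                          double decayRatio, int maxIterations) {
        std::map<std::string, double> weightOfFeatures;
        for (int iter = 0; iter < maxIterations; ++iter) {
            for (const Sample& s : samples) {
                // Predicted probability with the current weights.
                double score = 0.0;
                for (const auto& fv : s.features)
                    score += weightOfFeatures[fv.first] * fv.second;
                double error = s.label - sigmoid(score);
                // Per-feature update, same form as the slide-15 L2 rule.
                for (const auto& fv : s.features) {
                    double& w = weightOfFeatures[fv.first];
                    w += step * (fv.second * error - reguParam * w);
                }
            }
            step *= decayRatio;  // decay the learning rate each pass
        }
        return weightOfFeatures;
    }

    int main() {
        // Tiny illustrative dataset (hypothetical features, not the gender-prediction data).
        std::vector<Sample> samples = {
            {1, {{"watched_sports", 1.0}, {"age_bucket_30s", 1.0}}},
            {0, {{"watched_drama", 1.0}, {"age_bucket_20s", 1.0}}},
        };
        std::map<std::string, double> weights = trainL2(samples, 0.05, 0.001, 0.85, 50);
        for (const auto& kv : weights)
            std::printf("%s: %.4f\n", kv.first.c_str(), kv.second);
        return 0;
    }

The per-pass multiplication by decayRatio corresponds to the decay-ratio parameter tuned in the experiments on slides 17 and 18.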
