ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 Regularization Problems

Published on 2012-07-28. Slides presented at the ICML2012 reading group session on "Scaling Up Coordinate Descent Algorithms for Large L1 Regularization Problems".



  1. ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 Regularization Problems
     2012-07-28 Yoshihiko Suhara @sleepy_yoshi
  2. The paper
     • Scaling Up Coordinate Descent Algorithms for Large L1 Regularization Problems
       – by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin
     • Topic: parallel computation of coordinate descent
       – cf. [Bradley+ 11] Parallel Coordinate Descent for L1-Regularized Loss Minimization (ICML 2011)
  3. Overview
     • Presents a generic framework for parallel coordinate descent in shared-memory multicore environments
     • Proposes two methods:
       – Thread-Greedy Coordinate Descent
       – Coloring-Based Coordinate Descent
     • Compares four parallel CD methods experimentally
       – Thread-Greedy turned out to work surprisingly well
  4. Optimizing an L1-regularized loss function
     • L1-regularized loss:
       $\min_{\boldsymbol{w}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, (\boldsymbol{X}\boldsymbol{w})_i\right) + \lambda \|\boldsymbol{w}\|_1$
     • where
       – $\boldsymbol{X} \in \mathbb{R}^{n \times k}$: design matrix
       – $\boldsymbol{w} \in \mathbb{R}^{k}$: weight vector
       – $\ell(y, \cdot)$: differentiable convex loss
     • Examples (a small evaluation routine follows below)
       – Lasso (L1 + squared error)
       – L1-regularized logistic regression
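
To make the objective concrete, here is a minimal C sketch (not from the original slides; the function name and dense row-major storage are assumptions) that evaluates the L1-regularized objective for squared loss:

    #include <math.h>
    #include <stddef.h>

    /* Evaluates (1/n) * sum_i 0.5*(y_i - (Xw)_i)^2 + lambda * ||w||_1
       for a dense row-major n-by-k design matrix X. */
    double l1_objective(const double *X, const double *y, const double *w,
                        size_t n, size_t k, double lambda)
    {
        double loss = 0.0, l1 = 0.0;
        for (size_t i = 0; i < n; i++) {
            double z = 0.0;                      /* (Xw)_i */
            for (size_t j = 0; j < k; j++)
                z += X[i * k + j] * w[j];
            double r = y[i] - z;
            loss += 0.5 * r * r;                 /* squared loss */
        }
        for (size_t j = 0; j < k; j++)
            l1 += fabs(w[j]);                    /* L1 penalty */
        return loss / (double)n + lambda * l1;
    }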
  5. Notation
     • Column view: $\boldsymbol{X} = (\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots, \boldsymbol{X}_j, \ldots, \boldsymbol{X}_k)$
     • $j$-th standard basis vector: $\boldsymbol{e}_j = (0, 0, \ldots, 1, \ldots, 0)^T$
     • Row view: $\boldsymbol{X} = \begin{pmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \vdots \\ \boldsymbol{x}_i^T \\ \vdots \\ \boldsymbol{x}_n^T \end{pmatrix}$
  6. Background: Coordinate Descent
     • Also known in Japanese as 座標降下法 (?)
     • Performs a line search along a single selected coordinate
     • Coordinates can be chosen in many ways
       – e.g., Cyclic Coordinate Descent
     • For parallel computation, a subset of all coordinates is selected and updated at once (a sequential sweep is sketched below)
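
As an illustration of the per-coordinate line search, here is a hedged C sketch of one cyclic-CD sweep for the Lasso, where each one-dimensional subproblem has a closed-form soft-thresholding solution; column-major storage, nonzero columns, and all names are assumptions for the example:

    #include <stddef.h>

    /* Soft-thresholding operator S(c, lambda). */
    static double soft_threshold(double c, double lambda)
    {
        if (c >  lambda) return c - lambda;
        if (c < -lambda) return c + lambda;
        return 0.0;
    }

    /* One cyclic sweep. X is n-by-k, column j at X + j*n;
       r is the residual y - Xw, kept up to date in place. */
    void ccd_sweep(const double *X, double *r, double *w,
                   size_t n, size_t k, double lambda)
    {
        for (size_t j = 0; j < k; j++) {
            const double *xj = X + j * n;
            double a = 0.0, c = 0.0;
            for (size_t i = 0; i < n; i++) {
                a += xj[i] * xj[i];
                c += xj[i] * (r[i] + xj[i] * w[j]);  /* partial residual */
            }
            a /= (double)n;
            c /= (double)n;
            double wj_new = soft_threshold(c, lambda) / a;
            double d = wj_new - w[j];
            if (d != 0.0)
                for (size_t i = 0; i < n; i++)
                    r[i] -= xj[i] * d;               /* maintain r = y - Xw */
            w[j] = wj_new;
        }
    }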
  7. GenCD: A Generic Framework for Parallel Coordinate Descent
     (the slides switch to English from this point)
  8. Generic Coordinate Descent (GenCD)
  9. Step 1: Select
     • Selects a set $J$ of coordinates
     • The selection criterion differs across CD variants
       – cyclic CD (CCD)
       – stochastic CD (SCD): selects a singleton
       – fully greedy CD: $J = \{1, \ldots, k\}$
       – Shotgun [Bradley+ 11]: selects a random subset of a given size
  10. Step 2: Propose
      • Computes a proposed increment $\delta_j$ for each $j \in J$
        – this step does not actually change the weights
      • Maintains a vector $\boldsymbol{\varphi} \in \mathbb{R}^k$, where $\varphi_j$ is a proxy for the objective evaluated at $\boldsymbol{w} + \delta_j \boldsymbol{e}_j$
        – $\varphi_j$ is updated whenever a new proposal is calculated for $j$
        – $\boldsymbol{\varphi}$ is unnecessary if the algorithm accepts all proposals
  11. Step 3: Accept
      • The algorithm accepts a subset $J' \subseteq J$
        – [Bradley+ 11] show that correlations among features can lead to divergence if too many coordinates are updated at once
      • In CCD, SCD, and Shotgun, all proposals are accepted
        – no need to compute $\boldsymbol{\varphi}$
  12. Step 4: Update
      • Applies the accepted increments in $J'$ to the weights
      • The product $\boldsymbol{X}\boldsymbol{w}$ is maintained across updates (a sketch combining all four steps follows)
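
Putting the four steps together: the following hedged sketch instantiates GenCD as single-threaded stochastic CD, which per slide 11 accepts every proposal and needs no proxy. It reuses soft_threshold() from the cyclic-CD sketch above; all names are illustrative, not the paper's interface.

    #include <stddef.h>
    #include <stdlib.h>

    /* GenCD specialized to SCD for the Lasso. z = Xw is maintained,
       as this slide describes. X is column-major (column j at X + j*n). */
    void gencd_scd(const double *X, const double *y, double *z, double *w,
                   size_t n, size_t k, double lambda, int rounds)
    {
        for (int t = 0; t < rounds; t++) {
            /* Step 1: Select -- a single random coordinate. */
            size_t j = (size_t)rand() % k;
            const double *xj = X + j * n;

            /* Step 2: Propose -- exact 1-D minimizer for squared loss. */
            double a = 0.0, c = 0.0;
            for (size_t i = 0; i < n; i++) {
                a += xj[i] * xj[i];
                c += xj[i] * (y[i] - z[i] + xj[i] * w[j]);
            }
            a /= (double)n;
            c /= (double)n;
            double d = soft_threshold(c, lambda) / a - w[j];

            /* Step 3: Accept -- SCD accepts every proposal (no proxy phi). */

            /* Step 4: Update -- apply the increment and maintain z = Xw. */
            if (d != 0.0) {
                w[j] += d;
                for (size_t i = 0; i < n; i++)
                    z[i] += xj[i] * d;
            }
        }
    }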
  13. Approximate Minimization (1/2)
      • The Propose step calculates a proposed increment $\delta_j$ for each $j \in J$:
        $\delta_j = \arg\min_{\delta} F(\boldsymbol{w} + \delta \boldsymbol{e}_j) + \lambda |w_j + \delta|$
        where $F(\boldsymbol{w}) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, (\boldsymbol{X}\boldsymbol{w})_i\right)$
      • For a general loss function, there is no closed-form solution along a given coordinate
        – thus, consider approximate minimization
  14. Approximate Minimization (2/2)
      • Well-known minimizer (e.g., [Yuan and Lin 10]):
        $\delta = -\psi\left(w_j;\; \frac{\nabla_j F(\boldsymbol{w}) - \lambda}{\beta},\; \frac{\nabla_j F(\boldsymbol{w}) + \lambda}{\beta}\right)$
        where $\psi(x; a, b) = \begin{cases} a & \text{if } x < a \\ b & \text{if } x > b \\ x & \text{otherwise} \end{cases}$
      • $\beta = 1$ for squared loss, $\beta = 1/4$ for logistic loss
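
In C, this minimizer is a direct transcription of the formula above (the function names are mine, not the paper's):

    /* psi(x; a, b): clip x to the interval [a, b]. */
    static double psi(double x, double a, double b)
    {
        if (x < a) return a;
        if (x > b) return b;
        return x;
    }

    /* Proposed increment for one coordinate, given the coordinate
       gradient g = grad_j F(w) and the curvature constant beta
       (beta = 1 for squared loss, 1/4 for logistic loss). */
    static double propose_delta(double w_j, double g,
                                double lambda, double beta)
    {
        return -psi(w_j, (g - lambda) / beta, (g + lambda) / beta);
    }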
  15. Step 2: Propose (approximated)
      • The coordinate gradient is computed from the maintained $\boldsymbol{z} = \boldsymbol{X}\boldsymbol{w}$: $\nabla_j F(\boldsymbol{w}) = \frac{1}{n} \sum_{i} \ell'(y_i, z_i) X_{ij}$
      • The proxy $\varphi_j$ records the decrease in the approximated objective
  16. Experiments
  17. Algorithms (conventional)
      • SHOTGUN [Bradley+ 11] (see the sketch below)
        – Select step: a random subset of the columns
        – Accept step: accepts every proposal
          • no need to compute a proxy for the objective
        – convergence is guaranteed only if the number of selected coordinates is at most $P^* = \frac{k}{2\rho}$ (*1)
      • GREEDY
        – Select step: all coordinates
        – Propose step: each thread generates proposals for a subset of the coordinates using the approximation
        – Accept step: accepts only the single best proposal across all threads
      (*1) $\rho$ is the spectral radius (largest eigenvalue) of $\boldsymbol{X}^T \boldsymbol{X}$
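
A hedged OpenMP sketch of one SHOTGUN-style round (compile with gcc -fopenmp): P random coordinates are proposed in parallel, every proposal is accepted, and the shared vector z = Xw is patched with atomic adds. It reuses propose_delta() from the slide-14 sketch; grad_sq(), the storage layout, and the per-thread seeding are assumptions for the example.

    #include <omp.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Coordinate gradient of the squared loss: (1/n) * x_j^T (Xw - y). */
    static double grad_sq(const double *xj, const double *z,
                          const double *y, size_t n)
    {
        double g = 0.0;
        for (size_t i = 0; i < n; i++)
            g += xj[i] * (z[i] - y[i]);
        return g / (double)n;
    }

    void shotgun_round(const double *X, const double *y, double *z,
                       double *w, size_t n, size_t k,
                       double lambda, double beta, int P)
    {
        #pragma omp parallel for schedule(static)
        for (int t = 0; t < P; t++) {
            unsigned seed = (unsigned)t * 2654435761u + 12345u;
            size_t j = (size_t)rand_r(&seed) % k;   /* random coordinate */
            const double *xj = X + j * n;           /* column-major X    */
            double d = propose_delta(w[j], grad_sq(xj, z, y, n),
                                     lambda, beta);
            if (d != 0.0) {                         /* accept everything */
                #pragma omp atomic
                w[j] += d;
                for (size_t i = 0; i < n; i++) {
                    #pragma omp atomic
                    z[i] += xj[i] * d;              /* keep z = Xw */
                }
            }
        }
    }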
  18. Comparisons of the Algorithms
  19. Algorithms (proposed)
      • THREAD-GREEDY (see the sketch below)
        – Select step: a random set of coordinates (?)
        – Propose step: each thread generates proposals for a subset of the coordinates using the approximation
        – Accept step: each thread accepts the best of its own proposals
        – no proof of convergence (though empirical results are encouraging)
      • COLORING
        – Preprocessing: structurally independent features are identified via partial distance-2 coloring
        – Select step: a random color is selected
        – Accept step: accepts every proposal, since the selected features are disjoint
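
A hedged sketch of one THREAD-GREEDY round under the same assumptions as the SHOTGUN sketch: coordinates are split statically across threads, each thread proposes for every coordinate in its block, and each thread accepts only its own best proposal (ranked by the decrease of the quadratic approximation, the slide-15 proxy). It reuses grad_sq() and propose_delta() from the sketches above.

    #include <math.h>
    #include <omp.h>
    #include <stddef.h>

    void thread_greedy_round(const double *X, const double *y, double *z,
                             double *w, size_t n, size_t k,
                             double lambda, double beta)
    {
        #pragma omp parallel
        {
            size_t tid = (size_t)omp_get_thread_num();
            size_t nt  = (size_t)omp_get_num_threads();
            size_t lo  = tid * k / nt, hi = (tid + 1) * k / nt;
            size_t best_j = 0;
            double best_d = 0.0, best_gain = 0.0;
            for (size_t j = lo; j < hi; j++) {       /* this thread's block */
                const double *xj = X + j * n;
                double g = grad_sq(xj, z, y, n);
                double d = propose_delta(w[j], g, lambda, beta);
                /* decrease of the approximated objective */
                double gain = -(g * d + 0.5 * beta * d * d
                                + lambda * (fabs(w[j] + d) - fabs(w[j])));
                if (gain > best_gain) {
                    best_gain = gain; best_j = j; best_d = d;
                }
            }
            if (best_d != 0.0) {                     /* per-thread best only */
                const double *xb = X + best_j * n;
                w[best_j] += best_d;   /* blocks are disjoint: no race on w */
                for (size_t i = 0; i < n; i++) {
                    #pragma omp atomic
                    z[i] += xb[i] * best_d;          /* keep z = Xw */
                }
            }
        }
    }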
  20. Implementation and Platform
      • Implementation
        – gcc with OpenMP
          • -O3 -fopenmp flags
          • parallel for pragma
          • static scheduling: given n iterations and p threads, each thread gets n/p iterations (illustrated below)
      • Platform
        – AMD Opteron (Magny-Cours) with 48 cores (12 cores × 4 sockets)
        – 256 GB memory
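
For reference, the scheduling described on this slide corresponds to a loop fragment like the following, compiled with gcc -O3 -fopenmp; work() is a hypothetical per-iteration body, not code from the paper:

    /* schedule(static) with no chunk size: OpenMP divides the n
       iterations into p contiguous blocks of roughly n/p each,
       one block per thread. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        work(i);   /* hypothetical per-iteration body */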
  21. Datasets (number of non-zeros)
  22. Convergence rates (presenter's note: "I have no idea why")
  23. Scalability
  24. Summary
      • Presented GenCD, a generic framework for expressing parallel coordinate descent
        – Select, Propose, Accept, Update
      • Performed convergence and scalability tests for the four algorithms
        – but the authors do not favor any of these algorithms over the others
      • The condition for convergence of the THREAD-GREEDY algorithm is an open question
  25. References
      • [Yuan and Lin 10] G. Yuan and C. Lin, "A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification", Journal of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
      • [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, "Parallel Coordinate Descent for L1-Regularized Loss Minimization", in Proc. ICML '11, 2011.
  26. The End
