ICML2013 Reading Group: Large-Scale Learning with Less RAM via Randomization

Transcript

  • 1. Large-Scale Learning with Less RAM via Randomization [Golovin+ ICML13]. ICML Reading Group, 2013/07/09. Hidekazu Oiwa (@kisa12012)
  • 2. About me • Hidekazu Oiwa • Second-year doctoral student at the University of Tokyo (Nakagawa Lab) • Research areas: machine learning and natural language processing • Focus: optimization and sparsification for large-scale data • Twitter: @kisa12012
  • 3. Paper covered • Large-Scale Learning with Less RAM via Randomization (ICML 2013) • D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google) • http://arxiv.org/abs/1303.4664 (paper) • First presented at a NIPS 2012 workshop • Figures and tables below are taken from the paper
  • 4. One-slide summary • Reduce the memory footprint of the weight vector • The bit width is adjusted automatically so the model fits in GPU/L1 cache • SGD-based algorithms are proposed • Memory usage is reduced with almost no loss of accuracy (training: 50%, prediction: 95%) • Theoretical guarantees via regret bounds • Example: float (32 bits) β = (1.0, ..., 1.52) is rounded to Q2.2 (5 bits) β̂ = (1.0, ..., 1.50)
  • 5. Introduction
  • 6. Background: big data!! • Memory capacity becomes a critical constraint • The full dataset does not fit in memory • Can the data be processed within GPU/L1 cache? • Memory matters at prediction time, not only at training time • It affects the latency of search ads and mail filters • Goal: reduce the memory required for the weight vector
  • 7. In practice, 32-bit floats are overkill • Do we really need 32 bits of precision to store each weight? • [Figure 1 from the paper: histogram of the weight-vector values of a large-scale linear classifier trained on real data (x-axis: coefficient value, y-axis: number of features). Values are tightly grouped near zero; a large dynamic range is superfluous.]
  • 8. Reducing the bit count: how? • Is a fixed bit length enough? • If the optimal weight vector β* cannot be represented in the fixed bit length, the iterates never converge • Idea: during training, adapt the representation (the grid resolution) to the step size
  • 9. Algorithm
  • 10. Notation for fixed-length encodings • Qn.m: notation for a fixed-point representation • n: number of bits for the integer part • m: number of bits for the fractional part • Qn.m uses (n+m+1) bits in total, one of which is the sign bit • ε: the gap between adjacent representable points (ε = 2^-m)
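As a concrete illustration (not from the slides or the paper's code), a possible Qn.m encode/decode in Python; qnm_encode and qnm_decode are hypothetical helper names:

```python
def qnm_encode(x, n, m):
    """Encode float x as a Q(n.m) fixed-point integer code.

    n integer bits, m fractional bits, plus one sign bit;
    representable values form a grid with gap eps = 2**-m.
    """
    eps = 2.0 ** -m
    max_mag = 2 ** n - eps                 # largest representable magnitude
    x = max(-max_mag, min(x, max_mag))     # clip to the representable range
    return int(round(x / eps))             # store as a small signed integer

def qnm_decode(code, m):
    """Map a Q(n.m) integer code back to its float value."""
    return code * (2.0 ** -m)

# Q2.2 uses 2 + 2 + 1 = 5 bits and has gap eps = 0.25,
# matching the example on slide 4: 1.52 -> 1.50.
print(qnm_decode(qnm_encode(1.52, n=2, m=2), m=2))  # 1.5
```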
  • 11. Algorithm 1: OGD-Rand-1d
       input: feasible set F = [-R, R], learning-rate schedule η_t, resolution schedule ε_t
       define Project(β) = max(-R, min(β, R))
       initialize β̂_1 = 0
       for t = 1, ..., T:
           play the point β̂_t, observe g_t
           β_{t+1} = Project(β̂_t - η_t g_t)          <- an ordinary SGD/OGD step
           β̂_{t+1} = RandomRound(β_{t+1}, ε_t)        <- RandomRound drops the value back onto the Qn.m grid
       function RandomRound(β, ε):
           a = ε ⌊β/ε⌋;  b = ε ⌈β/ε⌉
           return b with probability (β - a)/ε, otherwise a
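A minimal runnable sketch of Algorithm 1 as transcribed above (my own Python, not the authors' code); the schedule functions eta and eps are assumptions supplied by the caller:

```python
import math
import random

def random_round(beta, eps):
    """Round beta to one of the two adjacent grid points (spacing eps);
    the choice is random so the result is unbiased in expectation."""
    a = eps * (beta // eps)                  # grid point just below beta
    b = a + eps                              # grid point just above beta
    return b if random.random() < (beta - a) / eps else a

def ogd_rand_1d(gradients, R, eta, eps):
    """One-dimensional online gradient descent with randomized rounding.

    gradients: the observed gradients g_1, ..., g_T
    eta(t), eps(t): learning-rate and resolution schedules (1-indexed)
    """
    beta_hat = 0.0
    played = []
    for t, g in enumerate(gradients, start=1):
        played.append(beta_hat)                           # play the current point
        beta = max(-R, min(beta_hat - eta(t) * g, R))     # gradient step + Project onto [-R, R]
        beta_hat = random_round(beta, eps(t))             # drop back onto the coarse grid
    return played

# Example schedules with eps_t = gamma * eta_t, as required by Theorem 3.1 (slide 13).
gamma, alpha = 0.1, 0.5
points = ogd_rand_1d([0.3, -0.2, 0.5, -0.1], R=1.0,
                     eta=lambda t: alpha / math.sqrt(t),
                     eps=lambda t: gamma * alpha / math.sqrt(t))
```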
  • 12. RandomRound illustrated • β lies between the grid points a and b • it is rounded up to b with probability (β - a)/ε and down to a with probability (b - β)/ε • Example: if the distances (β - a) : (b - β) are 1 : 4, then β̂ = a with probability 80% and β̂ = b with probability 20%
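A quick self-contained numerical check of the 1 : 4 example above (my own sketch, not from the slides):

```python
import random

def random_round(beta, eps):
    a = eps * (beta // eps)          # grid point just below beta
    return a + eps if random.random() < (beta - a) / eps else a

# beta = 0.05 sits between the grid points 0.0 and 0.25 (eps = 0.25),
# with distances 0.05 : 0.20 = 1 : 4, so it should round down about 80%
# of the time and up about 20% of the time.
draws = [random_round(0.05, 0.25) for _ in range(100_000)]
print(sum(d == 0.0 for d in draws) / len(draws))    # ~0.8
print(sum(d == 0.25 for d in draws) / len(draws))   # ~0.2
```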
  • 13. Choosing the gap ε_t
       • Given the SGD step size η_t and an arbitrary constant γ > 0, set the bit length so that ε_t ≤ γ η_t
       • The expected regret is then bounded by O(√T)
       • Theorem 3.1: run Algorithm 1 with a non-increasing learning-rate schedule η_t and a discretization schedule ε_t satisfying ε_t ≤ γ η_t for a constant γ > 0. Then, against any sequence of gradients g_1, ..., g_T (possibly chosen by an adaptive adversary) with |g_t| ≤ G, and any comparator point β* ∈ [-R, R],
           E[Regret(β*)] ≤ (2R)^2 / (2 η_T) + (1/2)(G^2 + γ^2) η_{1:T} + γ R √T
       • As γ → 0, this recovers the regret bound of the unrounded float version
  • 14. Per-coordinate learning rates (a.k.a. AdaGrad) [Duchi+ COLT10] • Use a different step size for each feature • Features that appear frequently get a rapidly decreasing step size • Features that appear only rarely keep a large step size • Reaches the optimum much faster • But an occurrence count must be stored for every feature (a 32-bit int each) • Approximate the counts with Morris' algorithm [Morris+ 78] -> 8 bits
  • 15. Morris algorithm
       • Replace the frequency counter with a random variable
       • Initialize a counter C = 1; each time the feature appears, increment C with probability p(C) = b^(-C), where the base b is a parameter
       • Report τ̃(C) = (b^C - b)/(b - 1) as the frequency
       • τ̃(C) is an unbiased estimator of the true count
       • The learning rate is then set to η_{t,i} = α / √(τ̃_{t,i} + 1), which avoids division by zero even when τ̃_{t,i} = 0
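A small sketch of the approximate counter described above, assuming a simple class-based interface (MorrisCounter is an illustrative name, not the paper's code):

```python
import random

class MorrisCounter:
    """Approximate counter (Morris 1978): stores only a small integer C.

    With base b close to 1 the estimate is more accurate but C grows
    faster; the paper's variant starts the counter at C = 1.
    """
    def __init__(self, base=1.1):
        self.b = base
        self.c = 1

    def increment(self):
        # Increment C with probability p(C) = b^(-C).
        if random.random() < self.b ** (-self.c):
            self.c += 1

    def estimate(self):
        # Unbiased estimate of the true count: (b^C - b) / (b - 1).
        return (self.b ** self.c - self.b) / (self.b - 1.0)

# Rough check: after 1000 increments the estimate is near 1000 on average
# (any single run can be far off).
counter = MorrisCounter(base=1.1)
for _ in range(1000):
    counter.increment()
print(counter.estimate())
```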
  • 16. Per-coordinate version: Algorithm 2 OGD-Rand
       input: feasible set F = [-R, R]^d, parameters α, γ > 0
       initialize β̂_1 = 0 ∈ R^d; τ_i = 0 for all i
       for t = 1, ..., T:
           play the point β̂_t, observe the loss function f_t
           for i = 1, ..., d:
               g_{t,i} = ∇f_t(x_t)_i; if g_{t,i} = 0 then continue
               τ_i ← τ_i + 1                               <- frequency counting; can be stored in 8 bits via Morris' algorithm
               η_{t,i} = α / √τ_i,  ε_{t,i} = γ η_{t,i}     <- the step size is set from the frequency information
               β_{t+1,i} = Project(β̂_{t,i} - η_{t,i} g_{t,i})
               β̂_{t+1,i} = RandomRound(β_{t+1,i}, ε_{t,i})
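A rough end-to-end sketch of Algorithm 2 (again my own Python, not the authors' implementation), using exact counts tau[i]; in the paper those counts would be kept by the 8-bit Morris counter and the weights stored in a Qn.m encoding rather than as Python floats:

```python
import math
import random

def random_round(value, eps):
    a = eps * (value // eps)
    return a + eps if random.random() < (value - a) / eps else a

def ogd_rand(loss_grads, d, R=1.0, alpha=0.1, gamma=0.1):
    """Per-coordinate OGD with randomized rounding.

    loss_grads: iterable of sparse gradients, each a dict {i: g_t_i}.
    """
    beta = [0.0] * d     # rounded weight vector
    tau = [0] * d        # per-coordinate occurrence counts
    for grad in loss_grads:
        for i, g in grad.items():
            if g == 0.0:
                continue
            tau[i] += 1
            eta = alpha / math.sqrt(tau[i])      # per-coordinate learning rate
            eps = gamma * eta                    # resolution tied to the step size
            step = max(-R, min(beta[i] - eta * g, R))   # Project onto [-R, R]
            beta[i] = random_round(step, eps)
    return beta

# Toy usage with three sparse gradients over d = 5 coordinates.
weights = ogd_rand([{0: 0.3, 2: -0.1}, {0: 0.2}, {2: 0.4, 4: -0.3}], d=5)
```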
  • 17. At prediction time, even more aggressive approximation is possible
       • As long as the effect on predictions is small, the bit count can be cut drastically at prediction time
       • Lemma 4.1, Lemma 4.2 and Theorem 4.3 bound the error that a given level of approximation can introduce, for the logistic loss
       • With compression on top, memory can be reduced down to the information-theoretic lower bound
       • However, that lower bound is not especially small
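Not from the slides: a toy illustration, with made-up weights, of rounding a trained model to a coarse q2.m grid for serving and computing the entropy behind the "Opt. Bits/Val" column on slide 21; round_to_q and optimal_bits_per_value are hypothetical helper names:

```python
import math
from collections import Counter

def round_to_q(value, m):
    """Deterministically round value to the q2.m grid (gap 2**-m)."""
    eps = 2.0 ** -m
    return eps * round(value / eps)

def optimal_bits_per_value(values):
    """Empirical entropy of the rounded coefficients, i.e. the
    information-theoretic lower bound on bits per stored value."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Made-up weights, most of them near zero as in Figure 1 of the paper.
weights = [0.0] * 900 + [0.01 * k for k in range(1, 101)]
rounded = [round_to_q(w, m=7) for w in weights]   # q2.7 encoding
print(optimal_bits_per_value(rounded))            # roughly 1 bit per value, far below 32
```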
  • 18. Experiments
  • 19. RCV1 dataset: 20,242 training examples, 677,399 test examples, 47,236 features
  • 20. CTR dataset
       • A private search-ads click-log dataset: 30M examples, 20M features
       • [Table 2 of the paper, rounding at training time: the fixed q2.13 encoding is 50% smaller than the 32-bit control with no loss; per-coordinate learning rates significantly improve predictions but use 64 bits per value; randomized counting plus adaptive or fixed precision reduces memory further, to 24 total bits per value or less; the benefit is more visible on the larger CTR data.]
       • The paper reports essentially the same results even on a dataset with billions of examples and billions of features
  • 21. Approximation quality of the prediction model
       Table 1 of the paper (rounding at prediction time, CTR data; fixed-point encodings vs. a 32-bit floating-point control; added loss is negligible even at 1.5 bits per value with optimal encoding):
           Encoding   AucLoss   Opt. Bits/Val
           q2.3       +5.72%    0.1
           q2.5       +0.44%    0.5
           q2.7       +0.03%    1.5
           q2.9       +0.00%    3.3
       "Opt. Bits/Val" is the size per value when the model is compressed down to the information-theoretic lower bound, (1/d) Σ_i log2(1/p(β_i)), where p(v) is the frequency of value v across all d coefficients.
  • 22. Summary
       • Reduces the memory footprint of the weight vector via randomized rounding
       • Training and prediction can be done with the model small enough to fit in GPU/L1 cache
       • The authors write that the extension to FOBOS and similar methods is straightforward, with only a proof sketch in a footnote, so whether it really holds probably needs to be checked by the reader
       • Example: float (32 bits) β = (1.0, ..., 1.52) -> Q2.2 (5 bits) β̂ = (1.0, ..., 1.50)
