1. Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices
   Tomonari MASADA (正田 備也), Nagasaki University, [email_address]
2. Overview
   - What is CVB?
   - Parallelization of CVB for LDA
   - Implementation for GPGPU
     - GPGPU = Nvidia CUDA compatible devices
3. LDA (latent Dirichlet allocation)
4. latent Dirichlet allocation [Blei et al. 02]
   - Bayesian multi-topic document model
     - multi-topic: a document is a mixture of K topics
     - Bayesian: introducing a prior and obtaining a posterior
5. [Figure: each topic t_k is a word multinomial (φ_k1, φ_k2, φ_k3, φ_k4 over words v_1, ..., v_4), drawn from a symmetric Dirichlet prior Dir(β).]
6. [Figure: each document j has a per-document topic multinomial (θ_j1, θ_j2, θ_j3), drawn from a symmetric Dirichlet prior Dir(α).]
7. [Figure: generating a document: each word token is drawn by sampling a topic from θ_j and then a word from that topic's multinomial φ_k.]
8. Posterior distribution
   - Joint distribution:
     p(x, z, θ, φ | α, β) = Π_j p(θ_j | α) · Π_k p(φ_k | β) · Π_j Π_i p(z_ji | θ_j) p(x_ji | z_ji, φ)
   - Posterior (unknown variables z, θ, φ given known data x):
     p(z, θ, φ | x, α, β) = p(x, z, θ, φ | α, β) / p(x | α, β)
9. Inference methods for LDA
   - Variational Bayesian inference [Blei et al. 02]
     - Approximating the posterior by a variational method
   - Collapsed Gibbs sampling [Griffiths et al. 04]
     - Marginalizing θ_jk and φ_kw
     - Sampling z_ji
   - Collapsed variational Bayesian inference (CVB) [Teh et al. 06]
     - Marginalizing θ_jk and φ_kw
     - Approximating the posterior by a variational method
10. [Figure: every (document, word) occurrence carries a vector of parameters, one per topic: e.g. "vote" in doc 1 has (γ_111, γ_112, γ_113), "prime" in doc 2 has (γ_231, γ_232, γ_233), "party" in doc 3 has (γ_321, γ_322, γ_323).]
11. Interpretation of γ_jwk
    - γ_jwk = how strongly word w in document j relates to topic k
12. Algorithm of CVB
    for each d_j
      for each v_w in d_j
        for each t_k
          update γ_jwk
        next
      next
    next
    - O(MK) time, where M = # of unique doc-word pairs, K = # of topics
    - (j: doc id, w: word id, k: topic id)
13. Updating posterior parameters
    γ_jwk ∝ (α + E[n_jk])
            · (β + E[n_kw]) / (Wβ + E[n_k])
            · exp[ −Var[n_jk] / (2(α + E[n_jk])²)
                   − Var[n_kw] / (2(β + E[n_kw])²)
                   + Var[n_k] / (2(Wβ + E[n_k])²) ]
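As a minimal sketch in CUDA C (the function name and argument layout are illustrative, not from the slides), the slide-13 update for a single (j, w, k) triple could look like:

```cuda
#include <math.h>

// Unnormalized γ_jwk from slide 13. The E[·]/Var[·] arguments are the
// statistics with the current pair's contribution already removed.
__host__ __device__ float unnormalized_gamma(
    float alpha, float beta, int W,
    float E_njk, float V_njk,   // E[n_jk], Var[n_jk]
    float E_nkw, float V_nkw,   // E[n_kw], Var[n_kw]
    float E_nk,  float V_nk)    // E[n_k],  Var[n_k]
{
    float a = alpha + E_njk;
    float b = beta + E_nkw;
    float c = W * beta + E_nk;
    // Second-order correction terms from the Gaussian approximation.
    float corr = -V_njk / (2.0f * a * a)
                 - V_nkw / (2.0f * b * b)
                 + V_nk  / (2.0f * c * c);
    return a * (b / c) * expf(corr);
}
```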
14. Approximation by Gaussian
    - Means and variances:
      - E[n_jk] = Σ_w n_wj γ_jwk,    Var[n_jk] = Σ_w n_wj γ_jwk (1 − γ_jwk)
      - E[n_kw] = Σ_j n_wj γ_jwk,    Var[n_kw] = Σ_j n_wj γ_jwk (1 − γ_jwk)
      - E[n_k]  = Σ_{w,j} n_wj γ_jwk, Var[n_k] = Σ_{w,j} n_wj γ_jwk (1 − γ_jwk)
    - n_jk: # of word tokens that relate to topic k and appear in document j
    - n_kw: # of tokens of word w that relate to topic k
    - n_k: # of word tokens that relate to topic k
15. [Figure: array sizes]
    - γ_jwk: O(MK) size
    - E[n_jk], Var[n_jk]: O(JK) size
    - E[n_kw], Var[n_kw]: O(KW) size
    - E[n_k], Var[n_k]: O(K) size
16. Details of CVB for LDA
    for each d_j
      for each v_w in d_j
        for each t_k
          1. E[n_jk] −= n_wj · γ_jwk;  Var[n_jk] −= n_wj · γ_jwk · (1 − γ_jwk)
          2. update γ_jwk
          3. E[n_jk] += n_wj · γ_jwk;  Var[n_jk] += n_wj · γ_jwk · (1 − γ_jwk)
        next
      next
    next
    - Update the other two types of E[·] and Var[·] in the same manner.
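For concreteness, here is a hedged host-side sketch of one full sequential sweep; the array layouts, the Pair type, and all variable names are illustrative assumptions, and unnormalized_gamma is the sketch given after slide 13:

```cuda
#include <vector>

struct Pair { int doc, word, count; };  // one unique (document, word) pair

// One CVB sweep over all M unique doc-word pairs.
// gamma[m*K + k] = γ_jwk for the m-th pair; the E/V arrays follow slide 15.
void cvb_sweep(const Pair* pairs, int M, int K, int W, float alpha, float beta,
               float* gamma, float* E_njk, float* V_njk,
               float* E_nkw, float* V_nkw, float* E_nk, float* V_nk)
{
    std::vector<float> g(K);
    for (int m = 0; m < M; ++m) {
        int j = pairs[m].doc, w = pairs[m].word;
        float n = (float)pairs[m].count, norm = 0.0f;
        for (int k = 0; k < K; ++k) {
            float gk = gamma[m*K + k], v = n * gk * (1.0f - gk);
            // 1. Remove this pair's current contribution.
            E_njk[j*K + k] -= n * gk;  V_njk[j*K + k] -= v;
            E_nkw[k*W + w] -= n * gk;  V_nkw[k*W + w] -= v;
            E_nk[k]        -= n * gk;  V_nk[k]        -= v;
            // 2. Recompute the unnormalized parameter (slide 13).
            g[k] = unnormalized_gamma(alpha, beta, W,
                                      E_njk[j*K + k], V_njk[j*K + k],
                                      E_nkw[k*W + w], V_nkw[k*W + w],
                                      E_nk[k],        V_nk[k]);
            norm += g[k];
        }
        for (int k = 0; k < K; ++k) {
            float gk = g[k] / norm, v = n * gk * (1.0f - gk);
            gamma[m*K + k] = gk;
            // 3. Add the refreshed contribution back.
            E_njk[j*K + k] += n * gk;  V_njk[j*K + k] += v;
            E_nkw[k*W + w] += n * gk;  V_nkw[k*W + w] += v;
            E_nk[k]        += n * gk;  V_nk[k]        += v;
        }
    }
}
```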
17. Parallelization of CVB for LDA: "as many threads as topics"
18. Parallelization of CVB
    for each d_j                 ← conventional parallelization (distribute documents)
      for each v_w in d_j
        for each t_k             ← proposed parallelization (distribute topics)
          update γ_jwk
        next
      next
    next
19. Strategy: "different topics for different threads"
    γ_jwk ∝ (α + E[n_jk]) · (β + E[n_kw]) / (Wβ + E[n_k]) · exp[ ... ]   (slide 13)
    - γ_jw1 + γ_jw2 + ··· + γ_jwK = 1: normalization is required!
    - O(MK) → O(M log K)
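A hedged CUDA sketch of this strategy, with one thread block per doc-word pair and one thread per topic; the kernel name, launch shape, and the omitted bookkeeping are assumptions, and Pair and unnormalized_gamma come from the earlier sketches:

```cuda
__device__ float block_sum(float* g, int K, int k);  // see the next slide's sketch

// Launch: update_gamma<<<M, K, K * sizeof(float)>>>(...);
__global__ void update_gamma(const Pair* pairs, int K, int W,
                             float alpha, float beta, float* gamma,
                             const float* E_njk, const float* V_njk,
                             const float* E_nkw, const float* V_nkw,
                             const float* E_nk,  const float* V_nk)
{
    extern __shared__ float g[];   // K unnormalized values, one per thread
    int m = blockIdx.x;            // which (doc, word) pair
    int k = threadIdx.x;           // which topic
    int j = pairs[m].doc, w = pairs[m].word;
    // (Steps 1 and 3 of slide 16, removing and re-adding this pair's
    // contribution to the statistics, are omitted here for brevity.)
    g[k] = unnormalized_gamma(alpha, beta, W,
                              E_njk[j*K + k], V_njk[j*K + k],
                              E_nkw[k*W + w], V_nkw[k*W + w],
                              E_nk[k],        V_nk[k]);
    float gk = g[k];               // keep own value; the reduction destroys g[]
    __syncthreads();
    float total = block_sum(g, K, k);   // O(log K) normalization sum
    gamma[m*K + k] = gk / total;        // normalized γ_jwk
}
```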
20. Reduction for normalization: O(log K)
    [Figure: tree reduction across threads.]
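A minimal sketch of the O(log K) reduction used above (function name illustrative; the standard shared-memory tree reduction):

```cuda
// Tree reduction in shared memory: K threads obtain Σ_k g[k] in O(log K)
// steps. Assumes K is a power of two (pad g[] with zeros otherwise).
__device__ float block_sum(float* g, int K, int k)
{
    for (int stride = K / 2; stride > 0; stride >>= 1) {
        if (k < stride) g[k] += g[k + stride];
        __syncthreads();   // all threads sync at every level of the tree
    }
    return g[0];           // after the final sync, g[0] holds the total
}
```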
21. Related Work
22. A new approach (not a parallelization)
    - Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation [Porteous et al. KDD 2008]
    - Algorithmic acceleration of collapsed Gibbs sampling for LDA
      - Complexity is not proportional to # of topics.
23. Accelerating CVB for LDA by GPGPU
24. Nvidia CUDA (Compute Unified Device Architecture)
    [Figure: CUDA hierarchy: a grid of thread blocks, each with shared memory and per-thread registers, backed by device memory; documents are mapped to blocks and topics to threads.]
25. Device memory access latency
    [Figure: the same hierarchy; each block's shared memory is only 16 KB, but far faster to access than device memory.]
26. Data transfer latency
    - Transfer one large block instead of many smaller ones!
    [Figure: transfers between host memory and device memory.]
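A minimal sketch of that advice (the function and the O(MK) γ table being copied are illustrative; cudaMemcpy is the standard CUDA copy call):

```cuda
#include <cuda_runtime.h>

// Upload the O(MK) γ table in a single transfer: one large copy
// amortizes the per-call and PCIe setup latency.
void upload_gamma(float* d_gamma, const float* h_gamma, int M, int K)
{
    size_t bytes = (size_t)M * (size_t)K * sizeof(float);
    cudaMemcpy(d_gamma, h_gamma, bytes, cudaMemcpyHostToDevice);

    // Anti-pattern: M small copies, each paying the setup latency.
    // for (int m = 0; m < M; ++m)
    //     cudaMemcpy(d_gamma + (size_t)m * K, h_gamma + (size_t)m * K,
    //                K * sizeof(float), cudaMemcpyHostToDevice);
}
```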
27. [Figure: parameters of the approximated posterior: γ_jwk together with E[n_jk], Var[n_jk] (O(JK) size), E[n_kw], Var[n_kw] (O(KW) size), and E[n_k], Var[n_k] (O(K) size).]
28. Where to store?
    - Posterior parameters
      - γ_jwk: O(K) size (for a fixed doc-word pair) → registers
    - Means and variances
      - E[n_jk], Var[n_jk]: O(K) size for a fixed doc → shared memory (for summation)
      - E[n_kw], Var[n_kw]: O(KW) size → device memory
      - E[n_k], Var[n_k]: O(K) size → registers
29. Write conflicts
    [Figure: γ_jwk and γ_j'wk, i.e. updates from different documents j and j' containing the same word w, both write to the same O(KW)-size entries E[n_kw] and Var[n_kw].]
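One standard way to keep such concurrent updates from losing writes (a sketch of a general technique, not necessarily what the paper's implementation does) is atomic accumulation; note that float atomicAdd in device memory requires compute capability 2.0, newer than the GTX 260 used in these experiments:

```cuda
// Accumulate one pair's contribution to the topic-word statistics with
// atomics, so blocks handling different documents but the same word w
// do not overwrite each other's updates.
__device__ void add_contribution(float* E_nkw, float* V_nkw,
                                 int k, int w, int W,
                                 float n_wj, float gamma_jwk)
{
    float e = n_wj * gamma_jwk;
    float v = n_wj * gamma_jwk * (1.0f - gamma_jwk);
    atomicAdd(&E_nkw[k * W + w], e);   // safe under concurrent writers
    atomicAdd(&V_nkw[k * W + w], v);
}
```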
30. Experiments
31. Text mining
    - Articles from Mainichi and Asahi Web news
      - 56,755 docs
      - 40,158 words (applying MeCab + removing stop words)
      - M = 5,053,978 unique doc-word pairs
        - 3,387,822 pairs for training
    - ASUS EN8800GT/HTDP/1G + Core 2 Quad Q9450
    - Evaluating by test data perplexity
32. [Chart: 16 topics; 64 iterations on CPU vs. 64 iterations on GPU.]
33. [Chart: 32 topics; 64 iterations on CPU vs. 64 iterations on GPU.]
34. [Chart: 64 topics; 64 iterations on CPU vs. 64 iterations on GPU.]
35. Image mining
    - 1.5 million tiny images: http://people.csail.mit.edu/torralba/tinyimages/
      - Only first 32,768 images
      - Uniform color quantization: 16x16x16
      - Original image size: 32x32
      - word = (R, G, B, Xpos, Ypos) → 16x16x16x32x32 possible words
      - 30 topics
      - 8 PCs (GeForce GTX 260 for each PC)
        - CUDA + MPICH2 + OpenMP (perplexity computation)
36. Image mining
    - Statistics
      - J = 32,768 docs
      - W = 2,090,223 unique words
      - M = 33,554,432 unique document-word pairs
    - Running time
      - 8,191 sec for 100 iterations
        - LEADTEK WinFast GTX 260 896MB + Core 2 Quad Q9550
    - http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
37. [Figure]
38. Image mining (2nd trial)
    - Statistics
      - J = 201,043 docs
      - W = 509,109 words (8x8x8 uniform quantization)
      - M = 25,733,120 unique document-word pairs
    - Running time
      - 6.9 hours for 100 iterations
        - LEADTEK WinFast GTX 260 896MB + Core 2 Quad Q9550
39. Summary
40. Discussions
    - Larger device memory is better.
      - Reduces data transfer latency between CPU and GPU.
    - A single GPU is not enough for scalability.
      - GPGPU + PC cluster (MPICH2)
        - "fine-grained": topic ↔ thread
        - "coarse-grained": data subset ↔ node
41. Future work
    - Collapsed Gibbs sampling on GPU?
      - Collapsed Gibbs sampling for LDA is too simple to obtain a speed-up from GPGPU.
    - Non-parametric Bayes on GPU?
      - Hierarchical Dirichlet Processes [Teh et al. 06]
        - How to keep topic numbering consistent among different threads?
42. Thank you for your attention! 非常感謝!!! (Thank you very much!)