Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA compatible devices
Transcript

  • 1. Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices. Tomonari MASADA (正田 備也), Nagasaki University, [email_address]
  • 2. Overview
      - What is CVB?
      - Parallelization of CVB for LDA
      - Implementation for GPGPU (GPGPU = Nvidia CUDA compatible devices)
    Tomonari MASADA (IEA-AIE 2009)
  • 3. LDA (latent Dirichlet allocation)
  • 4. latent Dirichlet allocation [Blei et al. 02]
      - Bayesian multi-topic document model
          - multi-topic: a document is a mixture of K topics
          - Bayesian: introducing a prior and obtaining a posterior
  • 5. [Diagram] Each topic t_k is a word multinomial (φ_k1, φ_k2, …, φ_kW) drawn from a symmetric Dirichlet prior Dir(β).
  • 6. [Diagram] Each document d_j has a per-document topic multinomial (θ_j1, θ_j2, θ_j3, …) drawn from a symmetric Dirichlet prior Dir(α).
  • 7. [Diagram] Generative process: each word token in document j is generated by first drawing a topic from θ_j and then drawing a word from that topic's multinomial φ_k.
  • 8. posterior distribution
      - p(x, z, θ, φ | α, β) = Π_j p(θ_j | α) · Π_k p(φ_k | β) · Π_{j,i} p(z_ji | θ_j) p(x_ji | z_ji, φ)
      - p(z, θ, φ | x, α, β) = p(x, z, θ, φ | α, β) / p(x | α, β)
      - x is known; z, θ, φ are unknown.
  • 9. Inference methods for LDA
      - Variational Bayesian inference [Blei et al. 02]: approximating the posterior by a variational method
      - Collapsed Gibbs sampling [Griffiths et al. 04]: marginalizing θ_jk and φ_kw, then sampling z_ji
      - Collapsed variational Bayesian inference (CVB) [Teh et al. 06]: marginalizing θ_jk and φ_kw, then approximating the posterior by a variational method
  • 10. [Diagram] Each word occurrence in each document (e.g. "vote", "party", "prime" in one document; "stock", "ratio", "prime" in another) carries one variational parameter per topic: γ_jw1, γ_jw2, γ_jw3, ….
  • 11. Interpretation of γ_jwk: how strongly word w in document j relates to topic k.
  • 12. Algorithm of CVB (j: doc id, w: word id, k: topic id)
        for each d_j
            for each v_w in d_j
                for each t_k
                    update γ_jwk
      O(MK) time, where M = # of unique doc-word pairs and K = # of topics
  • 13. Updating posterior parameters (j: doc id, w: word id, k: topic id)
        γ_jwk ∝ (α + E[n_jk]) · (β + E[n_kw]) / (Wβ + E[n_k])
                · exp[ −Var[n_jk] / (2(α + E[n_jk])²) − Var[n_kw] / (2(β + E[n_kw])²) + Var[n_k] / (2(Wβ + E[n_k])²) ]
  • 14. Approximation by Gaussian (j: doc id, w: word id, k: topic id)
      - Means and variances:
            E[n_jk] = Σ_w n_wj γ_jwk,    Var[n_jk] = Σ_w n_wj γ_jwk (1 − γ_jwk)
            E[n_kw] = Σ_j n_wj γ_jwk,    Var[n_kw] = Σ_j n_wj γ_jwk (1 − γ_jwk)
            E[n_k] = Σ_{w,j} n_wj γ_jwk,    Var[n_k] = Σ_{w,j} n_wj γ_jwk (1 − γ_jwk)
      - n_jk: # of word tokens that relate to topic k and appear in document j
      - n_kw: # of tokens of word w that relate to topic k
      - n_k: # of word tokens that relate to topic k
  • 15. [Diagram] Memory footprint (j: doc id, w: word id, k: topic id): E[n_jk], Var[n_jk]: O(JK) size; E[n_kw], Var[n_kw]: O(KW) size; E[n_k], Var[n_k]: O(K) size; γ_jwk: O(MK) size.
  • 16. Details of CVB for LDA (j: doc id, w: word id, k: topic id)
        for each d_j
            for each v_w in d_j
                for each t_k
                    1. E[n_jk] −= n_wj · γ_jwk;  Var[n_jk] −= n_wj · γ_jwk (1 − γ_jwk)
                    2. update γ_jwk
                    3. E[n_jk] += n_wj · γ_jwk;  Var[n_jk] += n_wj · γ_jwk (1 − γ_jwk)
      Update the other two types of E[·] and Var[·] in a similar manner.
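The subtract/update/add bookkeeping above can be sketched as two small helpers. This is an illustrative sketch only (`K`, `subtract_token`, `add_token` are hypothetical names), shown for E[n_jk] and Var[n_jk]; the other two statistics are maintained the same way.

```c
/* Incremental maintenance of per-topic means and variances:
 * remove a token's current contribution before updating its gamma,
 * then add the new contribution back. */
#define K 3   /* number of topics (small, for illustration) */

void subtract_token(double E[K], double V[K],
                    double n_wj, const double gamma[K])
{
    for (int k = 0; k < K; k++) {
        E[k] -= n_wj * gamma[k];                      /* E[n_jk]   -= n_wj * gamma_jwk */
        V[k] -= n_wj * gamma[k] * (1.0 - gamma[k]);   /* Var[n_jk] -= n_wj * g * (1-g) */
    }
}

void add_token(double E[K], double V[K],
               double n_wj, const double gamma[K])
{
    for (int k = 0; k < K; k++) {
        E[k] += n_wj * gamma[k];
        V[k] += n_wj * gamma[k] * (1.0 - gamma[k]);
    }
}
```

Subtracting and re-adding the same gamma leaves the statistics unchanged, which is a handy invariant for testing the bookkeeping.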
  • 17. Parallelization of CVB for LDA: "as many threads as topics"
  • 18. Parallelization of CVB
        for each d_j            ← conventional parallelization (over documents)
            for each v_w in d_j
                for each t_k    ← proposed parallelization (over topics)
                    update γ_jwk
  • 19. Strategy: "different topics for different threads"
      - Each thread computes one topic's unnormalized value of the slide-13 update:
            γ_jwk ∝ (α + E[n_jk]) · (β + E[n_kw]) / (Wβ + E[n_k])
                    · exp[ −Var[n_jk] / (2(α + E[n_jk])²) − Var[n_kw] / (2(β + E[n_kw])²) + Var[n_k] / (2(Wβ + E[n_k])²) ]
      - Since γ_jw1 + γ_jw2 + ⋯ + γ_jwK = 1, normalization across threads is required!
      - O(MK) → O(M log K)
  • 20. [Diagram] Reduction for normalization: summing the K unnormalized values in O(log K) parallel steps.
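The O(log K) normalization step can be illustrated by a sequential simulation of a tree reduction. This is not actual device code: on the GPU the inner loop over `t` runs as K parallel threads over shared memory, with a barrier between strides; `tree_reduce` is a hypothetical name.

```c
/* Sequential simulation of a log2(K)-step tree reduction:
 * at each stride, "thread" t (t < stride) adds buf[t + stride] into
 * buf[t]; after log2(K) strides the total sum sits in buf[0]. */
double tree_reduce(double buf[], int k)   /* k must be a power of two */
{
    for (int stride = k / 2; stride > 0; stride /= 2)
        for (int t = 0; t < stride; t++)  /* on the GPU: parallel threads */
            buf[t] += buf[t + stride];
    return buf[0];
}
```

Each thread then divides its own unnormalized γ value by buf[0], completing the per-token update in O(log K) rather than O(K) sequential steps.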
  • 21. Related Work
  • 22. A new approach (not a parallelization)
      - Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation [Porteous et al. KDD 2008]
      - Algorithmic acceleration of collapsed Gibbs sampling for LDA: complexity is not proportional to # of topics.
  • 23. Accelerating CVB for LDA by GPGPU
  • 24. Nvidia CUDA (Compute Unified Device Architecture)
        [Diagram] A grid of thread blocks; each block has its own shared memory, each thread its own registers, and all blocks access device memory. Documents map to blocks and topics to threads.
  • 25. [Diagram] Device memory access latency: device memory is slow to access; each block's 16KB shared memory is fast.
  • 26. [Diagram] Data transfer latency between host memory and device memory: transfer one large block instead of many smaller ones!
  • 27. [Diagram] Parameters of the approximated posterior (j: doc id, w: word id, k: topic id): E[n_jk], Var[n_jk]: O(JK) size; E[n_kw], Var[n_kw]: O(KW) size; E[n_k], Var[n_k]: O(K) size.
  • 28. Where to store?
      - Posterior parameters γ_jwk: O(K) size per token → registers
      - E[n_jk], Var[n_jk]: O(K) size for a fixed doc → shared memory (for summation)
      - E[n_kw], Var[n_kw]: O(KW) size → device memory
      - E[n_k], Var[n_k]: O(K) size → registers
  • 29. [Diagram] Write conflicts: thread blocks working on different documents j and j′ update the same O(KW)-sized arrays E[n_kw], Var[n_kw] through γ_jwk and γ_j′wk.
  • 30. Experiments
  • 31. Text mining
      - Articles from Mainichi and Asahi Web news
          - 56,755 docs
          - 40,158 words (applying MeCab + removing stop words)
          - M = 5,053,978 unique doc-word pairs (3,387,822 pairs for training)
      - Hardware: ASUS EN8800GT/HTDP/1G + Core 2 Quad Q9450
      - Evaluating by test data perplexity
  • 32. [Chart] Test data perplexity, 16 topics: 64 iterations on CPU vs. 64 iterations on GPU.
  • 33. [Chart] Test data perplexity, 32 topics: 64 iterations on CPU vs. 64 iterations on GPU.
  • 34. [Chart] Test data perplexity, 64 topics: 64 iterations on CPU vs. 64 iterations on GPU.
  • 35. Image mining
      - 1.5 million tiny images: http://people.csail.mit.edu/torralba/tinyimages/
          - Only first 32,768 images
          - Uniform color quantization: 16x16x16
          - Original image size: 32x32
          - word = (R, G, B, Xpos, Ypos) → 16x16x16x32x32 possible words
          - 30 topics
          - 8 PCs (GeForce GTX260 for each PC): CUDA + MPICH2 + OpenMP (perplexity computation)
  • 36. Image mining
      - Statistics: J = 32,768 docs; W = 2,090,223 unique words; M = 33,554,432 unique document-word pairs
      - Running time: 8,191 sec for 100 iterations (LEADTEK WinFast GTX 260 896MB + Core 2 Quad Q9550)
      - http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
  • 37. [Image slide]
  • 38. Image mining (2nd trial)
      - Statistics: J = 201,043 docs; W = 509,109 words (8x8x8 uniform quantization); M = 25,733,120 unique document-word pairs
      - Running time: 6.9 hours for 100 iterations (LEADTEK WinFast GTX 260 896MB + Core 2 Quad Q9550)
  • 39. Summary
  • 40. Discussions
      - Larger device memory is better: it reduces data transfer latency between CPU and GPU.
      - A single GPU is not enough for scalability: combine GPGPU with a PC cluster (MPICH2).
          - "fine-grained": topic ↔ thread
          - "coarse-grained": data subset ↔ node
  • 41. Future work
      - Collapsed Gibbs sampling on GPU? Collapsed Gibbs sampling for LDA is too simple to obtain speed-up by GPGPU.
      - Non-parametric Bayes on GPU? Hierarchical Dirichlet Processes [Teh et al. 06]: how to keep topic numbering consistent among different threads?
  • 42. Thank you for your attention! 非常感謝 (Thank you very much)!!!
