
PyDataTokyo 2015-05-22

Published in: Technology

  1. PYDATA TOKYO 2015-05-22: LDA IN PYTHON
     Wednesday, June 3, 15
  2. WHO
     • Yuta Kashino (柏野 雄太), BakFoo, Inc. (バクフー株式会社)
     • PPPP (P4) for large-scale real-time data
     • preprocess / process / persistence / providing
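  3. WHAT IS LDA
     • Latent Dirichlet Allocation
     • Finds "topics" in a collection of documents without supervision
     • Topic: a group of words
     • Each topic has a distribution over words
     • Each document has a distribution over topics
     • [Figure: per-topic word distribution $\phi_k$ over words w1 to w4; per-document topic distribution $\theta_d$ over topics z1, z2, z3]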
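  4. WHAT IS LDA
     • Graphical model (plates N, M, K):
       $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
       $\phi_k \sim \mathrm{Dirichlet}(\beta)$
       $z_{d,i} \sim \mathrm{Multi}(\theta_d)$
       $w_{d,i} \sim \mathrm{Multi}(\phi_{z_{d,i}})$
     • [Figure: plate diagram; per-document topic distribution $\theta_d$ over topics z1, z2, z3; per-topic word distribution $\phi_k$ over words w1 to w4]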
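
The generative process in the graphical model on slide 4 can be sketched in plain Python. This is only an illustration, not code from the talk: the function names, the symmetric priors `alpha` and `beta`, the fixed document length, and the 0-based word ids are all assumptions made here.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw index i with probability probs[i] (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample documents following the LDA generative process (hypothetical helper)."""
    rng = random.Random(seed)
    # phi_k ~ Dirichlet(beta): one word distribution per topic
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # theta_d ~ Dirichlet(alpha): one topic distribution per document
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)    # z_{d,i} ~ Multi(theta_d)
            w = sample_categorical(phi[z], rng)   # w_{d,i} ~ Multi(phi_{z_{d,i}})
            words.append(w)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi


docs, thetas, phi = generate_corpus(num_docs=3, doc_len=10, num_topics=2, vocab_size=5)
print(docs[0])
```

Inference (variational Bayes or Gibbs sampling) runs this process in reverse: given only the words, it estimates the topic and document distributions.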
  5. WHAT IS LDA
     • What does it do? It clusters words by topic.
     • [Figure: a document collection (w1 w2 w3 w4 w5 w6 w7 w8...) is fed into LDA, which outputs per-topic word distributions (topics k1 to k4) and per-document topic distributions $\theta_d$]
  6. WHAT IS LDA
     • Pipeline: tokenize → vectorizing → modeling
     • tokenize: builds the word dictionary (word id: word), e.g. 1: 政治 (politics), 2: 自民 (LDP), 3: 総理 (prime minister)
     • vectorizing: builds the BoW corpus, one list of (word id, freq) pairs per document, e.g. [(1, 2), (3, 2), …], [(1, 19), (4, 1), …], ...
     • modeling: Variational Bayes or Gibbs sampling
     • [Figure: document collection → LDA → per-topic word distributions and per-document topic distributions; words are clustered by topic]
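
The first two pipeline stages on slide 6 (tokenize, then vectorize into a bag-of-words corpus) can be sketched as follows. The helper names and the toy English documents are assumptions for illustration. Whitespace splitting is only a stand-in: Japanese text would need a morphological analyzer for tokenization.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_dictionary(tokenized_docs):
    """Map each distinct word to an integer id, starting at 1 as on the slide."""
    word_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_to_id:
                word_to_id[token] = len(word_to_id) + 1
    return word_to_id


def to_bow(tokens, word_to_id):
    """Turn one tokenized document into a sorted list of (word id, freq) pairs."""
    counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
    return sorted(counts.items())


raw_docs = ["politics party politics minister minister", "politics budget"]
tokenized = [tokenize(d) for d in raw_docs]
dictionary = build_dictionary(tokenized)
corpus = [to_bow(t, dictionary) for t in tokenized]

print(dictionary)  # {'politics': 1, 'party': 2, 'minister': 3, 'budget': 4}
print(corpus)      # [[(1, 2), (2, 1), (3, 2)], [(1, 1), (4, 1)]]
```

The resulting `corpus` has the same shape as the `[(word id, freq)…]` lists on the slide and is what the modeling stage consumes.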
  7. LDA IN PYTHON 1/7
     • lda-c, Blei et al. 2003
     • https://www.cs.princeton.edu/~blei/lda-c/index.html
     • Implementation: C
     • Model: variational Bayes
     • Where it all began; fixed corpus, fixed dictionary
  8. LDA IN PYTHON 2/7
     • onlineldavb.py, Hoffman et al. 2010
     • http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar
     • Model: variational Bayes EM
     • Online LDA
     • Memory-efficient, but slow
  9. LDA IN PYTHON 3/7
     • gensim
     • http://radimrehurek.com/gensim/
     • Wraps the Python online LDA of Hoffman et al.; also implements LSI
     • Distributed processing is possible via Pyro
     • Generally slow; the dictionary and corpus cannot be updated
  10. LDA IN PYTHON 4/7
     • Vowpal_Wabbit w/ pyvw
     • Hoffman himself reimplemented onlineldavb.py in C++
     • Extremely fast
     • Usable from Python via pyvw
     • However, the input/output file format is unusual
  11. LDA IN PYTHON 5/7
     • lda
     • http://pythonhosted.org/lda/
     • scikit-learn-like interface
     • Collapsed Gibbs sampling
     • Has the feel of a homegrown LDA…
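
Collapsed Gibbs sampling, the inference method named on slide 11, can be sketched in pure Python as below. This is a minimal textbook-style sampler written for illustration, not the code of the `lda` package; the function name, the symmetric priors, and the 0-based word ids are assumptions.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=50, seed=0):
    """Run collapsed Gibbs sampling; docs is a list of lists of word ids (0-based)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Initialize every token with a random topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3]]
doc_topic, topic_word = collapsed_gibbs_lda(toy_docs, num_topics=2, vocab_size=5, iters=20)
print(doc_topic)
print(topic_word)
```

Normalizing the rows of `doc_topic` (plus `alpha`) and `topic_word` (plus `beta`) gives estimates of the per-document topic distributions and per-topic word distributions.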
  12. LDA IN PYTHON 6/7
     • dato graphlab
     • https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html
     • C++ w/ Python interface
     • Collapsed Gibbs sampling
     • Can be parallelized using graphlab's own machinery
  13. LDA IN PYTHON 7/7
     • A large number of homegrown implementations
     • I have also implemented an online LDA that can update the dictionary and corpus
     • https://bitbucket.org/yutakashino/nhkssf4w/src/cd1ffc7f46ce/streamlda/?at=master
  14. LDA IN X
     • MALLET (Java): also includes hierarchical LDA
     • Stanford Topic Modeling Toolbox (Scala): LDA and Labeled LDA, usable from Excel
     • Wang & Blei 2009 class-slda (C++)
     • GibbsLDA++ (C++)
     • Multithreaded LDA (C)
     • ...
