
Treasure Data Summer Internship Final Report

Published 2015 in: Science | Machine Learning | Online Factorization Machine | Online Latent Dirichlet Allocation (Mini-Batch)

  1. Summer Internship Final Report. Naoki Ishikawa (@NeokiStones), 2015/09/30 13:30-
  2. Who am I • Naoki Ishikawa • Waseda University, Information Science, M1 • Research: Evolutionary Computation / Reinforcement Learning • Laboratory: Sugawara Lab • Laboratory theme: Artificial Intelligence
  3. Table of contents • Implemented Algorithms • Factorization Machine • Latent Dirichlet Allocation
  5. Factorization Machine • Algorithm for recommendation • Classification (clustering) • Regression • Supervised learning: needs input/output data • Suitable for sparse data
  6. Application
  7. Application • Prediction of Movie Rating • Task: predict a movie rating (a real number) • Regression: input is a self-designed matrix, output is a rating vector
  8. Prediction of Movie Rating: Input and Output
  9. Input Details • Identifiers: user identifier [0, 0, …, 0, 1, 0, …, 0], movie identifier [0, 0, …, 0, 0, 1, 0, …, 0] • Designed features: ratings of other movies, time, last movie rated
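The input row described above (one-hot user block, one-hot movie block, then designed features) can be sketched as follows. This is a minimal illustration; the index layout, function name, and example feature values are assumptions, not the internship code.

```python
# Sketch of the self-designed input row for FM (illustrative assumptions only).
def build_feature_row(user_id, movie_id, n_users, n_movies, extra_features):
    """One-hot user block + one-hot movie block + designed features."""
    row = [0.0] * (n_users + n_movies + len(extra_features))
    row[user_id] = 1.0                          # user identifier block
    row[n_users + movie_id] = 1.0               # movie identifier block
    row[n_users + n_movies:] = extra_features   # e.g. time, last movie rated
    return row

# Example: user 2 of 3, movie 1 of 4, two designed features.
x = build_feature_row(user_id=2, movie_id=1, n_users=3, n_movies=4,
                      extra_features=[0.5, 13.0])
```

Most entries are zero, which is why FM's suitability for sparse data matters here.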
  10. Recommendation Algorithms • Collaborative filtering • Association analysis • Bayesian networks
  11. Prediction of Movie Rating • Hivemall • Matrix factorization • Recommendation
  12. Difference from Matrix Factorization • Data structure • Matrix factorization: user-item matrix as input, factor matrices as learning parameters (figure: http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png)
  13. Difference from Matrix Factorization • Factorization Machine (figure: feature vector as input; w_k and v_k as learning parameters)
  14. Advantage of Factorization Machine • Considers context data • Considers interactions between variables
  15. Difference from Matrix Factorization • Prediction by Factorization Machine (d = 2): ŷ(x) = w0 + Σ_k w_k x_k + Σ_{j<k} ⟨v_j, v_k⟩ x_j x_k
  16. Difference from Matrix Factorization • In the formula: w0 is the global bias, w_k is the regression coefficient of the k-th variable, and the interaction weight is factorized as ⟨v_j, v_k⟩
  17. Difference from Matrix Factorization • Learning method: stochastic gradient descent (SGD)
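The d = 2 prediction can be sketched in Python using the standard O(Kn) reformulation of the pairwise term, 0.5 · Σ_f ((Σ_k v_kf x_k)² − Σ_k v_kf² x_k²). This is a minimal illustration, not Hivemall's implementation; the function and variable names are assumptions.

```python
import numpy as np

# FM prediction for d = 2 (sketch):
# y(x) = w0 + sum_k w_k x_k + sum_{j<k} <v_j, v_k> x_j x_k
def fm_predict(x, w0, w, V):
    """x: (n,) features, w0: bias, w: (n,) weights, V: (n, K) factors."""
    linear = w0 + w @ x
    s = V.T @ x                    # (K,) per-factor sums over features
    s2 = (V ** 2).T @ (x ** 2)     # (K,) per-factor sums of squares
    return linear + 0.5 * np.sum(s * s - s2)

# Tiny example: 2 features, K = 2 factors.
x = np.array([1.0, 2.0])
w = np.array([0.5, 0.5])
V = np.array([[1.0, 1.0],
              [1.0, 1.0]])
y = fm_predict(x, 0.1, w, V)
```

The reformulation avoids the explicit double loop over feature pairs, which is what makes FM practical on the sparse one-hot inputs above.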
  18. Local Implementation
  19. Difference from Matrix Factorization • d-way interactions • Both FM and MF assume K latent attributes • Matrix Factorization: d = 2 • Factorization Machine: d ≥ 2
  20. Hyperparameters • K: the number of hidden factors • η: the regularization parameter
  21. Implemented Model • d = 2 • MapModel • ArrayModel
  22. Implemented Model • MapModel • For unknown data • Flexible • Suitable for online learning
  23. Implemented Model • ArrayModel • For known data • Less overhead
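The MapModel/ArrayModel distinction can be sketched as two parameter stores: one that creates factor vectors lazily for never-seen features, one that preallocates for a known feature count. The class names follow the slides; the internals are assumptions, not the actual internship code.

```python
import random

class MapModel:
    """Dict-backed store: flexible, handles unknown features (online learning)."""
    def __init__(self, K):
        self.K, self.V = K, {}
    def factors(self, feature):
        if feature not in self.V:   # new feature: initialize factors on demand
            self.V[feature] = [random.gauss(0.0, 0.01) for _ in range(self.K)]
        return self.V[feature]

class ArrayModel:
    """Array-backed store: feature count known up front, less per-lookup overhead."""
    def __init__(self, K, n_features):
        self.V = [[random.gauss(0.0, 0.01) for _ in range(K)]
                  for _ in range(n_features)]
    def factors(self, feature):
        return self.V[feature]      # feature is an integer index
```

The trade-off: the map pays hashing and resizing costs but never needs the vocabulary of features in advance; the array is cheaper but only works when every feature index is known before training.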
  24. Other Use Case • E-commerce user-item recommendation • Input data: age, purchase time zone, previously bought items, cluster ID • Target data: a user's evaluation of an item
  25. Table of contents • Implemented Algorithms • Factorization Machine • Latent Dirichlet Allocation
  26. Latent Dirichlet Allocation • The most popular topic-model algorithm • Mostly applied to text data • Finds the hidden structure of data • Unsupervised learning: needs input data only • Generative model
  27. Latent Dirichlet Allocation • Generative modelling in LDA • Mimics how a document is generated: 1. choose what to write about (a topic) 2. choose words from that topic 3. write
  28. Latent Dirichlet Allocation • Input: text data (documents) • Output: topic-word distribution and document-topic distribution
  29. Latent Dirichlet Allocation (figures: https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg, https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/)
  30. Learning Method • Define a generative model for documents • Learn the parameters that reproduce the documents
  31. Learning Method • K topics
  32. Learning Method (figure: http://heartruptcy.blog.fc2.com/blog-entry-124.html)
  33. Graphical Model (Code) • for each Topic k = 1, …, K: WordDistribution[k] ~ Dir(β) • for each Document d = 1, …, D: TopicDistribution[d] ~ Dir(α); for each Word n = 1, …, numOfWord[d]: WordTopic[d][n] ~ TopicDistribution[d], Word[d][n] ~ WordDistribution[WordTopic[d][n]]
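The generative process above can be made runnable with NumPy. Sizes, hyperparameters, and the random seed below are illustrative assumptions; words are represented as vocabulary indices.

```python
import numpy as np

# Runnable sketch of LDA's generative process (illustrative sizes).
rng = np.random.default_rng(0)
K, D, V = 3, 2, 5                 # topics, documents, vocabulary size
alpha, beta = 0.1, 0.01
n_words = [4, 6]                  # numOfWord[d]

word_dist = rng.dirichlet([beta] * V, size=K)   # WordDistribution[k] ~ Dir(beta)
docs = []
for d in range(D):
    topic_dist = rng.dirichlet([alpha] * K)     # TopicDistribution[d] ~ Dir(alpha)
    words = []
    for n in range(n_words[d]):
        z = rng.choice(K, p=topic_dist)         # WordTopic[d][n] ~ TopicDistribution[d]
        w = rng.choice(V, p=word_dist[z])       # Word[d][n] ~ WordDistribution[z]
        words.append(int(w))
    docs.append(words)
```

Learning inverts this process: given only `docs`, recover `word_dist` and the per-document topic distributions.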
  34. Learning Methods • Variational Bayes • Gibbs sampling (MCMC) • Particle filtering
  35. Learning Methods • Variational Bayes: faster than Gibbs sampling • Gibbs sampling (MCMC) • Particle filtering
  36. Mini-batch Online LDA • Faster than the batch algorithm • Less noise than pure online LDA • (figure: pure online → mini-batch online → batch, ordered by batch size)
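The mini-batch update can be sketched as the usual online variational Bayes step for LDA: each mini-batch produces an estimate of the topic-word parameter lambda, which is blended into the running value with a decaying step size. Function and parameter names here are assumptions, not the internship code.

```python
import numpy as np

# Sketch of one mini-batch online LDA parameter update (assumed names).
def online_lda_step(lam, lambda_hat, t, tau0=1.0, kappa=0.7):
    """Blend the mini-batch estimate lambda_hat into lambda.

    lam:        current (K, V) topic-word parameter
    lambda_hat: estimate computed from mini-batch t alone
    rho:        decaying step size; kappa in (0.5, 1] for convergence
    """
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lambda_hat
```

With batch size 1 this is pure online LDA (noisy, since each lambda_hat comes from a single document); larger mini-batches average out that noise while still updating far more often than a full batch pass.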
  37. Implemented Model • Mini-Batch Map Model • For unknown data • Doesn't assume a vocabulary list • Mini-Batch Array Model (other implementation) • For known data • Assumes a vocabulary list
  39. Faced Implementation Problem • Meaningless words • LDA clusters words by co-occurrence • "a", "the", "I", "he", "is", "in", "on" • Stop words: ignore them • TF-IDF: how important a word is to a document in a collection or dataset
  41. Faced Implementation Problem • TF-IDF can be calculated with Hivemall • Input data: (DocId, Words) • https://github.com/myui/hivemall/wiki/TFIDF-calculation
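The TF-IDF weighting used to filter meaningless words can be sketched in a few lines (Hivemall computes it in HiveQL per the linked wiki page; this standalone version uses the plain tf × log(N/df) formulation, which is an assumption about the exact variant used).

```python
import math
from collections import Counter

# Minimal TF-IDF: tf(w, d) * log(N / df(w)), no smoothing (assumed variant).
def tf_idf(docs):
    """docs: list of token lists -> list of {word: score} dicts."""
    N = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(N / df[w])
                       for w, c in tf.items()})
    return scores

scores = tf_idf([["justice", "law", "the"], ["the", "cat"]])
```

Words appearing in every document (like "the") score zero, which is exactly the stop-word-like filtering effect the slides rely on.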
  42. Faced Implementation Problem • TF-IDF • 1 ["justice:0.1641245850805637", "found:0.06564983513276658", "discussion:0.06564983513276658", "law:0.06564983513276658", "based:0.06564983513276658", "religion:0.06564983513276658", "viewpoints:0.03282491756638329", "rationality:0.03282491756638329", "including:0.03282491756638329", "context:0.03282491756638329", "concept:0.03282491756638329", "rightness:0.03282491756638329", "general:0.03282491756638329", "many:0.03282491756638329", "differing:0.03282491756638329", "fairness:0.03282491756638329", "social:0.03282491756638329", "broadest:0.03282491756638329", "equity:0.03282491756638329", "includes:0.03282491756638329", "theology:0.03282491756638329"]
  43. Faced Implementation Problem • Vocabulary List Model • Initializes lambda for every word up front • If a word does not appear in the document, its lambda decreases at the same rate • No initialization problem
  44. Faced Implementation Problem • Online Map Model • Initializes lambda when a new word is fetched • The final lambda depends on when the word first appeared • Initialization problem
  45. Faced Implementation Problem • Prepared dummy lambdas • Initialize the dummy lambdas up front • Apply the lambda update rule to the dummy lambdas
  46. Faced Implementation Problem • Implicit Φ normalization • The normalization of Φ is not written explicitly
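The implicit normalization refers to the variational update for the per-word topic responsibilities: φ is only defined up to proportionality, φ_dwk ∝ exp(E[log θ_dk] + E[log β_kw]), and must be normalized over topics. A sketch, assuming the expected log parameters have already been computed (in full online LDA they come from digamma functions of γ and λ):

```python
import numpy as np

def update_phi(Elogtheta_d, Elogbeta_w):
    """phi_k proportional to exp(Elogtheta_dk + Elogbeta_kw), normalized over k.

    Elogtheta_d: (K,) expected log topic proportions for document d
    Elogbeta_w:  (K,) expected log probabilities of word w under each topic
    """
    log_phi = Elogtheta_d + Elogbeta_w
    phi = np.exp(log_phi - log_phi.max())   # subtract max for numerical stability
    return phi / phi.sum()                  # the normalization the paper leaves implicit
```

Forgetting this normalization step (or applying it at the wrong point) leaves γ and λ updates working with unscaled responsibilities, which matches the debugging pain described in these slides.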
  49. Faced Implementation Problem • Difficult debugging • Circular dependence among Φ, γ, and β
  50. Result: Online LDA • Data: 20 Newsgroups • Topics: 6 • Iterations: 10
  51. Result: Online LDA • Topic 1 (Sports) • No.0 writes[6]: 0.007909349 • No.1 article[7]: 0.006535292 • No.2 apr[3]: 0.0034389505 • No.3 team[4]: 0.00340712 • No.4 game[4]: 0.0033219245 • No.5 year[4]: 0.0032751847 • No.6 good[4]: 0.0032546786 • No.7 time[4]: 0.0030503264 • No.8 play[4]: 0.00262638 • No.9 games[5]: 0.002433915 • No.10 season[6]: 0.0022433712 • No.11 ll[2]: 0.0020719478 • No.12 players[7]: 0.0020332362 • No.13 win[3]: 0.0019284738 • No.14 hockey[6]: 0.0018870989 • No.15 league[6]: 0.0018450991 • No.16 baseball[8]: 0.0018226414 • No.17 years[5]: 0.0017960512 • No.18 mail[4]: 0.0017936684 • No.19 people[6]: 0.0017642054 • No.20 teams[5]: 0.0016675185 • No.21 great[5]: 0.001642102 • No.22 ve[2]: 0.0015846819 • No.23 point[5]: 0.0015730233 • No.24 cs[2]: 0.0015609838 • No.25 didn[4]: 0.0015398773 • No.26 lot[3]: 0.0015123658 • No.27 mike[4]: 0.0014935194 • No.28 university[10]: 0.0014718652 • No.29 player[6]: 0.0014655796
  53. Result: Online LDA • Topic 3 (Computer) • No.0 writes[6]: 0.0065424195 • No.1 article[7]: 0.005621346 • No.2 apr[3]: 0.002746017 • No.3 work[4]: 0.002731466 • No.4 good[4]: 0.00266331 • No.5 ve[2]: 0.0025969497 • No.6 time[4]: 0.0025880735 • No.7 system[6]: 0.0024449623 • No.8 problem[7]: 0.002349667 • No.9 mail[4]: 0.0023234019 • No.10 windows[7]: 0.0021310966 • No.11 people[6]: 0.0018598152 • No.12 find[4]: 0.0018072439 • No.13 computer[8]: 0.0017470584 • No.14 email[5]: 0.0017204053 • No.15 drive[5]: 0.0017121765 • No.16 bit[3]: 0.0016401116 • No.17 program[7]: 0.001636191 • No.18 software[8]: 0.0016341405 • No.19 university[10]: 0.0015907411 • No.20 ll[2]: 0.0015530549 • No.21 thing[5]: 0.0015159848 • No.22 card[4]: 0.0013826761 • No.23 doesn[5]: 0.0013809163 • No.24 phone[5]: 0.0013786326 • No.25 question[8]: 0.0013721529 • No.26 internet[8]: 0.001368883 • No.27 file[4]: 0.0013417117 • No.28 things[6]: 0.0013097903 • No.29 set[3]: 0.0013029057
  55. Impressions of the Internship • Machine learning: implementing ML algorithms from scratch was fun • Contributing to OSS was a precious experience for me
  56. Unfinished Business • Documentation: write entries for FM / Online LDA • UDTF: build the functions into Hivemall
  57. Thank you for listening
