
AI at Stitch Fix 2017

Chris Moody: interpretability & variational methods


  1. AI at Stitch Fix: Interpretability & Variational Methods. Christopher Moody @ Stitch Fix
  2. About @chrisemoody: Caltech Physics PhD in astrostats & supercomputing; sklearn t-SNE contributor; Insight Fellow; Stitch Fix Algorithms Team; github.com/cemoody. Interests: Gaussian Processes, t-SNE, chainer, deep learning, Tensor Decomposition
  3. Part 1: Interpreting my model
  4. Three tools: 1. t-SNE, 2. k-SVD, 3. lda2vec
  5. Model: Doc vector. “ITEM_92 I think this fabric is wonderful (rayon & spandex). like the lace/embroidery accents” Co-occurrence modeling
  6. Model: Doc vector. Co-occurrence modeling. “ITEM_92 I think this fabric is wonderful (rayon & spandex). like the lace/embroidery accents” Embed both words & items in the same space!
  7–15. Model: Doc vector. “ITEM_92 think fabric wonderful rayon spandex like lace embroidery accents” Co-occurrence modeling: slide through the review, and for the center token c and each context word w, increment X[c, w] += 1
  16–20. Model: Doc vector. “ITEM_92 think fabric wonderful rayon spandex like lace embroidery accents” Co-occurrence modeling: after passing over all the windows, X[c, w] = count
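
A minimal sketch of the counting step in slides 7–20, assuming whitespace-tokenized reviews where the first token is the item identifier and every remaining token is a context word (the tokenization and window here are illustrative, not the exact Stitch Fix pipeline):

    from collections import defaultdict

    def cooccurrence_counts(reviews):
        # X[c][w] counts how often word w shows up in a review about item c
        X = defaultdict(lambda: defaultdict(int))
        for review in reviews:
            tokens = review.lower().split()
            c, words = tokens[0], tokens[1:]   # e.g. "item_92", ["think", "fabric", ...]
            for w in words:
                X[c][w] += 1
        return X

    X = cooccurrence_counts(["ITEM_92 think fabric wonderful rayon spandex like lace embroidery accents"])
    print(X["item_92"]["fabric"])   # -> 1
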
  21–22. Model: Doc vector. Co-occurrence modeling: log(X[c, w]) = r[c] + r[w] + c · w, where X is known and r[c], r[w], and the vectors c, w are unknown. See also: ‘GloVe vectors’ http://nlp.stanford.edu/projects/glove/
  23. Model: Doc vector. Co-occurrence modeling: log(X[like, embroidery]) = r[like] + r[embroidery] + like · embroidery. How frequent is “like”? See also: ‘GloVe vectors’ http://nlp.stanford.edu/projects/glove/
  24. Model: Doc vector. Co-occurrence modeling: log(X[spandex, rayon]) = r[spandex] + r[rayon] + spandex · rayon. How similar are spandex & rayon? See also: ‘GloVe vectors’ http://nlp.stanford.edu/projects/glove/
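
A hedged sketch of fitting the model above, log X[c, w] ≈ r[c] + r[w] + c · w, with plain gradient steps in PyTorch. The variable names and the unweighted squared-error objective are illustrative; real GloVe training also weights pairs by frequency:

    import torch

    n_tokens, n_dim = 1000, 32
    vectors = torch.randn(n_tokens, n_dim, requires_grad=True)   # one vector per word/item
    biases  = torch.zeros(n_tokens, requires_grad=True)          # the r[.] frequency terms
    opt = torch.optim.Adam([vectors, biases], lr=0.05)

    def loss_fn(c_idx, w_idx, counts):
        # predict the log co-occurrence count from two biases plus a dot product
        pred = biases[c_idx] + biases[w_idx] + (vectors[c_idx] * vectors[w_idx]).sum(dim=1)
        return ((pred - counts.log()) ** 2).mean()

    # one illustrative step on a toy batch of (c, w, count) triples
    c_idx, w_idx = torch.tensor([0, 0, 1]), torch.tensor([5, 7, 5])
    counts = torch.tensor([3.0, 1.0, 2.0])
    opt.zero_grad(); loss_fn(c_idx, w_idx, counts).backward(); opt.step()
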
  25–29. Model: Interpret: Doc vector, t-SNE. See also: ‘How to use t-SNE effectively’ http://distill.pub/2016/misread-tsne/
  30–31. Model: Interpret: Doc vector, t-SNE. This distance means nothing! See also: ‘How to use t-SNE effectively’ http://distill.pub/2016/misread-tsne/
  32. Model: Interpret: Doc vector, t-SNE. More exotic skinnies? See also: ‘How to use t-SNE effectively’ http://distill.pub/2016/misread-tsne/
  33. Model: Interpret: Doc vector, t-SNE. More colorful jeans? Lighter jeans? See also: ‘How to use t-SNE effectively’ http://distill.pub/2016/misread-tsne/
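
Maps like the ones referenced above come from running t-SNE on the learned vectors; a minimal scikit-learn sketch, where item_vectors is a hypothetical (n_items, n_dim) array standing in for the doc/item vectors:

    import numpy as np
    from sklearn.manifold import TSNE

    item_vectors = np.random.randn(500, 64)            # stand-in for the learned vectors
    xy = TSNE(n_components=2, perplexity=30.0).fit_transform(item_vectors)
    # xy can now be scatter-plotted; per the distill.pub article, cluster sizes and
    # between-cluster distances in the 2D map should not be read too literally.
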
  34–37. Model: Interpret: Doc vector, Linearities. See also: ‘A word is worth a thousand vectors’ http://multithreaded.stitchfix.com
  38. Model: Interpret: Doc vector, Linearities. But… what are my model’s ‘directions’? See also: ‘A word is worth a thousand vectors’ http://multithreaded.stitchfix.com
  39–40. Model: Interpret: Doc vector, k-SVD. A doc vector (shown as -1, -1, -5, …) is rewritten after k-SVD as a sparse sum of dictionary atoms: ≈ +0.53 × (Atom 23, “Tank top”) + 0.16 × (Atom 95, “Exposed shoulder”) + … See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
  41. Model: Interpret: Doc vector, k-SVD. = +0.53 “Tank top” + 0.16 “Exposed shoulder”. See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
  42. Model: Interpret: Doc vector, k-SVD. “Dress” (Atom 1). See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
  43. Model: Interpret: Doc vector, k-SVD. “Urban/bohemian” (Atom 40). See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
  44. Model: Interpret: Doc vector, k-SVD. “Statement pieces” (Atom 22). See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
  45. Model: Interpret: Doc vector, k-SVD. “Ring & Drop Earrings” (Atom 75). See also: ‘Decoding the thought vector’ http://gabgoh.github.io/ThoughtVectors/
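
The decomposition above rewrites each vector as a sparse combination of a few named atoms. scikit-learn does not ship k-SVD itself, so this sketch uses MiniBatchDictionaryLearning as a convenient stand-in rather than the exact algorithm; atom labels like “Tank top” would come from inspecting which words/items load most heavily on each atom afterwards:

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    item_vectors = np.random.randn(500, 64)                   # stand-in for learned vectors
    dl = MiniBatchDictionaryLearning(n_components=100,        # number of atoms
                                     transform_n_nonzero_coefs=5)  # sparsity per item
    codes = dl.fit_transform(item_vectors)                    # (500, 100) mostly-zero loadings
    atoms = dl.components_                                    # (100, 64) dictionary atoms
    # item_vectors[i] is approximately codes[i] @ atoms; the few nonzero entries of
    # codes[i] say which atoms (and how strongly) that item loads on.
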
  46. Model: Interpret: lda2vec. word2vec + LDA = lda2vec. See also: ‘lda2vec’ http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
  47. Part 2: Variational Methods …or what my model doesn’t know.
  48. Variational Methods. Practical reasons to go variational: 1. Alternative regularization. 2. Measure what your model doesn’t know. 3. Help explain your data.
  49–50. Variational Methods. Practical reasons to go variational: 1. Alternative regularization. 2. Measure what your model doesn’t know. 3. Help explain your data. 4. Short & fits in a tweet!
  51. Variational Word Vectors. log(X[c, w]) = r[c] + r[w] + c · w. How similar are c & w? How frequent is c? How frequent is w?
  52. Variational Word Vectors. log(X[c, w]) = r[c] + r[w] + c · w. Let’s make this variational: 1. Replace point estimates with samples from a distribution. 2. Instead of regularizing that point, regularize the distribution.
  53. Replace point estimates with samples from a distribution. Without variational: embeddings = nn.Embedding(n_words, n_dim) ... c_vector = embeddings(c_index) #1
  54–59. Replace point estimates with samples from a distribution. With variational: embeddings_mu = nn.Embedding(n_words, n_dim); embeddings_lv = nn.Embedding(n_words, n_dim); ... vector_mu = embeddings_mu(c_index); vector_lv = embeddings_lv(c_index); c_vector = normal_sample(vector_mu, vector_lv). Each lookup now draws a sample (e.g. +0.32, +0.49, -0.21, +0.03, …) from a per-word mean (embeddings_mu) and log-variance (embeddings_lv). #1
  60. Replace point estimates with samples from a distribution. With variational: def normal_sample(mu, lv): variance = sqrt(exp(lv)); sample = mu + N(0, 1) * variance; return sample #1
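
A runnable version of the normal_sample idea on slide 60 (the reparameterization trick: draw eps from N(0, 1), then shift and scale it so gradients flow into the mean and log-variance embeddings); a minimal PyTorch sketch with names kept close to the slides:

    import torch
    import torch.nn as nn

    n_words, n_dim = 1000, 32
    embeddings_mu = nn.Embedding(n_words, n_dim)    # per-word mean
    embeddings_lv = nn.Embedding(n_words, n_dim)    # per-word log-variance

    def normal_sample(mu, lv):
        eps = torch.randn_like(mu)                  # N(0, 1) noise
        return mu + eps * torch.exp(0.5 * lv)       # mu + sigma * eps

    c_index = torch.tensor([42])
    vector_mu = embeddings_mu(c_index)
    vector_lv = embeddings_lv(c_index)
    c_vector = normal_sample(vector_mu, vector_lv)  # a fresh draw on every forward pass
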
  61. Replace regularizing a point with regularizing the distribution. Without variational: loss += c_vector.pow(2.0).sum() #2
  62. Replace regularizing a point with regularizing the distribution. With variational: loss += kl_divergence(vector_mu, vector_lv), pulling each word’s distribution toward a prior N(μ, σ). #2
  63–65. Replace regularizing a point with regularizing the distribution. With variational: loss += kl_divergence(vector_mu, vector_lv) #2
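
The kl_divergence term has a closed form when both the per-word posterior N(mu, exp(lv)) and the prior are diagonal Gaussians; the unit-Gaussian prior below is an assumption (the slide only says “Prior N(μ, σ)”):

    import torch

    def kl_divergence(mu, lv):
        # KL( N(mu, exp(lv)) || N(0, 1) ), summed over embedding dimensions
        return 0.5 * (torch.exp(lv) + mu.pow(2) - 1.0 - lv).sum()

    # drop-in replacement for the weight-decay penalty:
    #   loss = loss + kl_divergence(vector_mu, vector_lv)
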
  66–68. Replace point estimates with samples from a distribution. With variational: embeddings_mu = nn.Embedding(n_words, n_dim); embeddings_lv = nn.Embedding(n_words, n_dim); ... vector_mu = embeddings_mu(c_index); vector_lv = embeddings_lv(c_index); def normal(mu, lv): random = torch.randn(mu.size()); return mu + random * torch.exp(0.5 * lv); c_vector = normal(vector_mu, vector_lv)
  69–72. See also: ‘word2gauss’, which embeds each word as a Gaussian distribution (the figure builds up Bach, Composer, Classical, and Math).
  73. Part 2: Variational Methods …where we’ll make variational versions of: 1. word2vec, 2. Factorization Machines, 3. t-SNE
  74. Linear Regression
  75. Linear Regression (with 2nd-order interactions). Sums over all pairs of features (known and observed); 1 coefficient for each feature (unknown, to be estimated).
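
The equation on slide 75 did not survive the transcript; the standard second-order regression it describes (reconstructed, not copied from the deck) is

    \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j > i} w_{ij} \, x_i x_j

with the features x_i known/observed and the coefficients w_0, w_i, w_{ij} to be estimated; the table of w_{ij} is what factorization machines go on to compress.
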
  76. Factorization Machines: Regression with Factorized Interactions. https://github.com/cemoody/vfm
  77. Factorization Machines: Regression with Factorized Interactions, where each pairwise coefficient = a dot product of per-feature factors. https://github.com/cemoody/vfm
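
Factorization Machines keep that functional form but replace each pairwise coefficient with a dot product of per-feature factor vectors, w_{ij} ≈ ⟨v_i, v_j⟩, so the parameter count grows linearly in the number of features. A minimal NumPy sketch of the prediction (dense features for clarity; all names here are illustrative, not taken from the vfm repo):

    import numpy as np

    def fm_predict(x, w0, w, V):
        # x: (n,) features, w0: bias, w: (n,) linear weights, V: (n, k) factor matrix
        linear = w0 + w @ x
        # sum_{i<j} <v_i, v_j> x_i x_j computed in O(n k) via the standard identity
        xv = V.T @ x
        pairwise = 0.5 * (xv @ xv - ((V ** 2).T @ (x ** 2)).sum())
        return linear + pairwise

    n, k = 6, 3
    x, w0, w, V = np.random.rand(n), 0.1, np.random.randn(n), np.random.randn(n, k)
    print(fm_predict(x, w0, w, V))
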
  78–79. Variational Factorization Machines: Regression with Variational Factorized Interactions. https://github.com/cemoody/vfm
  80. Variational Factorization Machines: Regression with Variational Factorized Interactions. Can write out uncertainty of prediction! https://github.com/cemoody/vfm
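
The “uncertainty of prediction” falls out of the variational treatment: every weight is a distribution, so repeated draws give a spread of predictions. A hedged sketch of that idea, assuming a hypothetical sample_params() helper that returns one draw of (w0, w, V) from the fitted variational posterior:

    import numpy as np

    def predictive_mean_std(sample_params, x, n_samples=200):
        preds = []
        for _ in range(n_samples):
            w0, w, V = sample_params()          # one posterior draw of all FM parameters
            xv = V.T @ x
            pairwise = 0.5 * (xv @ xv - ((V ** 2).T @ (x ** 2)).sum())
            preds.append(w0 + w @ x + pairwise)
        preds = np.asarray(preds)
        return preds.mean(), preds.std()        # predictive mean and its uncertainty
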
  81. Part 2: Variational Methods …where we’ll make variational versions of: 1. word2vec, 2. Factorization Machines, 3. t-SNE
  82. t-SNE. Input: N D-dimensional vectors
  83. t-SNE. Input: N D-dimensional vectors (example coordinates shown in the figure)
  84–86. t-SNE. 100D+ Input → Pairwise Probabilities; 2D Output → Pairwise Probabilities
  87. t-SNE. Form matrix of pairwise probabilities: pij = p(choosing i given j from all points)
  88–91. t-SNE. Form the matrix of pairwise probabilities from pairwise distances (the slides show the Gaussian-kernel formula for pij applied to the example input vectors; figure only).
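
The pij formula itself is an image in the deck; the standard SNE/t-SNE definition it refers to (van der Maaten & Hinton, reconstructed here rather than transcribed) turns high-dimensional distances into probabilities with a Gaussian kernel:

    p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
                   {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
    \qquad
    p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

where each σ_i is chosen to hit a target perplexity.
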
  92–93. t-SNE. 100D+ Input → Pairwise Probabilities ✔ ✔ (the high-dimensional side is done); next: 2D Output → Pairwise Probabilities
  94–95. SNE: how to form the pairwise matrix q in the 2D space? (figure shows example 2D coordinates)
  96–97. SNE: q uses a Gaussian to convert distances into probabilities…
  98. SNE: q uses a Gaussian to convert distances into probabilities… Bad for outliers! (As we match high D with low D, we get lots of outliers.)
  99. t-SNE: use Student’s t-distribution (a heavy-tailed distribution) instead.
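
For the low-dimensional map, plain SNE uses the same Gaussian kernel for q, while t-SNE swaps in a Student’s t with one degree of freedom so mismatched points can sit far apart without huge gradients; the standard forms (again reconstructed, not transcribed):

    q_{ij}^{\text{SNE}} \propto \exp(-\lVert y_i - y_j \rVert^2),
    \qquad
    q_{ij}^{\text{t-SNE}} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
                                 {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}
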
  100. SNE vs t-SNE (comparison figure).
  101–102. t-SNE. 100D+ Input → Pairwise Probabilities p ✔; 2D Output → Pairwise Probabilities q ✔
  103. t-SNE. Match the two pairwise-probability matrices by minimizing the KL divergence KL(p || q).
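
Putting the pieces together: the 2D coordinates are just parameters optimized so that q matches p under KL(p || q). A minimal PyTorch sketch in the spirit of github.com/cemoody/topicsne (the repo’s actual code differs in details):

    import torch

    def tsne_loss(P, Y, eps=1e-12):
        # P: (N, N) fixed pairwise probabilities from the high-D space
        # Y: (N, 2) low-dimensional coordinates being optimized
        mask = 1.0 - torch.eye(Y.shape[0])
        d2 = torch.cdist(Y, Y).pow(2)           # squared pairwise distances in the map
        num = mask / (1.0 + d2)                 # Student-t kernel, zeroed on the diagonal
        Q = num / num.sum()
        return (P * (torch.log(P + eps) - torch.log(Q + eps))).sum()   # KL(P || Q)

    N = 100
    P = torch.rand(N, N); P = P + P.T; P.fill_diagonal_(0.0); P = P / P.sum()
    Y = torch.randn(N, 2, requires_grad=True)
    opt = torch.optim.Adam([Y], lr=0.1)
    for _ in range(50):
        opt.zero_grad(); tsne_loss(P, Y).backward(); opt.step()
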
  104. t-SNE: github.com/cemoody/topicsne
  105. Variational t-SNE. 100D+ Input → Pairwise Probabilities; 2D Output → Pairwise Probabilities
  106–110. (result figures) github.com/cemoody/topicsne
  111–112. t-SNE on MNIST vs Variational t-SNE on MNIST. Annotations on the variational map: “very confident!”, “not confident.”, “real?”
  113. Part 3: Upcoming
  114. “Topic” t-SNE: 100D+ Input → Pairwise Probabilities; 2D Output → Pairwise Probabilities. Stay tuned! Instead of each output point having an (x, y), each point loads on to a pseudo-discrete topic / cluster using the Gumbel-Softmax trick (a minimal sketch of the trick follows below).
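
The Gumbel-Softmax trick mentioned here lets each point load (almost) one-hot onto a topic while staying differentiable; a minimal sketch of the sampling step (Jang et al. / Maddison et al.), not the Stitch Fix implementation:

    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits, temperature=0.5):
        # add Gumbel(0, 1) noise to the logits, then take a softened argmax;
        # lower temperature means closer to a one-hot topic assignment
        u = torch.rand_like(logits)
        gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
        return F.softmax((logits + gumbel) / temperature, dim=-1)

    topic_logits = torch.randn(4, 10, requires_grad=True)    # 4 points, 10 pseudo-topics
    loadings = gumbel_softmax_sample(topic_logits)            # rows sum to 1, nearly one-hot
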
  115. Adversarial Text to Image: “blue razorback tank top” (still training). Pieces: Var W2V, Conv Decoder, Wasserstein GAN.
  116. Questions? @chrisemoody
  117. Part 3: Dynamic Graphs
  118. 1 trick: # declarative model = Sequential() model.add(Embedding(max_features, 128)) # try using a GRU instead, for fun model.add(LSTM(128, 128)) model.add(Dropout(0.5)) model.add(Dense(128, 1)) model.add(Activation('sigmoid')) # try using different optimizers # and different optimizer configs model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary") print("Train...") model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=4, validation_data=(X_test, y_test), show_accuracy=True) score, acc = model.evaluate(X_test, y_test, batch_size=batch_size, show_accuracy=True) # Neural net architecture x = chainer.Variable(x_data, volatile=not train) t = chainer.Variable(y_data, volatile=not train) h0 = model.embed(x) h1_in = model.l1_x(F.dropout(h0, train=train)) + model.l1_h(state['h1']) c1, h1 = F.lstm(state['c1'], h1_in) h2_in = model.l2_x(F.dropout(h1, train=train)) + model.l2_h(state['h2']) c2, h2 = F.lstm(state['c2'], h2_in) y = model.l3(F.dropout(h2, train=train)) state = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2} loss = F.softmax_cross_entropy(y, t) imperative compile data function
  119–120. The low level: x = Variable(np.ones(10)); y = Variable(np.ones(10)); loss = x + y
  121. Theano, symbolic variable: x = t.vector(‘x’); y = t.vector(‘y’); loss = x + y; In [47]: loss → Out[47]: theano.tensor.var.TensorVariable. Chainer, symbolic + numeric variable: x = Variable(np.ones(10)); y = Variable(np.ones(10)); loss = x + y; In [47]: loss.data → Out[47]: array([ 2., 2., 2., 2., 2., 2.]
  122. This gets deep.
  123. This gets very deep.
  124. …and then something goes wrong.
  125. …and then something goes wrong. …chainer computes everything at run time… so debug & investigate!
  126. …and then something goes wrong. In [47]: z.data Out[47]: array([ 2., 2., 2., nan, 2., 2.] …chainer computes everything at run time… so debug & investigate!
  127. Questions? @chrisemoody. Multithreaded (the Stitch Fix tech blog). Takeaways: 1. Use SVD instead of w2v. 2. Use t-SNE for interpreting your model. 3. Use k-SVD for interpreting your model. 4. Add sparsity to your models (e.g. as in lda2vec).
