This document discusses different methods for creating word embeddings from text data, including count vectors, positive pointwise mutual information (PPMI), principal component analysis (PCA), and neural network-based approaches like word2vec. It notes that count vectors are high-dimensional and sparse, while PPMI suffers from noise. PCA is presented as a solution that provides dimensionality reduction, densification, and noise reduction by finding a low-rank approximation of the PPMI matrix. The goal is to find embeddings that represent semantic similarity through distances in the embedding space.
Matrix Factorization
Demystifying word embeddings
Monthly data science team meeting // journal club
Vincent Spruyt
November 2017
Goal of this talk
1. Build intuition for how the following concepts are related
… and in many situations are even exactly the same!
2. Realize how many of these simple concepts can easily be applied at Sentiance
(we will discuss some potential applications)
PCA
SVD
LSA
word2vec
GloVe
look-alike modeling
Introduction: embeddings
1. What we have:
• Lots of unlabeled data
2. What we want:
• Discover hidden structure in the data
• Underlying ‘causes’ of the data
… and do this without any labels
Introduction: embeddings
1. E.g.
• Observed timelines:
• User 1: [home, car, work]
• User 2: [home, car, shop, car, work]
• User 3: [home, train, work]
Question: Is user 1 more similar to user 2 or to user 3?
• Latent factors:
- Factor 1: [commute]
- Factor 2: [shopping]
- Factor 3: [green transport]
- Factor 4: [non-green transport]
All observations can be explained by a combination of these factors
Introduction: embeddings
1. Final goal:
• Given the observation matrix:
- Find the optimal factors
AND
- Find the optimal weight matrix
• Two approaches:
- Matrix factorization (traditional approach)
- Machine learning (state-of-the-art)
-> Which works better?
-> Are they related?
That is the topic of this talk!
Outline
1. Word embeddings: count vectors
2. Word embeddings: PPMI
3. Word embeddings: PCA
4. Word embeddings: SVD
5. Word embeddings: Word2Vec
6. Word embeddings: GloVe
7. User Embeddings: LSA
8. User Embeddings: Collaborative filtering
Python code available!
Word embeddings: Toy example
• Input: user timelines of stationary visits
• Flatten the type hierarchy (we don’t care about the hierarchy for now)
• Small set of 3384 random users
• Total number of events: 789,043
• Vocabulary (event type) size: 125
E.g. [u'work', u'home', u'work', u'travel_busstop', u'work', u'home', u'travel_busstop', u'home',
u'shop_supermarket', u'work', u'travel_parking', u'travel_expensive', u'shop', u'shop',
u'travel_station', …]
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
         busstop  work  home  supermarket  parking
home        2      1     1        0           0
busstop     0      1     2        1           0
Word embeddings: count vectors
1. Little trick:
• Weigh context words by distance to target word
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
         busstop  work  home  supermarket  parking
home       1.5    1.0   0.5       0           0
busstop     0     0.5   2.0      0.5          0

Count weights: 1/2, 1/1, 1/1, 1/2, i.e. 1/distance to the target word
Later more on this trick and others
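As a concrete illustration of the counting and weighting above, here is a minimal Python sketch (not the original Sentiance code; build_cooc and its parameter names are illustrative):

```python
import numpy as np

def build_cooc(tokens, window_size=5):
    """Distance-weighted co-occurrence counts over a flat token list."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)))
    half = window_size // 2
    for pos, target in enumerate(tokens):
        lo, hi = max(0, pos - half), min(len(tokens), pos + half + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                # harmonic weighting: a context at distance d contributes 1/d
                cooc[index[target], index[tokens[ctx]]] += 1.0 / abs(ctx - pos)
    return cooc, vocab

timeline = ['work', 'home', 'work', 'travel_busstop', 'work', 'home',
            'travel_busstop', 'home', 'shop_supermarket', 'work', 'travel_parking']
counts, vocab = build_cooc(timeline)
```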
Word embeddings: count vectors
1. Word vectors:
• home = [1.5, 1.0, 0.5, 0.0, 0.0]
• busstop = [0.0, 0.5, 2.0, 0.5, 0.0]
• Distance between word vectors is
- almost independent of their co-occurrence
- almost completely determined by the similarity of the context they appear in
Word embeddings: count vectors
3. Vector length |v| ~ vocabulary size (which grows with the corpus)
                  shop_supermarket  travel_aerodrome  drinks_pub  travel_campsite  sport_golf  shop_marketplace  leisure_theatre  travel_station
shop_supermarket        12725              399             874           205            58             70               72              1425
travel_aerodrome          398             9274              55            11             1              9                8               416
drinks_pub                873               55             959            37             7              5               18               289
travel_campsite           202               11              37           762             1              1                2                16
sport_golf                 57                1               7             1          1212              1                0                 8

(Sparse, high-dimensional) word embeddings!
Word embeddings: count vectors
1. Let’s visualize in 2D using a manifold projection
• Isomap (just because it looked nicer than t-SNE in this case)
• Finds 2D space that maintains local distances
2. Normalize each vector to unit length first
• We are not interested in vector length (depends on vocabulary size)
• Divide by L2-norm
=> Euclidean distance ~ cosine distance
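A minimal sketch of this visualization step, reusing counts and vocab from the sketch above (scikit-learn’s Isomap with default settings stands in for the exact configuration used):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import Isomap
from sklearn.preprocessing import normalize

vectors = normalize(counts, norm='l2')          # unit length: Euclidean ~ cosine
coords = Isomap(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, vocab):
    plt.annotate(word, (x, y))
plt.show()
```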
Word embeddings: PPMI
1. Problem: Highly dominated by words that occur a lot
2. Solution: Pointwise Mutual Information
3. ‘How much more often do we observe (w, c) than expected by random chance?’
• PMI(w, c) = log( P(w, c) / (P(w) P(c)) )
• If w and c are completely uncorrelated: P(w, c) = P(w)P(c), so PMI = 0
4. Problem with PMI: negative values are often noisy
5. Solution: set them to zero => PPMI(w, c) = max(0, PMI(w, c))
Word embeddings: PPMI
1. If f_wc are the co-occurrence counts, with W (words) rows and C (contexts) columns, the probabilities are estimated as:
P(w, c) = f_wc / Σf    P(w) = Σ_c f_wc / Σf    P(c) = Σ_w f_wc / Σf
where Σf = Σ_w Σ_c f_wc
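A small numpy sketch of PPMI from these definitions. The shift_k and alpha arguments anticipate the shifted-PPMI and context-distribution-smoothing hyperparameters discussed later; with the defaults they have no effect (function and parameter names are illustrative):

```python
import numpy as np

def ppmi(cooc, shift_k=1.0, alpha=1.0):
    f = cooc.astype(float)
    total = f.sum()
    p_wc = f / total                               # joint P(w, c)
    p_w = f.sum(axis=1, keepdims=True) / total     # marginal P(w)
    ctx = f.sum(axis=0, keepdims=True) ** alpha    # (optionally smoothed) context counts
    p_c = ctx / ctx.sum()                          # marginal P(c)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(shift_k)
    pmi[~np.isfinite(pmi)] = 0.0                   # zero counts give PMI = -inf
    return np.maximum(pmi, 0.0)                    # clip negatives: PPMI

ppmi_matrix = ppmi(counts)                         # counts from the earlier sketch
```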
Word embeddings: PPMI
1. Define context window (e.g. skip-gram)
2. Count how many times each word co-occurs in the context window
3. Apply the PPMI transform to the counts
4. Vector length |v| ~ vocabulary size (which grows with the corpus)
                  shop_supermarket  travel_aerodrome  drinks_pub  travel_campsite  sport_golf  shop_marketplace  leisure_theatre  travel_station
shop_supermarket        0.24              0.0             0.0           0.29          0.0            0.16             0.0              0.0
travel_aerodrome        0.0               5.51            0.0           0.0           0.0            0.07             0.0              0.15
drinks_pub              0.0               0.0             2.64          0.89          0.0            0.0              0.41             0.0
travel_campsite         0.27              0.0             0.89          8.19          0.0            0.63             0.14             0.0
sport_golf              0.0               0.0             0.0           0.0           8.86           0.63             0.0              0.0

(Sparse, high-dimensional) word embeddings!
Word embeddings: PPMI
1. Problem 1: Huge dimensionality (e.g. a 50,000-D embedding vector)
• Distances in large dimensions become meaningless
• Everything is almost equally far away from everything else
2. Problem 2: Very sparse matrix
• Inefficient. Can we compress it (turn it into a dense matrix)?
• Aggravates the high-dimensional distance issue
3. Problem 3: Counts are very noisy
• Can we denoise the matrix?
• Can we transfer knowledge from other co-occurrence counts?
Word embeddings: PPMI
1. Solution:
• Find low-rank approximation of PPMI matrix
- Uncorrelated columns
-> Orthogonal basis
- Dense
- De-noised
- Low-dimensional (when decomposed using matrix factorization)
• => Compression + noise reduction
Word embeddings: PCA
1. Idea: PCA!
2. New columns: linear combinations of old columns
3. De-correlated dimensions => Low rank!
• X’ = VX gives the reduced dimensionality (X is the input data, rows of V are the top eigenvectors of its covariance matrix)
• X’’ = VTX’ gives the reconstructed low-rank approximation
4. Keeps dimensions of largest variance => Noise reduction!
• Assuming that the noise has smaller variance than the information
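A small numpy sketch of these two steps. Note that it uses the row-vector convention, so the projection reads X' = X_c Vᵀ rather than the slide’s column-vector X' = VX (names are illustrative):

```python
import numpy as np

def pca_low_rank(X, k):
    mean = X.mean(axis=0)
    Xc = X - mean                                    # PCA requires centering
    cov = np.cov(Xc, rowvar=False)                   # covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]].T  # top-k components, one per row
    X_reduced = Xc @ V.T                             # X' : coordinates in the k-D subspace
    X_reconstructed = X_reduced @ V + mean           # X'': low-rank approximation
    return X_reduced, X_reconstructed
```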
Word embeddings: PCA
1. Problems:
1. PCA breaks down if the data is not normally distributed
- Word co-occurrences are all positive, so not Gaussian
=> No guarantee that the eigenvectors capture the largest variance
Word embeddings: PCA
1. Problems:
2. PCA requires centering the data (subtract mean)
1. Sparseness lost => Computationally more complex
2. Angle between vectors not preserved => Cosine distance?
E.g. two sparse, non-negative vectors u and v at Θ = 90°:
cosine distance = 1 - (u · v) / (||u|| ||v||) = 1 - cos(90°) = 1.0
After centering, the same vectors can end up at Θ = 180°:
cosine distance = 1 - cos(180°) = 2.0
Word embeddings: SVD
1. Forget about capturing largest variance
2. Given co-occurrence matrix A
• Find low-rank approximation Ak
• That minimizes ||A – Ak||F
Note: the Frobenius norm ||A||F of a matrix A is simply the Euclidean norm of the vector containing its elements: ||A||F = sqrt(Σ_ij a_ij²)
3. Optimal solution:
• Singular Value Decomposition (SVD)
Word embeddings: SVD
1. SVD: A = U S Vᵀ, with r = rank of A
• A: m × n matrix
• U: m × m orthogonal matrix (left singular vectors)
• S: m × n diagonal matrix of singular values
• Vᵀ: n × n orthogonal matrix (right singular vectors)
For skip-gram PMI, m = n and A is symmetric. But let’s look at the general case here.
Word embeddings: SVD
1. SVD: A = U S Vᵀ
• Columns of V: eigenvectors of AᵀA
• Columns of U: eigenvectors of AAᵀ
• Singular values in S: square roots of the eigenvalues of AAᵀ (and of AᵀA)
• Remember: for centered data, cov(A) ~ AᵀA and cov(Aᵀ) ~ AAᵀ
=> If the data is centered: US = PCA(A), i.e. the SVD directly gives the PCA projection
Word embeddings: SVD
S is diagonal
=> A = (US)Vᵀ and A = U(SVᵀ)
1. First case:
Express rows of A as linear combinations of the rows of Vᵀ, with (US) as coefficients
2. Second case:
Express columns of A as linear combinations of the columns of U, with (SVᵀ) as coefficients
Word embeddings: SVD
S is diagonal, so let D = S^(1/2) => A = UDDVᵀ = (UD)(VD)ᵀ (remember: (AB)ᵀ = BᵀAᵀ, and Dᵀ = D)
1. Word embeddings: W = UD = US^(1/2)
2. Context embeddings: C = VD = VS^(1/2)
- Equal in the case of skip-gram word contexts
- Because in that case A is symmetric => U == V
- Not equal in other cases
Word embeddings: SVD
1. We now have low-rank word embeddings
-> Word ~ linear combination of decorrelated (orthogonal) factors
2. Still high dimension (~ vocabulary size) and noisy
-> Next step: dimensionality reduction
Word embeddings: SVD
1. Truncated SVD: keep only the k largest singular values (and the corresponding columns of U and V)
• A’ = U_k S_k V_kᵀ, with U_k an m × k matrix, S_k a k × k matrix and V_kᵀ a k × n matrix
• A’ is the denoised, low-rank approximation of A (rows ~ words, columns ~ contexts)
• Rows of U_k S_k: k-dimensional word embeddings
• Rows of V_k S_k: k-dimensional context embeddings
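A numpy sketch of this truncation, including the eigenvalue-weighting exponent p discussed later (p = 0.5 gives the W = U_k S_k^(1/2) embeddings derived above; names are illustrative):

```python
import numpy as np

def svd_embeddings(ppmi_matrix, k=50, p=0.5):
    u, s, vt = np.linalg.svd(ppmi_matrix, full_matrices=False)
    u_k, s_k, v_k = u[:, :k], s[:k], vt[:k, :].T   # singular values come sorted descending
    word_emb = u_k * s_k ** p                      # rows: k-dimensional word embeddings
    ctx_emb = v_k * s_k ** p                       # rows: k-dimensional context embeddings
    approx = (u_k * s_k) @ v_k.T                   # denoised, low-rank approximation A'
    return word_emb, ctx_emb, approx
```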
Word embeddings: SVD
1. Words as linear combinations of topics
• A’ = (U_k S_k) V_kᵀ, with U_k S_k an m × k matrix
• Rows of U_k S_k: word vectors
• Columns: linear weights => topics!
Let’s have a look at some of these topics…
Word embeddings: SOTA
1. How does this differ from:
• Word2vec
• GloVe
Must read:
Levy et al. “Neural word embedding as implicit matrix factorization.”
https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf
Levy et al. "Improving distributional similarity with lessons learned from word embeddings."
https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf
Great summary blog posts:
• http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/
• http://ruder.io/secret-word2vec/
• https://rare-technologies.com/making-sense-of-word2vec/
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
Word embeddings: Word2Vec
(source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Only 1 hidden layer, no activation function => Linear!
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
(Diagram: the word embedding matrix W and the context embedding matrix C, each with d-dimensional rows)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• What if you multiply them together?
(Diagram: what does the product W · Cᵀ give?)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Levy et al. proved: for a large enough d and enough iterations,
W · Cᵀ converges to a shifted version of the word-context PPMI matrix!
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Optimal value of w · c = max(0.0, PMI(w, c) - log(k))
• k = number of negative samples
• log(k) acts as a prior/regularizer and is 0 if k=1
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Thus: Word2Vec factorizes the word-context co-occurrence matrix
• So does SVD!
- Although the matrices here are unconstrained (not orthogonal)
• What about GloVe?
Word embeddings: GloVe
1. GloVe
• Explicitly starts by calculating the co-occurrence matrix X
• Defines a cost function
- Predict the co-occurrence of two words based on their word embedding vectors
• Optimizes a weighted least-squares objective:
J = Σ_ij f(X_ij) (w_i · c_j + b_i + b_j - log X_ij)²
- The weighting function f decreases the importance of high co-occurrence counts
Word embeddings: GloVe
1. GloVe
• Thus: GloVe factorizes the (logarithm of the) word-context co-occurrence matrix
• So does SVD!
Word embeddings: Hyperparameters
1. Do Word2Vec and GloVe perform better than SVD?
• They all do matrix factorization
• But the literature seems to suggest ‘yes’?
• Levy et al. showed:
- Reformulating the prediction-based methods (Word2Vec, GloVe) in the count-based (SVD, PMI) framework reveals a bunch of new hyperparameters
- The performance difference is all in the hyperparameters!
Word embeddings: Hyperparameters
1. Preprocessing (word2vec)
1. Dynamic context windows
2. Subsampling
3. Deleting rare words
2. Postprocessing (GloVe)
1. Adding word and context vectors
2. Eigenvalue weighting
3. Vector normalization
3. Association Metric (skip-gram with negative sampling)
1. Shifted PPMI
2. Context distribution smoothing
Word embeddings: Hyperparameters
1. Preprocessing: Dynamic context windows
1. Assign higher weight to contexts closer to the target word
2. GloVe: Harmonic function
-> Weight = 1/N, where N is the number of tokens between the two words
3. Word2Vec:
-> Weight = (S - N + 1)/S, where S is the maximum window size
4. Why is this called a ‘dynamic context window’?
-> Word2Vec implements this weighting by randomly sampling the window size
Word embeddings: Hyperparameters
1. Preprocessing: Subsampling
1. With probability p, remove words with frequency f higher than threshold t
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Word embeddings: Hyperparameters
1. Preprocessing: Deleting rare words
1. Remove words that occur very infrequently in the training data
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Word embeddings: Hyperparameters
1. Postprocessing: Adding word and context vectors (e.g. GloVe)
• The resulting cosine distance:
- combines first-order similarities (w · c) and second-order similarities (w_x · w_y, c_x · c_y)
- (Proof: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
Word embeddings: Hyperparameters
1. Postprocessing: Eigenvalue weighting
1. Different ways to get word vectors W and context vectors C:
- W = U, C = V
- W = US, C = V
- W = US^(1/2), C = VS^(1/2)
2. In general:
- W = US^p, C = VS^p
Word embeddings: Hyperparameters
1. Postprocessing: Vector normalization
• Different ways to normalize W
1. L2-normalize the rows => word vectors are unit length
2. L2-normalize the columns
3. Both, or other norms
In practice, the first option usually works significantly better than the others
Word embeddings: Hyperparameters
1. Association metric: Shifted PPMI
• SPPMI = max(0.0, PMI(w, c)-log(k))
• k acts as a prior on the probability of observing a positive co-occurrence
• This is what skip-gram Word2Vec with negative sampling (k negative samples) implicitly optimizes
Word embeddings: Hyperparameters
1. Association metric: Context distribution smoothing
• PMI is biased towards rare words
• To alleviate this bias:
- Raise the context counts to the power α ≤ 1 (e.g. α = 0.75 seems to work well): P_α(c) = #(c)^α / Σ_c' #(c')^α
- This increases P_α(c) for rare words => reduces their PMI
• Implicitly done in Word2Vec during negative sampling
Word embeddings: Comparison
1. Levy et al. compared:
• PPMI
• SVD
• Word2Vec (skip-gram with negative sampling: SGNS)
• GloVe
• Use the same hyper parameters everywhere (if transferable)
• Evaluate different hyper parameter values
Word embeddings: Comparison
1. Levy et al. compared:
• 8 benchmarks
- 6 word similarity tasks
- 2 analogy tasks
Word embeddings: Comparison
1. Levy et al. conclusions:
• Word2Vec outperforms GloVe on every task
- If used with the same hyper parameters!
• SVD slightly outperforms Word2Vec sometimes
- If used with the same hyper parameters!
- On word similarity tasks
• Word2Vec slightly outperforms SVD sometimes
- If used with the same hyper parameters!
- On word analogy tasks
Word embeddings: Comparison
1. Nevertheless, GloVe/Word2Vec:
• Can be easier to train sometimes
• Hyper parameters are implicitly part of the algorithms
• Outperform SVD out-of-the-box
Main take-away:
• Mind the hyper parameters
• Quick experiments/results with SVD (or even PCA) can give good indications!
• If we had subtracted the means from all columns before SVD… this would just be PCA!
Latent Semantic Analysis
1. What if we have:
• Multiple documents (or e.g. users)
• A whole set of words per document (or e.g. user specific attributes)
2. And we want:
• Word embeddings
• User embeddings
Can we re-use the same SVD approach described earlier?
-> Remember: Input matrix A does not have to be square!
Latent Semantic Analysis
1. First step: Create some kind of co-occurrence matrix
2. But for term-documents this time
3. For each word, count how many times it occurs in each document
(Matrix: words × documents, or, in our case, venue-visit counts × users)
Latent Semantic Analysis
1. Normalize counts to decrease importance of words that occur in all docs
2. Typical way: Term Frequency Inverse Document Frequency (TF-IDF) matrix
3. Term frequency (double-normalized variant):
-> tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
-> K is often set to 0.4 or 0.5 to make sure small changes in the counts don’t cause big changes in the value
Latent Semantic Analysis
1. Term frequency: tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
2. Inverse document frequency: idf(t) = log(N / n_t), with N the number of documents and n_t the number of documents containing term t
3. tf-idf = tf · idf
(Matrix of tf-idf values: words × documents; many different variants exist, one of which is sketched below)
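A numpy sketch of this particular tf-idf variant (many others exist; the layout here is documents × words, the transpose of the slide’s matrix):

```python
import numpy as np

def tfidf(counts, K=0.5):
    """counts: documents x words matrix of raw term counts."""
    max_f = np.maximum(counts.max(axis=1, keepdims=True), 1)
    tf = K + (1.0 - K) * counts / max_f              # K-smoothed term frequency
    n_docs = counts.shape[0]
    doc_freq = np.maximum((counts > 0).sum(axis=0), 1)
    idf = np.log(n_docs / doc_freq)                  # inverse document frequency
    return tf * idf
```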
Latent Semantic Analysis
1. Truncated SVD: keep k dimensions
• A’ = U_k S_k V_kᵀ: the denoised, low-rank approximation (rows ~ words, columns ~ documents)
• Rows of U_k S_k: k-dimensional word embeddings
• Rows of V_k S_k: k-dimensional document embeddings
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
Latent Semantic Analysis
1. Our toy example:
• 5000 users
• For each user: count events
• Calculate tf-idf and apply SVD
• Goal: get user embeddings for look-alike modeling
Latent Semantic Analysis
1. Our toy example:
• Add info on time-of-day, duration and distance
- E.g. duration: log(duration_in_minutes)/log(1000)
- Considered at 1-digit precision
Latent Semantic Analysis
1. Our toy example:
• Similar for distance and time-of-day
• Append augmentations to suffix list
• Add all possible suffix combinations to vocabulary and increase count
• E.g. Car trip of 10 km in 8 minutes at 20h00:
- Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’]
- Increase term-frequency counts for:
– Car
– Car_dur:4
– Car_dur:4_dist:4
– Car_dur:4_dist:4_tod:20
– Car_dist:4
– Car_dist:4_tod:20
– Car_tod:20
– Car_tod:20_dur:4
Allows LSA to learn relations between these different features
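One way to generate these augmented terms with itertools (a sketch; it lists one combination in a slightly different order than the slide):

```python
from itertools import combinations

def augmented_terms(base, suffixes):
    terms = [base]
    for r in range(1, len(suffixes) + 1):
        for combo in combinations(suffixes, r):    # every non-empty suffix subset
            terms.append('_'.join([base] + list(combo)))
    return terms

print(augmented_terms('Car', ['dur:4', 'dist:4', 'tod:20']))
# ['Car', 'Car_dur:4', 'Car_dist:4', 'Car_tod:20', 'Car_dur:4_dist:4', ...]
```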
Latent Semantic Analysis
To visualize our toy example:
1. Check which segments we currently assign to these users
- Rule based segments
2. Randomly select two segments
3. Find a linear transformation of our feature space
- That maximizes the distance between samples which:
– Have segment 1
– Have segment 2
– Have segment 1 and 2
– Have neither of the two segments
-> Linear Discriminant Analysis
Latent Semantic Analysis
To quickly estimate the capabilities:
1. Balance classes (resample majority class)
2. Calculate user embeddings
3. Train a logistic regressor
- Treat current segment detections as ground truth
- 10-fold cross validation
- Binary classification
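A sketch of this quick evaluation with scikit-learn, assuming numpy arrays user_emb (the embeddings) and binary labels (the current segment detections) exist:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

pos, neg = user_emb[labels == 1], user_emb[labels == 0]
n = min(len(pos), len(neg))                         # downsample the majority class
pos = resample(pos, n_samples=n, replace=False)
neg = resample(neg, n_samples=n, replace=False)
X = np.vstack([pos, neg])
y = np.array([1] * n + [0] * n)

scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```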
Latent Semantic Analysis
1. What do these feature spaces capture?
2. Fun experiment:
1. We have ‘age’ data for 500 users.
2. Let’s see if different ages behave differently
3. If so: we have an age classifier, built in a few minutes’ time!
• Simple logistic regression
• 200D feature space
• 10-fold cross-validation
• Accuracy: 56%
• Recall: 55%
Immediate use-case:
Split up our parent segment (mom/dad)
• If we only trust predictions with prob > 0.7
• Then we only classify 70% of the users
• Accuracy: 70%
• Recall: 70%
Latent Semantic Analysis
1. How to project a new document into that space?
A = U S Vᵀ
=> Vᵀ = S^-1 U^-1 A = S^-1 Uᵀ A (U is orthogonal, so U^-1 = Uᵀ)
=> V = (S^-1 Uᵀ A)ᵀ = Aᵀ U S^-1
New document q: q’ = q U S^-1
So: we can project new users in an existing, pre-trained feature space!
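A one-function sketch of this fold-in step, reusing the truncated u_k and s_k from the earlier SVD sketch:

```python
def fold_in(q, u_k, s_k):
    """q: raw (tf-idf weighted) term vector of a new document/user."""
    return q @ u_k / s_k        # q' = q U S^-1: k-dimensional embedding
```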
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
4. W represents words as linear combinations of ‘topics’
5. D represents documents as a linear combination of ‘topics’
6. S represents the importance of each topic
Latent Semantic Analysis
1. Topics are ‘semantic dimensions’
• Uncorrelated (orthogonal) factors that can be linearly combined to obtain all terms
2. Topic analysis?
1. Each column of W is a topic
2. The value inside the cell shows how important a word is to that topic
3. Looking at the most important ones can help interpreting topic meanings
Latent Semantic Analysis
1. LSA basically finds the latent factors underlying the terms and documents:
(Diagram: documents are described by topics, and topics contain words, e.g. Documents 1-4 -> Topics 1-3 -> Words 1-5)
Latent Semantic Analysis
1. Remember:
1. We simply applied SVD
2. Exactly in the same way as for word embeddings
3. But now we also have document embeddings
2. Also remember:
• If we had subtracted the mean from all our columns beforehand…
• Then SVD would just be PCA! (you would get the same word embeddings)
Recommender systems: look-alike modeling
1. We have a user-item matrix with attributes
2. Some entries might be missing or noisy
-> Replace the few missing entries with their average and consider them noise
3. We now want to:
1. Calculate user-similarity. Which users have similar behavior?
2. Remove the noise from our original matrix (and fill/correct missing values)
Recommender systems: look-alike modeling
1. Solution:
• Treat users as ‘Documents’
• Treat user attributes as ‘term counts’
• Attributes have different meaning/units
- So we have to standardize them: use z-scores
• Apply SVD (LSA)
- To find user embeddings
- To get the most similar users
• Exactly the same as what we did for LSA!
Recommender systems: look-alike modeling
1. Truncated SVD: keep k dimensions
• A’ = U_k S_k V_kᵀ: the denoised, low-rank approximation (rows ~ users, columns ~ attributes)
• Rows of U_k S_k: k-dimensional user embeddings
• Rows of V_k S_k: k-dimensional item embeddings
Recommender systems: look-alike modeling
1. Note:
• If the ratings or attributes have different meanings and scales:
- Need to standardize (subtract mean, divide by std)
- In this case, SVD is exactly the same as PCA (to get the item embeddings)!
- But we get the user embeddings for free with SVD
Recommender systems: look-alike modeling
1. Reduce noise without reducing the dimensionality?
1. S contains the singular values on its diagonal
2. Set the k smallest singular values to 0
3. Now calculate A’ = U S Vᵀ
2. We now have cleaned estimates of the users’ attributes
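A numpy sketch of this denoising step (k_drop is the number of smallest singular values to zero out):

```python
import numpy as np

def denoise(A, k_drop):
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    s[len(s) - k_drop:] = 0.0    # singular values are sorted descending
    return (u * s) @ vt          # cleaned estimate A' = U S V^T
```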
Recommender systems: look-alike modeling
1. Input: user attributes (‘Sentiance attributes table’)
1. Filtered out non-numeric columns
2. Kept all numeric columns named *.value
3. Except for ‘has_home.value’ and ‘is_stable.value’
2. To evaluate the performance:
1. Similar exercise as before
2. Treat current segment detections as ground truth
3. This time we use 90k users (versus 5k users before)
This is just the beginning...
1. Non-linear layers
• E.g. LSTM encoders
-> Careful with non-linearities if you want a Euclidean space!
2. Self-supervised learning
• E.g. triplet-loss optimization
-> Explicitly optimizes to find a Euclidean metric space!
3. Graph embeddings
• E.g. ‘deep city’
-> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector
Let’s keep that for another talk.
Conclusion
1. Embeddings:
1. Are extremely powerful
2. Allow us to cope with noisy, unlabeled input data
3. Are not magic
- You might need to fine-tune them for specific tasks (e.g. word embeddings for sentiment analysis versus analogy inference)
- They may not capture differences caused by rare events
2. Often you can just start with:
1. PCA
2. SVD