This document discusses different methods for creating word embeddings from text data, including count vectors, positive pointwise mutual information (PPMI), principal component analysis (PCA), and neural network-based approaches like word2vec. It notes that count vectors are high-dimensional and sparse, while PPMI suffers from noise. PCA is presented as a solution that provides dimensionality reduction, densification, and noise reduction by finding a low-rank approximation of the PPMI matrix. The goal is to find embeddings that represent semantic similarity through distances in the embedding space.
Matrix Factorization
Demystifying word embeddings
Monthly data science team meeting // journal club
Vincent Spruyt
November 2017
Goal of this talk
1. Build intuition for how the following concepts are related
… and in many situations are even exactly the same!
2. Realize how many of these simple concepts can easily be applied at Sentiance
(we will discuss some potential applications)
PCA
SVD
LSA
word2vec
GloVe
look-alike modeling
Introduction: embeddings
1. What we have:
• Lots of unlabeled data
2. What we want:
• Discover hidden structure in the data
• Underlying ‘causes’ of the data
… and do this without any labels
Introduction: embeddings
1. E.g.
• Observed timelines:
• User 1: [home, car, work]
• User 2: [home, car, shop, car, work]
• User 3: [home, train, work]
Question: Is user 1 more similar to user 2 or to user 3?
• Latent factors:
- Factor 1: [commute]
- Factor 2: [shopping]
- Factor 3: [green transport]
- Factor 4: [non-green transport]
All observations can be explained by a combination of these factors
Introduction: embeddings
1. Final goal:
• Given the observation matrix:
- Find the optimal factors
AND
- Find the optimal weight matrix
• Two approaches:
- Matrix factorization (traditional approach)
- Machine learning (state-of-the-art)
-> Which works better?
-> Are they related?
That is the topic of this talk!
Outline
1. Word embeddings: count vectors
2. Word embeddings: PPMI
3. Word embeddings: PCA
4. Word embeddings: SVD
5. Word embeddings: Word2Vec
6. Word embeddings: GloVe
7. User Embeddings: LSA
8. User Embeddings: Collaborative filtering
Python code available!
Word embeddings: Toy example
• Input: user timelines of stationary visits
• Flatten the type hierarchy (we don’t care about the hierarchy for now)
• Small set of 3384 random users
• Total number of events: 789,043
• Vocabulary (event type) size: 125
E.g. [u'work', u'home', u'work', u'travel_busstop', u'work', u'home', u'travel_busstop', u'home',
u'shop_supermarket', u'work', u'travel_parking', u'travel_expensive', u'shop', u'shop',
u'travel_station', …]
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
         busstop  work  home  supermarket  parking
home        2      1     1        0           0
busstop     0      1     2        1           0
Word embeddings: count vectors
1. Little trick:
• Weigh context words by distance to target word
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
         busstop  work  home  supermarket  parking
home       1.5    1.0   0.5       0           0
busstop     0     0.5   2.0      0.5          0

Count weights: 1/2, 1/1, 1/1, 1/2, i.e. 1/distance to the target word
Later more on this trick and others
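As a concrete illustration of the counting and weighting above, here is a minimal Python sketch (not the original Sentiance code; build_cooc and its parameter names are illustrative):

```python
import numpy as np

def build_cooc(tokens, window_size=5):
    """Distance-weighted co-occurrence counts over a flat token list."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)))
    half = window_size // 2
    for pos, target in enumerate(tokens):
        lo, hi = max(0, pos - half), min(len(tokens), pos + half + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                # harmonic weighting: a context at distance d contributes 1/d
                cooc[index[target], index[tokens[ctx]]] += 1.0 / abs(ctx - pos)
    return cooc, vocab

timeline = ['work', 'home', 'work', 'travel_busstop', 'work', 'home',
            'travel_busstop', 'home', 'shop_supermarket', 'work', 'travel_parking']
counts, vocab = build_cooc(timeline)
```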
Word embeddings: count vectors
1. Word vectors:
• home = [1.5, 1.0, 0.5, 0.0, 0.0]
• busstop = [0.0, 0.5, 2.0, 0.5, 0.0]
• Distance between word vectors is
- almost independent of their co-occurrence
- almost completely determined by the similarity of the context they appear in
Word embeddings: count vectors
3. Vector length |v| ~ vocabulary size (which grows with the corpus)
                  shop_supermarket  travel_aerodrome  drinks_pub  travel_campsite  sport_golf  shop_marketplace  leisure_theatre  travel_station
shop_supermarket        12725              399             874           205            58             70               72              1425
travel_aerodrome          398             9274              55            11             1              9                8               416
drinks_pub                873               55             959            37             7              5               18               289
travel_campsite           202               11              37           762             1              1                2                16
sport_golf                 57                1               7             1          1212              1                0                 8

(Sparse, high-dimensional) word embeddings!
Word embeddings: count vectors
1. Let’s visualize in 2D using a manifold projection
• Isomap (just because it looked nicer than t-SNE in this case)
• Finds 2D space that maintains local distances
2. Normalize each vector to unit length first
• We are not interested in vector length (depends on vocabulary size)
• Divide by L2-norm
=> Euclidean distance ~ cosine distance
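A minimal sketch of this visualization step, reusing counts and vocab from the sketch above (scikit-learn’s Isomap with default settings stands in for the exact configuration used):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import Isomap
from sklearn.preprocessing import normalize

vectors = normalize(counts, norm='l2')          # unit length: Euclidean ~ cosine
coords = Isomap(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, vocab):
    plt.annotate(word, (x, y))
plt.show()
```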
Word embeddings: PPMI
1. Problem: Highly dominated by words that occur a lot
2. Solution: Pointwise Mutual Information
3. ‘How much more often do we observe (w, c) than expected by random chance?’
• PMI(w, c) = log( P(w, c) / (P(w) P(c)) )
• If w and c are completely uncorrelated: P(w, c) = P(w)P(c), so PMI = 0
4. Problem with PMI: negative values are often noisy
5. Solution: set them to zero => PPMI(w, c) = max(0, PMI(w, c))
Word embeddings: PPMI
1. If f_wc are the co-occurrence counts, with W (words) rows and C (contexts) columns, the probabilities are estimated as:
P(w, c) = f_wc / Σf    P(w) = Σ_c f_wc / Σf    P(c) = Σ_w f_wc / Σf
where Σf = Σ_w Σ_c f_wc
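A small numpy sketch of PPMI from these definitions. The shift_k and alpha arguments anticipate the shifted-PPMI and context-distribution-smoothing hyperparameters discussed later; with the defaults they have no effect (function and parameter names are illustrative):

```python
import numpy as np

def ppmi(cooc, shift_k=1.0, alpha=1.0):
    f = cooc.astype(float)
    total = f.sum()
    p_wc = f / total                               # joint P(w, c)
    p_w = f.sum(axis=1, keepdims=True) / total     # marginal P(w)
    ctx = f.sum(axis=0, keepdims=True) ** alpha    # (optionally smoothed) context counts
    p_c = ctx / ctx.sum()                          # marginal P(c)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(shift_k)
    pmi[~np.isfinite(pmi)] = 0.0                   # zero counts give PMI = -inf
    return np.maximum(pmi, 0.0)                    # clip negatives: PPMI

ppmi_matrix = ppmi(counts)                         # counts from the earlier sketch
```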
Word embeddings: PPMI
1. Define context window (e.g. skip-gram)
2. Count how many times each word co-occurs in the context window
3. Apply the PPMI transform to the counts
4. Vector length |v| ~ vocabulary size (which grows with the corpus)
                  shop_supermarket  travel_aerodrome  drinks_pub  travel_campsite  sport_golf  shop_marketplace  leisure_theatre  travel_station
shop_supermarket        0.24              0.0             0.0           0.29          0.0            0.16             0.0              0.0
travel_aerodrome        0.0               5.51            0.0           0.0           0.0            0.07             0.0              0.15
drinks_pub              0.0               0.0             2.64          0.89          0.0            0.0              0.41             0.0
travel_campsite         0.27              0.0             0.89          8.19          0.0            0.63             0.14             0.0
sport_golf              0.0               0.0             0.0           0.0           8.86           0.63             0.0              0.0

(Sparse, high-dimensional) word embeddings!
Word embeddings: PPMI
1. Problem 1: Huge dimensionality (e.g. a 50,000-D embedding vector)
• Distances in large dimensions become meaningless
• Everything is almost equally far away from everything else
2. Problem 2: Very sparse matrix
• Inefficient. Can we compress it (turn it into a dense matrix)?
• Aggravates the high-dimensional distance issue
3. Problem 3: Counts are very noisy
• Can we denoise the matrix?
• Can we transfer knowledge from other co-occurrence counts?
Word embeddings: PPMI
1. Solution:
• Find low-rank approximation of PPMI matrix
- Uncorrelated columns
-> Orthogonal basis
- Dense
- De-noised
- Low-dimensional (when decomposed using matrix factorization)
• => Compression + noise reduction
Word embeddings: PCA
1. Idea: PCA!
2. New columns: linear combinations of old columns
3. De-correlated dimensions => Low rank!
• X’ = VX gives the reduced dimensionality (X is the input data, rows of V are the top eigenvectors of its covariance matrix)
• X’’ = VTX’ gives the reconstructed low-rank approximation
4. Keeps dimensions of largest variance => Noise reduction!
• Assuming that the noise has smaller variance than the information
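A small numpy sketch of these two steps. Note that it uses the row-vector convention, so the projection reads X' = X_c Vᵀ rather than the slide’s column-vector X' = VX (names are illustrative):

```python
import numpy as np

def pca_low_rank(X, k):
    mean = X.mean(axis=0)
    Xc = X - mean                                    # PCA requires centering
    cov = np.cov(Xc, rowvar=False)                   # covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]].T  # top-k components, one per row
    X_reduced = Xc @ V.T                             # X' : coordinates in the k-D subspace
    X_reconstructed = X_reduced @ V + mean           # X'': low-rank approximation
    return X_reduced, X_reconstructed
```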
Word embeddings: PCA
1. Problems:
1. PCA breaks down if the data is not normally distributed
- Word co-occurrences are all positive, so not Gaussian
=> No guarantee that the eigenvectors capture the largest variance
Word embeddings: PCA
1. Problems:
2. PCA requires centering the data (subtract mean)
1. Sparseness lost => Computationally more complex
2. Angle between vectors not preserved => Cosine distance?
E.g. two sparse, non-negative vectors u and v at Θ = 90°:
cosine distance = 1 - (u · v) / (||u|| ||v||) = 1 - cos(90°) = 1.0
After centering, the same vectors can end up at Θ = 180°:
cosine distance = 1 - cos(180°) = 2.0
Word embeddings: SVD
1. Forget about capturing largest variance
2. Given co-occurrence matrix A
• Find low-rank approximation Ak
• That minimizes ||A – Ak||F
Note: the Frobenius norm ||A||F of a matrix A is simply the Euclidean norm of the vector containing its elements: ||A||F = sqrt(Σ_ij a_ij²)
3. Optimal solution:
• Singular Value Decomposition (SVD)
Word embeddings: SVD
1. SVD: A = U S Vᵀ, with r = rank of A
• A: m × n matrix
• U: m × m orthogonal matrix (left singular vectors)
• S: m × n diagonal matrix of singular values
• Vᵀ: n × n orthogonal matrix (right singular vectors)
For skip-gram PMI, m = n and A is symmetric. But let’s look at the general case here.
Word embeddings: SVD
1. SVD: A = U S Vᵀ
• Columns of V: eigenvectors of AᵀA
• Columns of U: eigenvectors of AAᵀ
• Singular values in S: square roots of the eigenvalues of AAᵀ (and of AᵀA)
• Remember: for centered data, cov(A) ~ AᵀA and cov(Aᵀ) ~ AAᵀ
=> If the data is centered: US = PCA(A), i.e. the SVD directly gives the PCA projection
Word embeddings: SVD
S is diagonal
=> A = (US)Vᵀ and A = U(SVᵀ)
1. First case:
Express rows of A as linear combinations of the rows of Vᵀ, with (US) as coefficients
2. Second case:
Express columns of A as linear combinations of the columns of U, with (SVᵀ) as coefficients
Word embeddings: SVD
S is diagonal, so let D = S^(1/2) => A = UDDVᵀ = (UD)(VD)ᵀ (remember: (AB)ᵀ = BᵀAᵀ, and Dᵀ = D)
1. Word embeddings: W = UD = US^(1/2)
2. Context embeddings: C = VD = VS^(1/2)
- Equal in the case of skip-gram word contexts
- Because in that case A is symmetric => U == V
- Not equal in other cases
Word embeddings: SVD
1. We now have low-rank word embeddings
-> Word ~ linear combination of decorrelated (orthogonal) factors
2. Still high dimension (~ vocabulary size) and noisy
-> Next step: dimensionality reduction
Word embeddings: SVD
1. Truncated SVD: keep only the k largest singular values (and the corresponding columns of U and V)
• A’ = U_k S_k V_kᵀ, with U_k an m × k matrix, S_k a k × k matrix and V_kᵀ a k × n matrix
• A’ is the denoised, low-rank approximation of A (rows ~ words, columns ~ contexts)
• Rows of U_k S_k: k-dimensional word embeddings
• Rows of V_k S_k: k-dimensional context embeddings
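A numpy sketch of this truncation, including the eigenvalue-weighting exponent p discussed later (p = 0.5 gives the W = U_k S_k^(1/2) embeddings derived above; names are illustrative):

```python
import numpy as np

def svd_embeddings(ppmi_matrix, k=50, p=0.5):
    u, s, vt = np.linalg.svd(ppmi_matrix, full_matrices=False)
    u_k, s_k, v_k = u[:, :k], s[:k], vt[:k, :].T   # singular values come sorted descending
    word_emb = u_k * s_k ** p                      # rows: k-dimensional word embeddings
    ctx_emb = v_k * s_k ** p                       # rows: k-dimensional context embeddings
    approx = (u_k * s_k) @ v_k.T                   # denoised, low-rank approximation A'
    return word_emb, ctx_emb, approx
```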
Word embeddings: SVD
1. Words as linear combinations of topics
• A’ = (U_k S_k) V_kᵀ, with U_k S_k an m × k matrix
• Rows of U_k S_k: word vectors
• Columns: linear weights => topics!
Let’s have a look at some of these topics…
Word embeddings: SOTA
1. How does this differ from:
• Word2vec
• GloVe
Must read:
Levy et al. “Neural word embedding as implicit matrix factorization.”
https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf
Levy et al. "Improving distributional similarity with lessons learned from word embeddings."
https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf
Great summary blog posts:
• http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/
• http://ruder.io/secret-word2vec/
• https://rare-technologies.com/making-sense-of-word2vec/
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
Word embeddings: Word2Vec
(source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Only 1 hidden layer, no activation function => Linear!
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
(Diagram: the word embedding matrix W and the context embedding matrix C, each with d-dimensional rows)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• What if you multiply them together?
(Diagram: what does the product W · Cᵀ give?)
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Levy et al. proved: for a large enough d and enough iterations,
W · Cᵀ converges to a shifted version of the word-context PPMI matrix!
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Optimal value of w · c = max(0.0, PMI(w, c) - log(k))
• k = number of negative samples
• log(k) acts as a prior/regularizer and is 0 if k=1
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Thus: Word2Vec factorizes the word-context co-occurrence matrix
• So does SVD!
- Although the matrices here are unconstrained (not orthogonal)
• What about GloVe?
Word embeddings: GloVe
1. GloVe
• Explicitly starts by calculating the co-occurrence matrix X
• Defines a cost function
- Predict the co-occurrence of two words based on their word embedding vectors
• Optimizes a weighted least-squares objective:
J = Σ_ij f(X_ij) (w_i · c_j + b_i + b_j - log X_ij)²
- The weighting function f decreases the importance of high co-occurrence counts
Word embeddings: GloVe
1. GloVe
• Thus: GloVe factorizes the (logarithm of the) word-context co-occurrence matrix
• So does SVD!
Word embeddings: Hyperparameters
1. Do Word2Vec and GloVe perform better than SVD?
• They all do matrix factorization
• But the literature seems to suggest ‘yes’?
• Levy et al. showed:
- Reformulating the prediction-based methods (Word2Vec, GloVe) in the count-based (SVD, PMI) framework reveals a bunch of new hyperparameters
- The performance difference is all in the hyperparameters!
Word embeddings: Hyperparameters
1. Preprocessing (word2vec)
1. Dynamic context windows
2. Subsampling
3. Deleting rare words
2. Postprocessing (GloVe)
1. Adding word and context vectors
2. Eigenvalue weighting
3. Vector normalization
3. Association Metric (skip-gram with negative sampling)
1. Shifted PPMI
2. Context distribution smoothing
Word embeddings: Hyperparameters
1. Preprocessing: Dynamic context windows
1. Assign higher weight to contexts closer to the target word
2. GloVe: Harmonic function
-> Weight = 1/N, where N is the number of tokens between the two words
3. Word2Vec:
-> Weight = (S - N + 1)/S, where S is the maximum window size
4. Why is this called a ‘dynamic context window’?
-> Word2Vec implements this weighting by randomly sampling the window size
Word embeddings: Hyperparameters
1. Preprocessing: Subsampling
1. With probability p, remove words with frequency f higher than threshold t
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Word embeddings: Hyperparameters
1. Preprocessing: Deleting rare words
1. Remove words that occur very infrequently in the training data
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Word embeddings: Hyperparameters
1. Postprocessing: Adding word and context vectors (e.g. GloVe)
• The resulting cosine distance:
- combines first-order similarities (w · c) and second-order similarities (w_x · w_y, c_x · c_y)
- (Proof: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
Word embeddings: Hyperparameters
1. Postprocessing: Eigenvalue weighting
1. Different ways to get word vectors W and context vectors C:
- W = U, C = V
- W = US, C = V
- W = US^(1/2), C = VS^(1/2)
2. In general:
- W = US^p, C = VS^p
Word embeddings: Hyperparameters
1. Postprocessing: Vector normalization
• Different ways to normalize W
1. L2-normalize the rows => word vectors are unit length
2. L2-normalize the columns
3. Both, or other norms
In practice, the first option usually works significantly better than the others
Word embeddings: Hyperparameters
1. Association metric: Shifted PPMI
• SPPMI = max(0.0, PMI(w, c)-log(k))
• k acts as a prior on the probability of observing a positive co-occurrence
• This is what skip-gram Word2Vec with negative sampling (k negative samples) implicitly optimizes
Word embeddings: Hyperparameters
1. Association metric: Context distribution smoothing
• PMI is biased towards rare words
• To alleviate this bias:
- Raise the context counts to the power α ≤ 1 (e.g. α = 0.75 seems to work well): P_α(c) = #(c)^α / Σ_c' #(c')^α
- This increases P_α(c) for rare words => reduces their PMI
• Implicitly done in Word2Vec during negative sampling
Word embeddings: Comparison
1. Levy et al. compared:
• PPMI
• SVD
• Word2Vec (skip-gram with negative sampling: SGNS)
• GloVe
• Use the same hyper parameters everywhere (if transferable)
• Evaluate different hyper parameter values
Word embeddings: Comparison
1. Levy et al. compared:
• 8 benchmarks
- 6 word similarity tasks
- 2 analogy tasks
Word embeddings: Comparison
1. Levy et al. conclusions:
• Word2Vec outperforms GloVe on every task
- If used with the same hyper parameters!
• SVD slightly outperforms Word2Vec sometimes
- If used with the same hyper parameters!
- On word similarity tasks
• Word2Vec slightly outperforms SVD sometimes
- If used with the same hyper parameters!
- On word analogy tasks
Word embeddings: Comparison
1. Nevertheless, GloVe/Word2Vec:
• Can be easier to train sometimes
• Hyper parameters are implicitly part of the algorithms
• Outperform SVD out-of-the-box
Main take-away:
• Mind the hyper parameters
• Quick experiments/results with SVD (or even PCA) can give good indications!
• If we had subtracted the means from all columns before SVD… this would just be PCA!
Latent Semantic Analysis
1. What if we have:
• Multiple documents (or e.g. users)
• A whole set of words per document (or e.g. user specific attributes)
2. And we want:
• Word embeddings
• User embeddings
Can we re-use the same SVD approach described earlier?
-> Remember: Input matrix A does not have to be square!
Latent Semantic Analysis
1. First step: Create some kind of co-occurrence matrix
2. But for term-documents this time
3. For each word, count how many times it occurs in each document
(Matrix: words × documents, or, in our case, venue-visit counts × users)
Latent Semantic Analysis
1. Normalize counts to decrease importance of words that occur in all docs
2. Typical way: Term Frequency Inverse Document Frequency (TF-IDF) matrix
3. Term frequency (double-normalized variant):
-> tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
-> K is often set to 0.4 or 0.5 to make sure small changes in the counts don’t cause big changes in the value
Latent Semantic Analysis
1. Term frequency: tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
2. Inverse document frequency: idf(t) = log(N / n_t), with N the number of documents and n_t the number of documents containing term t
3. tf-idf = tf · idf
(Matrix of tf-idf values: words × documents; many different variants exist, one of which is sketched below)
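A numpy sketch of this particular tf-idf variant (many others exist; the layout here is documents × words, the transpose of the slide’s matrix):

```python
import numpy as np

def tfidf(counts, K=0.5):
    """counts: documents x words matrix of raw term counts."""
    max_f = np.maximum(counts.max(axis=1, keepdims=True), 1)
    tf = K + (1.0 - K) * counts / max_f              # K-smoothed term frequency
    n_docs = counts.shape[0]
    doc_freq = np.maximum((counts > 0).sum(axis=0), 1)
    idf = np.log(n_docs / doc_freq)                  # inverse document frequency
    return tf * idf
```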
Latent Semantic Analysis
1. Truncated SVD: keep k dimensions
• A’ = U_k S_k V_kᵀ: the denoised, low-rank approximation (rows ~ words, columns ~ documents)
• Rows of U_k S_k: k-dimensional word embeddings
• Rows of V_k S_k: k-dimensional document embeddings
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
Latent Semantic Analysis
1. Our toy example:
• 5000 users
• For each user: count events
• Calculate tf-idf and apply SVD
• Goal: get user embeddings for look-alike modeling
Latent Semantic Analysis
1. Our toy example:
• Add info on time-of-day, duration and distance
- E.g. duration: log(duration_in_minutes)/log(1000)
- Considered at 1-digit precision
Latent Semantic Analysis
1. Our toy example:
• Similar for distance and time-of-day
• Append augmentations to suffix list
• Add all possible suffix combinations to vocabulary and increase count
• E.g. Car trip of 10 km in 8 minutes at 20h00:
- Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’]
- Increase term-frequency counts for:
– Car
– Car_dur:4
– Car_dur:4_dist:4
– Car_dur:4_dist:4_tod:20
– Car_dist:4
– Car_dist:4_tod:20
– Car_tod:20
– Car_tod:20_dur:4
Allows LSA to learn relations between these different features
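One way to generate these augmented terms with itertools (a sketch; it lists one combination in a slightly different order than the slide):

```python
from itertools import combinations

def augmented_terms(base, suffixes):
    terms = [base]
    for r in range(1, len(suffixes) + 1):
        for combo in combinations(suffixes, r):    # every non-empty suffix subset
            terms.append('_'.join([base] + list(combo)))
    return terms

print(augmented_terms('Car', ['dur:4', 'dist:4', 'tod:20']))
# ['Car', 'Car_dur:4', 'Car_dist:4', 'Car_tod:20', 'Car_dur:4_dist:4', ...]
```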
Latent Semantic Analysis
To visualize our toy example:
1. Check which segments we currently assign to these users
- Rule based segments
2. Randomly select two segments
3. Find a linear transformation of our feature space
- That maximizes the distance between samples which:
– Have segment 1
– Have segment 2
– Have segment 1 and 2
– Have neither of the two segments
-> Linear Discriminant Analysis
Latent Semantic Analysis
To quickly estimate the capabilities:
1. Balance classes (resample majority class)
2. Calculate user embeddings
3. Train a logistic regressor
- Treat current segment detections as ground truth
- 10-fold cross validation
- Binary classification
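A sketch of this quick evaluation with scikit-learn, assuming numpy arrays user_emb (the embeddings) and binary labels (the current segment detections) exist:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

pos, neg = user_emb[labels == 1], user_emb[labels == 0]
n = min(len(pos), len(neg))                         # downsample the majority class
pos = resample(pos, n_samples=n, replace=False)
neg = resample(neg, n_samples=n, replace=False)
X = np.vstack([pos, neg])
y = np.array([1] * n + [0] * n)

scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```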
Latent Semantic Analysis
1. What do these feature spaces capture?
2. Fun experiment:
1. We have ‘age’ data for 500 users.
2. Let’s see if different ages behave differently
3. If so: we have an age classifier, built in a few minutes’ time!
• Simple logistic regression
• 200D feature space
• 10-fold cross-validation
• Accuracy: 56%
• Recall: 55%
Immediate use-case:
Split up our parent segment (mom/dad)
• If we only trust predictions with prob > 0.7
• Then we only classify 70% of the users
• Accuracy: 70%
• Recall: 70%
Latent Semantic Analysis
1. How to project a new document into that space?
A = U S Vᵀ
=> Vᵀ = S^-1 U^-1 A = S^-1 Uᵀ A (U is orthogonal, so U^-1 = Uᵀ)
=> V = (S^-1 Uᵀ A)ᵀ = Aᵀ U S^-1
New document q: q’ = q U S^-1
So: we can project new users in an existing, pre-trained feature space!
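A one-function sketch of this fold-in step, reusing the truncated u_k and s_k from the earlier SVD sketch:

```python
def fold_in(q, u_k, s_k):
    """q: raw (tf-idf weighted) term vector of a new document/user."""
    return q @ u_k / s_k        # q' = q U S^-1: k-dimensional embedding
```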
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
4. W represents words as linear combinations of ‘topics’
5. D represents documents as a linear combination of ‘topics’
6. S represents the importance of each topic
Latent Semantic Analysis
1. Topics are ‘semantic dimensions’
• Uncorrelated (orthogonal) factors that can be linearly combined to obtain all terms
2. Topic analysis?
1. Each column of W is a topic
2. The value inside the cell shows how important a word is to that topic
3. Looking at the most important ones can help interpreting topic meanings
Latent Semantic Analysis
1. LSA basically finds the latent factors underlying the terms and documents:
(Diagram: documents are described by topics, and topics contain words, e.g. Documents 1-4 -> Topics 1-3 -> Words 1-5)
Latent Semantic Analysis
1. Remember:
1. We simply applied SVD
2. Exactly in the same way as for word embeddings
3. But now we also have document embeddings
2. Also remember:
• If we had subtracted the mean from all our columns beforehand…
• Then SVD would just be PCA! (you would get the same word embeddings)
Recommender systems: look-alike modeling
1. We have a user-item matrix with attributes
2. Some entries might be missing or noisy
-> Replace the few missing entries with their average and consider them noise
3. We now want to:
1. Calculate user-similarity. Which users have similar behavior?
2. Remove the noise from our original matrix (and fill/correct missing values)
Recommender systems: look-alike modeling
1. Solution:
• Treat users as ‘Documents’
• Treat user attributes as ‘term counts’
• Attributes have different meaning/units
- So we have to standardize them: use z-scores
• Apply SVD (LSA)
- To find user embeddings
- To get the most similar users
• Exactly the same as what we did for LSA!
Recommender systems: look-alike modeling
1. Truncated SVD: keep k dimensions
• A’ = U_k S_k V_kᵀ: the denoised, low-rank approximation (rows ~ users, columns ~ attributes)
• Rows of U_k S_k: k-dimensional user embeddings
• Rows of V_k S_k: k-dimensional item embeddings
Recommender systems: look-alike modeling
1. Note:
• If the ratings or attributes have different meanings and scales:
- Need to standardize (subtract mean, divide by std)
- In this case, SVD is exactly the same as PCA (to get the item embeddings)!
- But we get the user embeddings for free with SVD
Recommender systems: look-alike modeling
1. Reduce noise without reducing the dimensionality?
1. S contains the singular values on its diagonal
2. Set the k smallest singular values to 0
3. Now calculate A’ = U S Vᵀ
2. We now have cleaned estimates of the users’ attributes
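A numpy sketch of this denoising step (k_drop is the number of smallest singular values to zero out):

```python
import numpy as np

def denoise(A, k_drop):
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    s[len(s) - k_drop:] = 0.0    # singular values are sorted descending
    return (u * s) @ vt          # cleaned estimate A' = U S V^T
```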
Recommender systems: look-alike modeling
1. Input: user attributes (‘Sentiance attributes table’)
1. Filtered out non-numeric columns
2. Kept all numeric columns named *.value
3. Except for ‘has_home.value’ and ‘is_stable.value’
2. To evaluate the performance:
1. Similar exercise as before
2. Treat current segment detections as ground truth
3. This time we use 90k users (versus 5k users before)
This is just the beginning...
1. Non-linear layers
• E.g. LSTM encoders
-> Careful with non-linearities if you want a Euclidean space!
2. Self-supervised learning
• E.g. triplet-loss optimization
-> Explicitly optimizes to find a Euclidean metric space!
3. Graph embeddings
• E.g. ‘deep city’
-> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector
Let’s keep that for another talk.
Conclusion
1. Embeddings:
1. Are extremely powerful
2. Allow us to cope with noisy, unlabeled input data
3. Are not magic
- You might need to fine-tune them for specific tasks (e.g. word embeddings for sentiment analysis versus analogy inference)
- They may not capture differences caused by rare events
2. Often you can just start with:
1. PCA
2. SVD