Sentiance Confidential 2017
Matrix Factorization
Demystifying word embeddings
Monthly data science team meeting // journal club
Vincent Spruyt
November 2017
Sentiance Confidential 2017
Goal of this talk
1. Build an intuition for how the following concepts are related
… and realize that in many situations they are even exactly the same!
2. Realize how many of these simple concepts can easily be applied at Sentiance
(we will discuss some potential applications)
PCA
SVD
LSA
word2vec
Glove
look-alike modeling
Sentiance Confidential 2017
Introduction: embeddings
1. What we have:
• Lots of unlabeled data
2. What we want:
• Discover hidden structure in the data
• Underlying ‘causes’ of the data
… and do this without any labels
Sentiance Confidential 2017
Introduction: embeddings
1. E.g.
• Observed timelines:
• User 1: [home, car, work]
• User 2: [home, car, shop, car, work]
• User 3: [home, train, work]
Question: Is user 1 more similar to user 2 or to user 3?
Sentiance Confidential 2017
Introduction: embeddings
1. E.g.
• Observed timelines:
- User 1: [home, car, work]
- User 2: [home, car, shop, car, work]
- User 3: [home, train, work]
• Latent factors:
- Factor 1: [commute]
- Factor 2: [shopping]
- Factor 3: [green transport]
- Factor 4: [non-green transport]
All observations can be explained
by a combination of these factors
Sentiance Confidential 2017
Introduction: embeddings
1. E.g.
• Observed timelines:
- User 1: [home, car, work]
- User 2: [home, car, shop, walk, work]
- User 3: [home, train, work]
• Latent factors:
- Factor 1: [commute]
- Factor 2: [shopping]
- Factor 3: [green transport]
- Factor 4: [non-green transport]
User 1 = 1*factor1 + 0*factor2 + 0*factor3 + 1*factor4
User 2 = 1*factor1 + 1*factor2 + 0.5*factor3 + 0.5*factor4
User 3 = 1*factor1 + 0*factor2 + 1.0*factor3 + 0*factor4
Sentiance Confidential 2017
Introduction: embeddings
User 1 = 1*factor1 + 0*factor2 + 0*factor3 + 1*factor4
User 2 = 1*factor1 + 1*factor2 + 0.5*factor3 + 0.5*factor4
User 3 = 1*factor1 + 0*factor2 + 1.0*factor3 + 0*factor4
Embeddings:
User1 = [1.0, 0.0, 0.0, 1.0]
User2 = [1.0, 1.0, 0.5, 0.5]
User3 = [1.0, 0.0, 1.0, 0.0]
Sentiance Confidential 2017
Introduction: embeddings
1. Embeddings define coordinates: E.g. 3D embeddings:
User1 = [1.0, 0.0, 0.0]
User2 = [1.0, 1.0, 0.5]
User3 = [1.0, 0.0, 1.0]
[Figure: the three users plotted as points in a 3D embedding space, with axes including 'commute' and 'shopping' and coordinates ranging from (0,0,0) to (1,1,1)]
Sentiance Confidential 2017
Introduction: embeddings
1. User similarity?
• Measure distance in our feature space
A=[0.3, 1.0]
B=[1.0, 0.3]
distance(a,b) = ?
Sentiance Confidential 2017
Introduction: embeddings
1. User similarity?
• Measure distance in our feature space
A=[0.2, 1.0]
B=[1.0, 0.3]
distance(a,b) = ?
Remember Pythagoras? ;)
(legs of the triangle: Δx = 0.8, Δy = 0.7)
Sentiance Confidential 2017
Introduction: embeddings
1. User similarity?
• Measure distance in our feature space
A=[0.2, 1.0]
B=[1.0, 0.3]
distance(a,b) = sqrt(0.8*0.8 + 0.7*0.7) = 1.06
Remember Pythagoras? ;)
(legs of the triangle: Δx = 0.8, Δy = 0.7)
Sentiance Confidential 2017
Introduction: embeddings
1. User similarity?
E.g. 3D embeddings:
User1 = [1.0, 0.0, 0.0]
User2 = [1.0, 1.0, 0.5]
User3 = [1.0, 0.0, 1.0]
[Figure: the three users plotted as points in a 3D embedding space, with axes including 'commute' and 'shopping' and coordinates ranging from (0,0,0) to (1,1,1)]
Sentiance Confidential 2017
Introduction: embeddings
1. Problem with Euclidean distance
• Not scale invariant
A=[0.3, 1.0]
B=[1.0, 0.3]
A’=[3.0, 10.0]
B’=[10.0, 3.0]
C’=[1.0, 2.0]
C=[0.1, 0.2]
Sentiance Confidential 2017
Introduction: embeddings
1. Solution
• Cosine distance
A=[0.2, 1.0]
B=[1.0, 0.3]
cos(θ), where θ is the angle between the two vectors
-> Invariant to scale!
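Below is a minimal numpy sketch of this comparison (A and B are the toy vectors from these slides; everything else is illustrative):

import numpy as np

a = np.array([0.2, 1.0])
b = np.array([1.0, 0.3])

# Euclidean distance: sensitive to the scale of the vectors
euclidean = np.linalg.norm(a - b)                                   # sqrt(0.8^2 + 0.7^2) ~ 1.06

# Cosine distance: 1 - cos(theta), invariant to scaling
cosine = 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Scaling a by 10 changes the Euclidean distance but not the cosine distance
a10 = 10 * a
print(np.linalg.norm(a10 - b))
print(1.0 - a10.dot(b) / (np.linalg.norm(a10) * np.linalg.norm(b)))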
Sentiance Confidential 2017
Introduction: embeddings
1. Thus, our goal:
• Find a metric space that:
- captures semantic similarity between data points in a distance metric
- captures hidden patterns
– aka semantic meaning
– aka latent factors
- Each observation = linear combination of these latent factors
– obs1 = a*factor1 + b*factor2 + c*factor3
– obs2 = d*factor1 + e*factor2 + f*factor3
– obs3 = g*factor1 + h*factor2 + i*factor3
…
Sentiance Confidential 2017
Introduction: embeddings
1. Matrix notation:
- Each observation = linear combination of these latent factors
– obs1 = a*factor1 + b*factor2 + c*factor3
– obs2 = d*factor1 + e*factor2 + f*factor3
– obs3 = g*factor1 + h*factor2 + i*factor3
Sentiance Confidential 2017
Introduction: embeddings
1. Final goal:
• Given the observation matrix:
- Find the optimal factors
AND
- Find the optimal weight matrix
Sentiance Confidential 2017
Introduction: embeddings
1. Final goal:
• Given the observation matrix:
- Find the optimal factors
AND
- Find the optimal weight matrix
• Two approaches:
- Matrix factorization (traditional approach)
- Machine learning (state-of-the-art)
-> Which works better?
-> Are they related?
Sentiance Confidential 2017
Introduction: embeddings
1. Final goal:
• Given the observation matrix:
- Find the optimal factors
AND
- Find the optimal weight matrix
• Two approaches:
- Matrix factorization (traditional approach)
- Machine learning (state-of-the-art)
-> Which works better?
-> Are they related?
That is the topic of this talk!
Sentiance Confidential 2017
Outline
1. Word embeddings: count vectors
2. Word embeddings: PPMI
3. Word embeddings: PCA
4. Word embeddings: SVD
5. Word embeddings: Word2Vec
6. Word embeddings: Glove
7. User Embeddings: LSA
8. User Embeddings: Collaborative filtering
Python code available!
Sentiance Confidential 2017
Word embeddings: Toy example
• Input: user timelines of stationary visits
• Flatten the type topology (don’t care about hierarchy for now)
• Small set of 3384 random users
• Total number of events: 789,043
• Vocabulary (event type) size: 125
E.g. [u'work', u'home', u'work', u'travel_busstop', u'work', u'home', u'travel_busstop', u'home',
u'shop_supermarket', u'work', u'travel_parking', u'travel_expensive', u'shop', u'shop',
u'travel_station', …]
Sentiance Confidential 2017
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
Sentiance Confidential 2017
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
context window
Sentiance Confidential 2017
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
context window
busstop work home supermarket parking
home 2 1 1 0 0
Sentiance Confidential 2017
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
• E.g. window_size = 5
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
context window
busstop work home supermarket parking
home 2 1 1 0 0
busstop 0 1 2 1 0
Sentiance Confidential 2017
Word embeddings: count vectors
1. Little trick:
• Weigh context words by distance to target word
[work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
context window
busstop work home supermarket parking
home 1.5 1 0.5 0 0
busstop 0 0.5 2 0.5 0
Count weights (1 / distance to the target word): 1/2, 1/1, [target], 1/1, 1/2
Later more on this trick and others
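A minimal Python sketch of this weighted counting, assuming a symmetric window and weight 1/distance as in the table above (the function name and the short timeline are illustrative):

from collections import defaultdict

def cooccurrence_counts(timeline, window_size=5):
    # Distance-weighted co-occurrence counts over a symmetric context window
    half = window_size // 2
    counts = defaultdict(float)
    for i, target in enumerate(timeline):
        lo, hi = max(0, i - half), min(len(timeline), i + half + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, timeline[j])] += 1.0 / abs(i - j)   # weight = 1 / distance
    return counts

timeline = ['work', 'home', 'work', 'travel_busstop', 'work', 'home',
            'travel_busstop', 'home', 'shop_supermarket', 'work', 'travel_parking']
print(cooccurrence_counts(timeline)[('home', 'travel_busstop')])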
Sentiance Confidential 2017
Word embeddings: count vectors
1. Word vectors:
• home = [1.5, 1.0, 0.5, 0.0, 0.0]
• busstop = [0.0, 0.5, 2.0, 0.5, 0.0]
• Distance between word vectors is
- almost independent from their co-occurrence
- almost completely determined by the similarity of the context they appear in
Sentiance Confidential 2017
Word embeddings: count vectors
1. Define context window (e.g. bi-grams)
2. Count how many times each word co-occurs in a context window
3. Vector dimensionality |v| ~ vocabulary size (which grows with the corpus)
shop_supermarket travel_aerodrome drinks_pub travel_campsite sport_golf shop_marketplace leisure_theatre travel_station
shop_supermarket 12725 399 874 205 58 70 72 1425
travel_aerodrome 398 9274 55 11 1 9 8 416
drinks_pub 873 55 959 37 7 5 18 289
travel_campsite 202 11 37 762 1 1 2 16
sport_golf 57 1 7 1 1212 1 0 8
(Sparse, high dimensional) word embeddings!
Sentiance Confidential 2017
Word embeddings: count vectors
1. Let’s visualize in 2D using a manifold projection
• Isomap (just because it looked nicer than t-SNE in this case)
• Finds 2D space that maintains local distances
2. Normalize each vector to unit length first
• We are not interested in vector length (depends on vocabulary size)
• Divide by L2-norm
=> Euclidean distance ~ cosine distance
Sentiance Confidential 2017
Word embeddings: count vectors
[Figure: Isomap 2D projection of the normalized count-vector word embeddings]
Sentiance Confidential 2017
Word embeddings: PPMI
1. Problem: Highly dominated by words that occur a lot
2. Solution: Pointwise Mutual Information
3. ‘How much more do we observe (w,c) compared to random chance?’
• If w and c are completely uncorrelated: P(w,c) = P(w)P(c)
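For reference, the standard PMI definition that this slide refers to:

\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}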
Sentiance Confidential 2017
Word embeddings: PPMI
1. Problem: Highly dominated by words that occur a lot
2. Solution: Pointwise Mutual Information
3. Problem with PMI: Negative values often noisy
4. Solution: Set to zero
Sentiance Confidential 2017
Word embeddings: PPMI
1. If fij are the co-occurrence counts, we have W (words) rows, and C
(contexts) columns:
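Since the slide's original formula image is not available, here is the standard formulation of these estimates and of the resulting PPMI matrix:

\hat{P}(w_i, c_j) = \frac{f_{ij}}{\sum_{k,l} f_{kl}}, \qquad
\hat{P}(w_i) = \frac{\sum_j f_{ij}}{\sum_{k,l} f_{kl}}, \qquad
\hat{P}(c_j) = \frac{\sum_i f_{ij}}{\sum_{k,l} f_{kl}}

\mathrm{PPMI}_{ij} = \max\!\left(0,\; \log \frac{\hat{P}(w_i, c_j)}{\hat{P}(w_i)\,\hat{P}(c_j)}\right)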
Sentiance Confidential 2017
Word embeddings: PPMI
1. Define context window (e.g. skip-gram)
2. Count how many times each word co-occurs in the context window
3. Vector dimensionality |v| ~ vocabulary size (which grows with the corpus)
shop_supermarket travel_aerodrome drinks_pub travel_campsite sport_golf shop_marketplace leisure_theatre travel_station
shop_supermarket 0.24 0.0 0.0 0.29 0.0 0.16 0.0 0.0
travel_aerodrome 0.0 5.51 0.0 0.0 0.0 0.07 0.0 0.15
drinks_pub 0.0 0.0 2.64 0.89 0.0 0.0 0.41 0.0
travel_campsite 0.27 0.0 0.89 8.19 0.0 0.63 0.14 0.0
sport_golf 0.0 0.0 0.0 0.0 8.86 0.63 0.0 0.0
(Sparse, high dimensional) word embeddings!
Sentiance Confidential 2017
Word embeddings: PPMI
1. Problem 1: Huge dimensionality (e.g. 50,000-D embedding vectors)
• Distances in large dimensions become meaningless
• Everything is almost equally far away from everything else
Sentiance Confidential 2017
Word embeddings: PPMI
1. Problem 1: Huge dimensionality (e.g. 50,000-D embedding vectors)
• Distances in large dimensions become meaningless
• Everything is almost equally far away from everything else
Sentiance Confidential 2017
Word embeddings: PPMI
1. Problem 1: Huge dimensionality (e.g. 50,000-D embedding vectors)
• Distances in large dimensions become meaningless
• Everything is almost equally far away from everything else
2. Problem 2: Very sparse matrix
• Inefficient. Can we compress (turn into dense matrix)?
• Aggravates the high dimensional distance issue
3. Problem 3: Counts are very noisy
• Can we denoise the matrix?
• Can we transfer knowledge from other co-occurrence counts?
Sentiance Confidential 2017
Word embeddings: PPMI
1. Solution:
• Find low-rank approximation of PPMI matrix
- Uncorrelated columns
-> Orthogonal base
- Dense
- De-noised
- Low-dimensional (when decomposed using matrix factorization)
• => Compression + noise reduction
Sentiance Confidential 2017
Word embeddings: PCA
1. Idea: PCA!
2. New columns: linear combinations of old columns
3. De-correlated dimensions => Low rank!
• X’ = V X gives the reduced-dimensionality representation (X is the input data)
• X’’ = Vᵀ X’ gives the reconstructed low-rank approximation
4. Keeps dimensions of largest variance => Noise reduction!
• Assuming that the noise has smaller variance than the information
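A minimal scikit-learn sketch of this idea; `ppmi` stands in for the (words x contexts) PPMI matrix from the previous step, and k is an arbitrary choice:

import numpy as np
from sklearn.decomposition import PCA

ppmi = np.random.rand(125, 125)                 # placeholder for the real (words x contexts) PPMI matrix

k = 20
pca = PCA(n_components=k)
word_embeddings = pca.fit_transform(ppmi)       # X' = projection onto the top-k principal components
reconstruction = pca.inverse_transform(word_embeddings)   # X'' = low-rank approximation of the PPMI matrix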
Sentiance Confidential 2017
Word embeddings: PCA
1. Problems:
1. PCA breaks down if data is not normally distributed
1. Word co-occurrences are all positive, so not Gaussian
=> No guarantee that eigenvectors capture largest variance
Sentiance Confidential 2017
Word embeddings: PCA
1. Problems:
2. PCA requires centering the data (subtract mean)
1. Sparseness lost => Computationally more complex
2. Angle between vectors not preserved => Cosine distance?
θ = 90°
Cosine distance = 1 - cos(θ)
with u·v = ||u||₂ ||v||₂ cos(θ)
and cos(90°) = 0.0
=> Cosine distance = 1.0
Sentiance Confidential 2017
Word embeddings: PCA
1. Problems:
2. PCA requires centering the data (subtract mean)
1. Sparseness lost => Computationally more complex
2. Angle between vectors not preserved => Cosine distance?
θ = 180°
Cosine distance = 1 - cos(θ)
with u·v = ||u||₂ ||v||₂ cos(θ)
and cos(180°) = -1.0
=> Cosine distance = 2.0
Sentiance Confidential 2017
Word embeddings: SVD
1. Forget about capturing largest variance
2. Given co-occurrence matrix A
• Find low-rank approximation Ak
• That minimizes ||A – Ak||F
Note: the Frobenius norm ||A||F of a matrix A is simply the Euclidean norm of the vector containing all its elements: ||A||F = sqrt(Σᵢⱼ Aᵢⱼ²)
Sentiance Confidential 2017
Word embeddings: SVD
1. Forget about capturing largest variance
2. Given co-occurrence matrix A
• Find low-rank approximation Ak
• That minimizes ||A – Ak||F
3. Optimal solution:
• Singular Value Decomposition (SVD)
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: A = U S Vᵀ, with r = rank(A) (only the first r singular values are non-zero)
A: m x n matrix, U: m x m matrix, S: m x n matrix, Vᵀ: n x n matrix
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: A = U S Vᵀ, with r = rank(A)
A: m x n matrix, U: m x m matrix, S: m x n matrix, Vᵀ: n x n matrix
For skip-gram PMI,
m=n and A is symmetric.
But let’s look at the general case here
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: A = U S Vᵀ, with r = rank(A)
A: m x n matrix, U: m x m matrix, S: m x n matrix, Vᵀ: n x n matrix
Columns of U: eigenvectors of AAᵀ
Columns of V: eigenvectors of AᵀA
Diagonal of S: singular values (square roots of the eigenvalues of AAᵀ)
Remember: if the data is centered, cov(A) ∝ AᵀA and cov(Aᵀ) ∝ AAᵀ
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: A = U S Vᵀ
A: m x n matrix, U: m x m matrix, S: m x n matrix, Vᵀ: n x n matrix
Columns of U: eigenvectors of AAᵀ; columns of V: eigenvectors of AᵀA; S: singular values
If the data is centered: US = PCA(A)
Sentiance Confidential 2017
Word embeddings: SVD
S is diagonal
=> A = (US)Vᵀ
and
=> A = U(SVᵀ)
1. First case:
Express the rows of A as linear combinations of the rows of Vᵀ, with (US) as coefficients
2. Second case:
Express the columns of A as linear combinations of the columns of U, with (SVᵀ) as coefficients
Sentiance Confidential 2017
Word embeddings: SVD
S is diagonal, let D = S^(1/2) => A = U D D Vᵀ
=> A = (UD)(DᵀVᵀ)
and
=> A = (UD)(VD)ᵀ
1. First case:
Word embeddings: UD
2. Second case:
Context embeddings: VD
- Equal in case of skip-gram word-contexts
- Because in that case A is symmetric => U==V
- Not equal in other cases
Remember:
• (AB)ᵀ = BᵀAᵀ
Sentiance Confidential 2017
Word embeddings: SVD
1. We now have low-rank word embeddings
-> Word ~ linear combination of decorrelated (orthogonal) factors
2. Still high dimension (~ vocabulary size) and noisy
-> Next step: dimensionality reduction
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: Keep k dimensions
A’ = U S Vᵀ, keeping only the top k singular values
A’: m x n matrix (rows = words, columns = contexts)
U: m x k matrix, S: k x k matrix, Vᵀ: k x n matrix
Sentiance Confidential 2017
Word embeddings: SVD
1. SVD: Keep k dimensions
A’ = U S Vᵀ, keeping only the top k singular values
A’: m x n denoised, low-rank approximation (rows = words, columns = contexts)
U: m x k matrix => rows are k-dimensional word embeddings
Vᵀ: k x n matrix => columns are k-dimensional context embeddings
S: k x k diagonal matrix of singular values
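A minimal numpy sketch of this truncation; `ppmi` again stands in for the (words x contexts) matrix, and the choice W = US / C = VS follows the later LSA slides:

import numpy as np

ppmi = np.random.rand(125, 125)      # placeholder for the (words x contexts) PPMI matrix
U, s, Vt = np.linalg.svd(ppmi, full_matrices=False)

k = 20
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

word_embeddings = U_k * s_k          # rows: k-dimensional word embeddings (W = U S)
context_embeddings = Vt_k.T * s_k    # rows: k-dimensional context embeddings (C = V S)
A_prime = (U_k * s_k) @ Vt_k         # denoised, low-rank approximation A'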
Sentiance Confidential 2017
[Figure: Isomap projection of the SVD-based word embeddings, with clusters labeled: home-related shopping activities, public transport, health, sport, restaurants, holiday, office]
Sentiance Confidential 2017
Word embeddings: SVD
1. Words as linear combination of topics
Word matrix W = U S (an m x k matrix = (m x k matrix) x (k x k matrix); rows = words)
Rows of W: word vectors
Columns of W: linear weights of each word on a latent dimension => topics!
Let’s have a look at some of these topics…
Sentiance Confidential 2017
Word embeddings: SVD
1. Topics as linear combinations of words
Topic 1: (tourism?)
1.59 * leisure_garden
0.95 * leisure_themepark
0.77 * leisure_playground
0.75 * leisure_museum
0.64 * leisure
Topic 2: (home-shops?)
0.74 * shop_wholesale
0.73 * shop_pet
0.60 * shop_furniture
0.60 * shop_varietystore
0.57 * shop_hardware
Topic 3: (holidays?)
0.94 * travel_terminal
0.88 * travel_aerodrome
0.84 * leisure_playground
0.77 * sport_swimmingpool
0.57 * travel_caravansite
Topic 4: (holiday-shops?)
1.59 * shop_hobby
0.79 * shop_food
0.75 * travel_ferryterminal
0.65 * shop_retail
0.62 * leisure_marina
Topic 5: (office/health?)
0.63 * restaurant_french
0.58 * office_library
0.56 * health_doctors
0.56 * health_dentist
0.56 * office_townhall
Topic 6: (sport-ish?) ¯_(ツ)_/¯
0.83 * shop_alcohol
0.63 * leisure_naturereserve
0.56 * shop_hairdresser
0.55 * sport_pitch
0.53 * sport_stadium
Sentiance Confidential 2017
Word embeddings: SOTA
1. How does this differ from:
• Word2vec
• Glove
Must read:
Levy et al. “Neural word embedding as implicit matrix factorization.”
https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf
Levy et al. "Improving distributional similarity with lessons learned from word embeddings."
https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf
Great summary blog posts:
• http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/
• http://ruder.io/secret-word2vec/
• https://rare-technologies.com/making-sense-of-word2vec/
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
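For reference, the per-pair objective that SGNS optimizes, as formulated by Levy et al. (k negative contexts c_N are drawn from the context distribution P_D):

\ell(w, c) = \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right]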
Sentiance Confidential 2017
Word embeddings: Word2Vec
(source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Sentiance Confidential 2017
Word embeddings: Word2Vec
10000x300 matrix W
300x10000 matrix C
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Only 1 hidden layer, no activation function => Linear!
• Learns word embedding matrix W
• Learns auxiliary context embedding matrix C
[Diagram: word embedding matrix W (|V| x d) and context embedding matrix C (d x |V|)]
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• What if you multiply them together?
[Diagram: what does the product of W and C correspond to?]
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Levy et al. proved: for large enough d and enough iterations,
-> the product of W and C is a shifted version of the word-context PPMI matrix!
[Diagram: the product of W and C equals the shifted word-context PPMI matrix]
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Optimal value of w·c = max(0.0, PMI(w, c) - log(k))
• k = number of negative samples
• log(k) acts as a prior/regularizer and is 0 if k=1
[Diagram: the product of W and C equals the shifted word-context PPMI matrix]
Sentiance Confidential 2017
Word embeddings: Word2Vec
1. Word2Vec (skip-gram with negative sampling)
• Thus: Word2Vec factorizes the word-context co-occurrence matrix
• So does SVD!
- Although the matrices here are unconstrained (not orthogonal)
• What about Glove?
Sentiance Confidential 2017
Word embeddings: Glove
1. Glove
• Explicitly start by calculating the co-occurrence matrix X
• Define cost function
- Predict co-occurrence of two words based on their word embedding vectors
• Optimize a weighted least-squares objective
The weighting function decreases the importance of very high co-occurrence counts
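For reference, the GloVe objective sketched above (taken from the GloVe paper; w̃_j and b̃_j denote the context vectors and biases):

J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}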
Sentiance Confidential 2017
Word embeddings: Glove
1. Glove
• Thus: Glove factorizes the (logarithm of the) word-context co-occurrence
matrix
• So does SVD!
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Do Word2Vec and Glove perform better than SVD?
• They all do matrix factorization
• But literature seems to suggest ‘yes’?
• Levy et al. showed:
- Reformulating prediction based (Word2Vec, Glove) methods into count
based (SVD, PMI) framework reveals a bunch of new hyper parameters
- Performance difference is all in the hyper parameters!
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Preprocessing (word2vec)
1. Dynamic context windows
2. Subsampling
3. Deleting rare words
2. Postprocessing (Glove)
1. Adding word and context vectors
2. Eigenvalue weighting
3. Vector normalization
3. Association Metric (skip-gram with negative sampling)
1. Shifted PPMI
2. Context distribution smoothing
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Preprocessing: Dynamic context windows
1. Assign higher weight to contexts closer to the target word
2. Glove: harmonic weighting
-> Weight = 1/d, where d is the distance (in tokens) between the two words
3. Word2Vec:
-> Weight ~ (S - d + 1)/S, where S is the window size and d the distance
4. Why called ‘dynamic context window’?
-> Word2Vec implements this by random sampling the window size
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Preprocessing: Subsampling
1. With probability p, remove words with frequency f higher than threshold t
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Preprocessing: Deleting rare words
1. Remove words that occur very infrequently in the training data
2. Do this *before* creating the context windows
-> This implicitly increases the window size!
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Postprocessing: Adding word and context vectors (e.g. Glove)
• Resulting cosine distance:
- combines first-order similarities (w·c) and second-order similarities (wx·wy, cx·cy)
- (Proof: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Postprocessing: Eigenvalue weighting
1. Different ways to get word vectors W and context vectors C:
- W = U, C = V
- W = US, C = V
- W = US1/2, C = VS1/2
2. In general:
- W = USp, C = VSp
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Postprocessing: Vector normalization
• Different ways to normalize W
1. L2-normalize the rows => word vectors are unit length
2. L2-normalize the columns
3. Both, or other norms
In practice, first option usually works significantly better than others
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Association metric: Shifted PPMI
• SPPMI = max(0.0, PMI(w, c)-log(k))
• k acts as a prior on the probability of observing a positive co-occurrence
• This is what skip-gram Word2Vec with negative sampling (k negative
samples) optimizes implicitly
Sentiance Confidential 2017
Word embeddings: Hyperparameters
1. Association metric: Context distribution smoothing
• PMI is biased towards rare words
• To alleviate this bias:
- Raise context counts to the power α ≤ 1 (e.g. α = 0.75 seems to work well)
- This increases P_α(c) for rare words => reduces their PMI
• Implicitly done in Word2Vec during negative sampling
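A minimal numpy sketch combining shifted PPMI with context-distribution smoothing; `counts` stands in for the (words x contexts) co-occurrence matrix, and the parameter values mirror the slides:

import numpy as np

def shifted_smoothed_ppmi(counts, k=5, alpha=0.75):
    # SPPMI with context-distribution smoothing: max(0, PMI_alpha(w, c) - log k)
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    smoothed = counts.sum(axis=0) ** alpha          # raise context counts to the power alpha
    p_c = smoothed / smoothed.sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                    # zero out log(0) cells
    return np.maximum(0.0, pmi - np.log(k))

counts = np.random.randint(0, 50, size=(125, 125)).astype(float)
sppmi = shifted_smoothed_ppmi(counts)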
Sentiance Confidential 2017
Word embeddings: Comparison
1. Levy et al. compared:
• PPMI
• SVD
• Word2Vec (skip-gram with negative sampling: SGNS)
• Glove
• Use the same hyper parameters everywhere (if transferable)
• Evaluate different hyper parameter values
Sentiance Confidential 2017
Word embeddings: Comparison
1. Levy et al. compared:
• 8 benchmarks
- 6 word similarity tasks
- 2 analogy tasks
Sentiance Confidential 2017
Word embeddings: Comparison
1. Levy et al. conclusions:
• Word2Vec outperforms Glove on every task
- If used with the same hyper parameters!
• SVD slightly outperforms Word2Vec sometimes
- If used with the same hyper parameters!
- On word similarity tasks
• Word2Vec slightly outperforms SVD sometimes
- If used with the same hyper parameters!
- On word analogy tasks
Sentiance Confidential 2017
Word embeddings: Comparison
1. Nevertheless, Glove/Word2Vec:
• Can be easier to train sometimes
• Hyper parameters are implicitly part of the algorithms
• Outperform SVD out-of-the-box
Main take-away:
• Mind the hyper parameters
• Quick experiments/results with SVD (or even PCA) can give good indications!
• If we had extracted the means from all columns before SVD… this would just be PCA!
Sentiance Confidential 2017
Latent Semantic Analysis
1. What if we have:
• Multiple documents (or e.g. users)
• A whole set of words per document (or e.g. user specific attributes)
2. And we want:
• Word embeddings
• User embeddings
Can we re-use the same SVD approach described earlier?
-> Remember: Input matrix A does not have to be square!
Sentiance Confidential 2017
Latent Semantic Analysis
1. First step: Create some kind of co-occurrence matrix
2. But for term-documents this time
3. For each word, count how many times it occurs in each document
words
Documents
Venuevisitcounts
Users
or
Sentiance Confidential 2017
Latent Semantic Analysis
1. Normalize counts to decrease importance of words that occur in all docs
2. Typical way: Term Frequency Inverse Document Frequency (TF-IDF) matrix
3. Term frequency, e.g. the double-normalization variant: tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
-> K is often set to 0.4 or 0.5 to make sure small changes to the raw counts don’t cause big changes in value
Sentiance Confidential 2017
Latent Semantic Analysis
1. Term frequency: tf(t, d) = K + (1 - K) · f(t, d) / max_t' f(t', d)
2. Inverse document frequency: idf(t) = log(N / n_t), with N the number of documents and n_t the number of documents containing term t
3. tf-idf = tf · idf
[Diagram: words x documents matrix of tf-idf counts (many different variants exist)]
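A minimal scikit-learn sketch of this LSA pipeline (note that TfidfVectorizer uses a slightly different tf-idf variant than the K-smoothed one above; the timelines and parameters are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# One 'document' per user: the timeline flattened into a string of event tokens
timelines = ['work home work travel_busstop work home',
             'home train work home shop_supermarket',
             'home car shop_supermarket car work']

tfidf = TfidfVectorizer(token_pattern=r'\S+')
X = tfidf.fit_transform(timelines)            # (n_users, n_terms) tf-idf matrix

svd = TruncatedSVD(n_components=2)
user_embeddings = svd.fit_transform(X)        # (n_users, k) document/user embeddings
word_embeddings = svd.components_.T           # (n_terms, k) term embeddings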
Sentiance Confidential 2017
Latent Semantic Analysis
1. SVD: Keep k dimensions
A’ = U S Vᵀ, keeping only the top k singular values
A’: m x n denoised, low-rank approximation (rows = words, columns = documents)
U: m x k matrix => rows are k-dimensional word embeddings
Vᵀ: k x n matrix => columns are k-dimensional document embeddings
Sentiance Confidential 2017
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• 5000 users
• For each user: count events
• Calculate tf-idf and apply SVD
• Goal: get user embeddings for look-a-like modeling
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• Add info on time-of-day, duration and distance
- log(duration_in_minutes)/log(1000)
- Consider 1 digit precision
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• Add info on time-of-day, duration and distance
- log(duration_in_minutes)/log(1000)
- Consider 1 digit precision
Buckets:
• 1: 1-2 minutes
• 2: 2-4 minutes
• 3: 4-8 minutes
• 4: 8-16 minutes
• 5: 16-31 minutes
• 6: 31-63 minutes
• 7: 63-126 minutes
• 8: 126-251 minutes
• 9: 251-501 minutes
• 10: 501-1000 minutes
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• Add info on duration
- log(duration_in_minutes)/log(1000)
- Consider 1 digit precision
Buckets:
• 1: 1-2 minutes
• 2: 2-4 minutes
• 3: 4-8 minutes
• 4: 8-16 minutes
• 5: 16-31 minutes
• 6: 31-63 minutes
• 7: 63-126 minutes
• 8: 126-251 minutes
• 9: 251-501 minutes
• 10: 501-1000 minutes
• shop_4 => Shop visit of 8-16 minutes
• car_7 => Car trip of 63-126 minutes
• …
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• Similar for distance and time-of-day
• Append augmentations to suffix list
• Add all possible suffix combinations to vocabulary and increase count
• E.g. Car trip of 10 km in 8 minutes at 20h00:
- Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’]
- Increase term-frequency counts for:
– Car
– Car_dur:4
– Car_dur:4_dist:4
– Car_dur:4_dist:4_tod:20
– Car_dist:4
– Car_dist:4_tod:20
– Car_tod:20
– Car_tod:20_dur:4
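A small sketch of this augmentation step; the helper name and token format are illustrative, and the combination order may differ slightly from the slide's example:

from itertools import combinations

def augmented_tokens(base, suffixes):
    # E.g. base='car', suffixes=['dur:4', 'dist:4', 'tod:20'] -> base plus all suffix combinations
    tokens = [base]
    for r in range(1, len(suffixes) + 1):
        for combo in combinations(suffixes, r):
            tokens.append(base + '_' + '_'.join(combo))
    return tokens

print(augmented_tokens('car', ['dur:4', 'dist:4', 'tod:20']))
# ['car', 'car_dur:4', 'car_dist:4', 'car_tod:20', 'car_dur:4_dist:4', ..., 'car_dur:4_dist:4_tod:20']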
Sentiance Confidential 2017
Latent Semantic Analysis
1. Our toy example:
• Similar for distance and time-of-day
• Append augmentations to suffix list
• Add all possible suffix combinations to vocabulary and increase count
• E.g. Car trip of 10 km in 8 minutes at 20h00:
- Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’]
- Increase term-frequency counts for:
– Car
– Car_dur:4
– Car_dur:4_dist:4
– Car_dur:4_dist:4_tod:20
– Car_dist:4
– Car_dist:4_tod:20
– Car_tod:20
– Car_tod:20_dur:4
=> Allows LSA to learn relations between these different features
Sentiance Confidential 2017
Latent Semantic Analysis
[Figure: 2D visualization of the resulting user embeddings]
Sentiance Confidential 2017
Latent Semantic Analysis
[Figure: 2D visualization of the user embeddings; no obvious structure ¯\_(ツ)_/¯ ?]
Sentiance Confidential 2017
Latent Semantic Analysis
To visualize our toy example:
1. Check which segments we currently assign to these users
- Rule based segments
2. Randomly select two segments
3. Find a linear transformation of our feature space
- That maximizes the distance between samples which:
– Have segment 1
– Have segment 2
– Have segment 1 and 2
– Have neither of the two segments
-> Linear Discriminant Analysis
Sentiance Confidential 2017
Latent Semantic Analysis
To quickly estimate the capabilities:
1. Balance classes (resample majority class)
2. Calculate user embeddings
3. Train a logistic regressor
- Treat current segment detections as ground truth
- 10-fold cross validation
- Binary classification
Sentiance Confidential 2017
1. Long commuter:
• Accuracy: 81%
2. Short commuter:
• Accuracy: 81%
Sentiance Confidential 2017
1. Uber parent:
• Accuracy: 77%
2. Social activity:
• Accuracy: 88%
Sentiance Confidential 2017
1. Shopaholic:
• Accuracy: 83%
2. Student:
• Accuracy: 63%
Sentiance Confidential 2017
1. Sportive:
• Accuracy: 90%
2. Fastfood:
• Accuracy: 89%
Sentiance Confidential 2017
1. Green commuter:
• Accuracy: 89%
2. Shopaholic:
• Accuracy: 83%
Sentiance Confidential 2017
1. Home worker:
• Accuracy: 88%
2. Fulltime worker:
• Accuracy: 90%
Sentiance Confidential 2017
1. Work traveler:
• Accuracy: 76%
2. Frequent Flyer:
• Accuracy: 85%
Sentiance Confidential 2017
Latent Semantic Analysis
1. What do these feature spaces capture?
2. Fun experiment:
1. We have ‘age’ data for 500 users.
2. Let’s see if different ages behave differently
3. If so: We have an age classifier, built in a few minutes time!
Sentiance Confidential 2017
• Simple logistic regression
• 200D feature space
• 10-fold cross-validation
• Accuracy: 67%
• Recall: 67%
Immediate use-case:
Split up our student/teacher segment
Sentiance Confidential 2017
• Simple logistic regression
• 200D feature space
• 10-fold cross-validation
• Accuracy: 56%
• Recall: 55%
Immediate use-case:
Split up our parent segment (mom/dad)
• If we only trust predictions with prob > 0.7
• Then we only classify 70% of the users
• Accuracy: 70%
• Recall: 70%
Sentiance Confidential 2017
Latent Semantic Analysis
1. Project new document into that space?
A = U S Vᵀ
=> Vᵀ = S⁻¹ U⁻¹ A
=> Vᵀ = S⁻¹ Uᵀ A   (U is orthogonal, so U⁻¹ = Uᵀ)
=> V = (S⁻¹ Uᵀ A)ᵀ
=> V = Aᵀ U S⁻¹
=> New document q: q’ = q U S⁻¹
So: we can project new users in an existing, pre-trained feature space!
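A minimal numpy sketch of this folding-in, with placeholder data (A is the terms x documents matrix, q a new user's term vector):

import numpy as np

A = np.random.rand(125, 500)        # placeholder terms x documents matrix (terms = rows)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 20
U_k, s_k = U[:, :k], s[:k]

q = np.random.rand(1, 125)          # new user: a row vector of term counts / tf-idf values
q_embedding = q @ U_k / s_k         # q' = q U S^-1: a 1 x k embedding in the existing space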
Sentiance Confidential 2017
Latent Semantic Analysis
1. Word embeddings: W = US
2. Document embeddings: D = VS
3. W and D are in the same space
4. W represents words as a linear combination of ‘topics’
5. D represents documents as a linear combination of ‘topics’
6. S represents the importance of each topic
Sentiance Confidential 2017
Latent Semantic Analysis
1. Topics are ‘semantic dimensions’
• Uncorrelated (orthogonal) factors that can be linearly combined to obtain all
terms
2. Topic analysis?
1. Each column of W is a topic
2. The value inside the cell shows how important a word is to that topic
3. Looking at the most important ones can help interpreting topic meanings
Sentiance Confidential 2017
Latent Semantic Analysis
1. E.g.
[Diagram: an example word vector, topic vector, and document vector in the shared latent space]
Sentiance Confidential 2017
Latent Semantic Analysis
1. LSA basically finds the latent factors underlying the terms and documents:
[Diagram: Words 1–5 and Documents 1–4 linked through latent Topics 1–3 ('described by' / 'contains')]
Sentiance Confidential 2017
Latent Semantic Analysis
1. Topics as linear combinations of words
Topic 1: (green commute?)
16.60 * biking (very short)
13.75 * biking (very short) (noon)
13.16 * biking
11.63 * train (very short)
11.28 * biking (very short) (morning)
Topic 2: (drunk students?)
12.90 * biking (very long)
7.81 * drinks_pub
7.56 * shop_supermarket (medium)
7.16 * biking (very long) (noon)
5.77 * biking (medium)
Topic 3: (sales people?)
14.76 * travel_expensive
13.15 * travel_expensive (very short)
9.44 * drinks_pub
9.03 * travel_expensive (very short) (noon)
7.91 * home (long) (morning)
Topic 4: (students?)
11.26 * travel_expensive
10.80 * travel_expensive (very short)
8.18 * walking (very short) (noon) (medium dist)
7.67 * travel_expensive (very short) (noon)
6.20 * education_school
Sentiance Confidential 2017
Latent Semantic Analysis
1. Remember:
1. We simply applied SVD
2. Exactly in the same way as for word embeddings
3. But now we also have document embeddings
2. Also remember:
• If we had extracted the mean from all our columns beforehand…
• Then SVD would just be PCA! (you would get the same word embeddings)
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. We have a user-item matrix with attributes
Items
Users
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. We have a user-item matrix with attributes
2. Some entries might be missing or noisy
-> Replace the few missing entries with their average and consider them noise
3. We now want to:
1. Calculate user-similarity. Which users have similar behavior?
2. Remove the noise from our original matrix (and fill/correct missing values)
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. Solution:
• Treat users as ‘Documents’
• Treat user attributes as ‘term counts’
• Attributes have different meaning/units
- So we have to standardize them: use z-scores
• Apply SVD (LSA)
- To find user embeddings
- To get the most similar users
• Exactly the same as what we did for LSA!
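A minimal sketch of this recipe (standardize, truncated SVD, cosine similarity); the shapes, parameters and random data are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

attributes = np.random.rand(1000, 40)                 # placeholder (n_users, n_attributes) matrix

X = StandardScaler().fit_transform(attributes)        # z-scores per attribute
user_embeddings = TruncatedSVD(n_components=10).fit_transform(X)

# Most similar users ("look-alikes") for user 0
sims = cosine_similarity(user_embeddings[:1], user_embeddings)[0]
lookalikes = np.argsort(-sims)[1:6]                   # top-5, excluding the user itself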
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. SVD: Keep k dimensions
A’ = U S Vᵀ, keeping only the top k singular values
A’: denoised, low-rank approximation of the attributes x users matrix
U: rows are k-dimensional item (attribute) embeddings
Vᵀ: columns are k-dimensional user embeddings
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. Note:
• If the ratings or attributes have different meanings and scales:
- Need to standardize (subtract mean, divide by std)
- In this case, SVD is exactly the same as PCA (to get the item
embeddings)!
- But we get the user embeddings for free with SVD
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. SVD: Keep k dimensions
A’ = U S Vᵀ, keeping only the top k singular values
A’: denoised, low-rank approximation of the attributes x users matrix
U: rows are k-dimensional item (attribute) embeddings => this is exactly PCA!
Vᵀ: columns are k-dimensional user embeddings
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. Reduce noise without reducing the dimensionality?
1. S contains the singular values on its diagonal
2. Set the smallest singular values to 0 (keep only the largest ones)
3. Now calculate A’ = U S Vᵀ
2. We now have cleaned estimates of the user’s attributes
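A small numpy sketch of this denoising step; the shapes and the cut-off of 10 singular values are illustrative:

import numpy as np

A = np.random.rand(1000, 40)                  # placeholder user-attribute matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

s_denoised = s.copy()
s_denoised[10:] = 0.0                         # zero out the smallest singular values

A_clean = (U * s_denoised) @ Vt               # A' = U S V^T: same shape as A, but denoised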
Sentiance Confidential 2017
Recommender systems: look-alike modeling
1. Input: user attributes (‘Sentiance attributes table’)
1. Filtered out non-numeric columns
2. Kept all numeric columns named *.value
3. Except for ‘has_home.value’ and ‘is_stable.value’
2. To evaluate the performance:
1. Similar exercise as before
2. Treat current segment detections as ground truth
3. This time we use 90k users (versus 5k users before)
Sentiance Confidential 2017
1. Long commuter:
• Accuracy: 96% (81%)
2. Short commuter:
• Accuracy: 98% (81%)
Sentiance Confidential 2017
1. Uber parent:
• Accuracy: 94% (77%)
2. Social activity:
• Accuracy: 96% (88%)
Sentiance Confidential 2017
1. Shopaholic:
• Accuracy: 92% (83%)
2. Student:
• Accuracy: 96% (63%)
Sentiance Confidential 2017
1. Sportive:
• Accuracy: 96% (90%)
2. Fastfood:
• Accuracy: 94% (89%)
Sentiance Confidential 2017
1. Green commuter:
• Accuracy: 96% (89%)
2. Shopaholic:
• Accuracy: 92% (83%)
Sentiance Confidential 2017
1. Home worker:
• Accuracy: 93% (88%)
2. Fulltime worker:
• Accuracy: 96% (90%)
Sentiance Confidential 2017
1. Work traveler:
• Accuracy: 93% (76%)
2. Frequent Flyer:
• Accuracy: 94% (85%)
Sentiance Confidential 2017
This is just the beginning...
1. Non-linear layers
• E.g. LSTM encoders
-> Careful with non-linearities if you want a Euclidean space!
2. Self-supervised learning
• E.g. triplet-loss optimization
-> Explicitly optimizes to find a Euclidean metric space!
3. Graph embeddings
• E.g. ‘deep city’
-> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector
Sentiance Confidential 2017
This is just the beginning...
1. Non-linear layers
• E.g. LSTM encoders
-> Careful with non-linearities if you want a Euclidean space!
2. Self-supervised learning
• E.g. triplet-loss optimization
-> Explicitly optimizes to find a Euclidean metric space!
3. Graph embeddings
• E.g. ‘deep city’
-> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector
Let’s keep that for another talk :)
Sentiance Confidential 2017
Conclusion
1. Embeddings:
1. Are extremely powerful
2. Allow us to cope with noisy, unlabeled input data
3. Are not magic
- You might need to fine-tune them for specific tasks (e.g. word embeddings for sentiment analysis
versus analogy inference)
- They may not capture differences caused by rare events
2. Often you can just start with:
1. PCA
2. SVD
Sentiance Confidential 2017
www.sentiance.com
More Related Content

Similar to Matrix factorization: Demistifying word embeddings

Sentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkSentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkBhavyateja Potineni
 
""Into the Wild" ... with Natural Language Processing and Text Classification...
""Into the Wild" ... with Natural Language Processing and Text Classification...""Into the Wild" ... with Natural Language Processing and Text Classification...
""Into the Wild" ... with Natural Language Processing and Text Classification...Dataconomy Media
 
Into the Wild - wilth Natural Language Processing and Text Classification - D...
Into the Wild - wilth Natural Language Processing and Text Classification - D...Into the Wild - wilth Natural Language Processing and Text Classification - D...
Into the Wild - wilth Natural Language Processing and Text Classification - D...Peter Grosskopf
 
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text MiningWill Stanton
 
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guide
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 GuideUva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guide
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guidedavih0fytav3
 
#like or #fail - How Can Computers Tell the Difference?
#like or #fail - How Can Computers Tell the Difference? #like or #fail - How Can Computers Tell the Difference?
#like or #fail - How Can Computers Tell the Difference? Mark Cieliebak
 
Knowing when to look
Knowing when to lookKnowing when to look
Knowing when to lookJaeHo Jang
 
Machine Learning for Designers
Machine Learning for DesignersMachine Learning for Designers
Machine Learning for DesignersMemi Beltrame
 
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tactics
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tacticsBára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tactics
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tacticsDevelcz
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixStefan Krawczyk
 
DeepSearch_Project_Report
DeepSearch_Project_ReportDeepSearch_Project_Report
DeepSearch_Project_ReportUrjit Patel
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory acijjournal
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codesNAVER D2
 

Similar to Matrix factorization: Demistifying word embeddings (20)

Raising the Bar
Raising the BarRaising the Bar
Raising the Bar
 
Sentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural networkSentimental analysis of financial articles using neural network
Sentimental analysis of financial articles using neural network
 
""Into the Wild" ... with Natural Language Processing and Text Classification...
""Into the Wild" ... with Natural Language Processing and Text Classification...""Into the Wild" ... with Natural Language Processing and Text Classification...
""Into the Wild" ... with Natural Language Processing and Text Classification...
 
Into the Wild - wilth Natural Language Processing and Text Classification - D...
Into the Wild - wilth Natural Language Processing and Text Classification - D...Into the Wild - wilth Natural Language Processing and Text Classification - D...
Into the Wild - wilth Natural Language Processing and Text Classification - D...
 
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text Mining
 
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guide
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 GuideUva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guide
Uva Essay Questions 2013. UVA Supplemental Essays: 2021-22 Guide
 
#like or #fail - How Can Computers Tell the Difference?
#like or #fail - How Can Computers Tell the Difference? #like or #fail - How Can Computers Tell the Difference?
#like or #fail - How Can Computers Tell the Difference?
 
NoEstimates @ miniIAD
NoEstimates @ miniIADNoEstimates @ miniIAD
NoEstimates @ miniIAD
 
Knowing when to look
Knowing when to lookKnowing when to look
Knowing when to look
 
Lean Content Is Smart Content - Andrew Becraft
Lean Content Is Smart Content - Andrew BecraftLean Content Is Smart Content - Andrew Becraft
Lean Content Is Smart Content - Andrew Becraft
 
Machine Learning for Designers
Machine Learning for DesignersMachine Learning for Designers
Machine Learning for Designers
 
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tactics
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tacticsBára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tactics
Bára Bühnová: Naučte se taktizovat s pomocí bad code smells a quality tactics
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
DeepSearch_Project_Report
DeepSearch_Project_ReportDeepSearch_Project_Report
DeepSearch_Project_Report
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
 
50.000 orange stickies later
50.000 orange stickies later50.000 orange stickies later
50.000 orange stickies later
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
 
Word2 vec
Word2 vecWord2 vec
Word2 vec
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Matrix factorization: Demistifying word embeddings

  • 15. Sentiance Confidential 2017 Introduction: embeddings 1. Thus, our goal: • Find a metric space that: - captures semantic similarity between data points in a distance metric - captures hidden patterns – aka semantic meaning – aka latent factors - Each observation = linear combination of these latent factors – obs1 = a*factor1 + b*factor2 + c*factor3 – obs2 = d*factor1 + e*factor2 + f*factor3 – obs3 = g*factor1 + h*factor2 + i*factor3 …
  • 16. Sentiance Confidential 2017 Introduction: embeddings 1. Matrix notation: - Each observation = linear combination of these latent factors – obs1 = a*factor1 + b*factor2 + c*factor3 – obs2 = d*factor1 + e*factor2 + f*factor3 – obs3 = g*factor1 + h*factor2 + i*factor3
  • 17. Sentiance Confidential 2017 Introduction: embeddings 1. Final goal: • Given the observation matrix: - Find the optimal factors AND - Find the optimal weight matrix
  • 18. Sentiance Confidential 2017 Introduction: embeddings 1. Final goal: • Given the observation matrix: - Find the optimal factors AND - Find the optimal weight matrix • Two approaches: - Matrix factorization (traditional approach) - Machine learning (state-of-the-art) -> Which works better? -> Are they related?
  • 19. Sentiance Confidential 2017 Introduction: embeddings 1. Final goal: • Given the observation matrix: - Find the optimal factors AND - Find the optimal weight matrix • Two approaches: - Matrix factorization (traditional approach) - Machine learning (state-of-the-art) -> Which works better? -> Are they related? That is the topic of this talk!
  • 20. Sentiance Confidential 2017 Outline 1. Word embeddings: count vectors 2. Word embeddings: PPMI 3. Word embeddings: PCA 4. Word embeddings: SVD 5. Word embeddings: Word2Vec 6. Word embeddings: Glove 7. User Embeddings: LSA 8. User Embeddings: Collaborative filtering Python code available!
  • 21. Sentiance Confidential 2017 Word embeddings: Toy example • Input: user timelines of stationary visits • Flatten the event-type hierarchy (don’t care about the hierarchy levels for now) • Small set of 3384 random users • Total number of events: 789,043 • Vocabulary (event type) size: 125 E.g. [u'work', u'home', u'work', u'travel_busstop', u'work', u'home', u'travel_busstop', u'home', u'shop_supermarket', u'work', u'travel_parking', u'travel_expensive', u'shop', u'shop', u'travel_station', …]
  • 22. Sentiance Confidential 2017 Word embeddings: count vectors 1. Define context window (e.g. bi-grams) 2. Count how many times each word co-occurs in a context window • E.g. window_size = 5 [work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …]
  • 23. Sentiance Confidential 2017 Word embeddings: count vectors 1. Define context window (e.g. bi-grams) 2. Count how many times each word co-occurs in a context window • E.g. window_size = 5 [work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …] context window
  • 24. Sentiance Confidential 2017 Word embeddings: count vectors 1. Define context window (e.g. bi-grams) 2. Count how many times each word co-occurs in a context window • E.g. window_size = 5 [work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …] context window busstop work home supermarket parking home 2 1 1 0 0
  • 25. Sentiance Confidential 2017 Word embeddings: count vectors 1. Define context window (e.g. bi-grams) 2. Count how many times each word co-occurs in a context window • E.g. window_size = 5 [work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …] context window busstop work home supermarket parking home 2 1 1 0 0 busstop 0 1 2 1 0
  • 26. Sentiance Confidential 2017 Word embeddings: count vectors 1. Little trick: • Weigh context words by their distance to the target word [work, home, work, travel_busstop, work, home, travel_busstop, home, shop_supermarket, work, travel_parking, …] context window busstop work home supermarket parking home 1.5 1 0.5 0 0 busstop 0 0.5 2 0.5 0 Count weights: 1/2, 1/1 (left of the target), 1/1, 1/2 (right of the target) Later more on this trick and others
  • 27. Sentiance Confidential 2017 Word embeddings: count vectors 1. Word vectors: • home = [1.5, 1.0, 0.5, 0.0, 0.0] • busstop = [0.0, 0.5, 2.0, 0.5, 0.0] • Distance between word vectors is - almost independent from their co-occurrence - almost completely determined by the similarity of the context they appear in
  • 28. Sentiance Confidential 2017 Word embeddings: count vectors 1. Define context window (e.g. bi-grams) 2. Count how many times each word co-occurs in a context window 3. Vector length |v| ~ corpus size shop_supermarket travel_aerodrome drinks_pub travel_campsite sport_golf shop_marketplace leisure_theatre travel_station shop_supermarket 12725 399 874 205 58 70 72 1425 travel_aerodrome 398 9274 55 11 1 9 8 416 drinks_pub 873 55 959 37 7 5 18 289 travel_campsite 202 11 37 762 1 1 2 16 sport_golf 57 1 7 1 1212 1 0 8 (Sparse, high dimensional) word embeddings!
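As a concrete illustration of the counting step above, here is a minimal sketch (our own helper, not the Python code the deck refers to) that builds a distance-weighted co-occurrence matrix from a flat event timeline; the 1/distance weighting mirrors the "count weights" trick on slide 26:

```python
# Minimal sketch: distance-weighted co-occurrence counts from a flat timeline.
import numpy as np

def cooccurrence_matrix(tokens, window=2, weighted=True):
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):              # look d tokens to the right
            if i + d >= len(tokens):
                break
            c = tokens[i + d]
            weight = 1.0 / d if weighted else 1.0   # closer contexts count more
            counts[idx[w], idx[c]] += weight        # symmetric window:
            counts[idx[c], idx[w]] += weight        # count both directions
    return counts, vocab

timeline = ['work', 'home', 'work', 'travel_busstop', 'work',
            'home', 'travel_busstop', 'home', 'shop_supermarket', 'work']
C, vocab = cooccurrence_matrix(timeline, window=2)
```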
  • 29. Sentiance Confidential 2017 Word embeddings: count vectors 1. Let’s visualize in 2D using a manifold projection • Isomap (just because it looked nicer than t-SNE in this case) • Finds a 2D space that maintains local distances 2. Normalize each vector to unit length first • We are not interested in vector length (it depends on vocabulary size) • Divide by the L2-norm => Euclidean distance ~ cosine distance
  • 30. Sentiance Confidential 2017 Word embeddings: count vectors
  • 32. Sentiance Confidential 2017 Word embeddings: PPMI 1. Problem: Highly dominated by words that occur a lot 2. Solution: Pointwise Mutual Information: PMI(w, c) = log( P(w,c) / (P(w)P(c)) ) 3. ‘How much more do we observe (w,c) compared to random chance?’ • If w and c are completely uncorrelated: P(w,c) = P(w)P(c), so PMI(w, c) = 0
  • 33. Sentiance Confidential 2017 Word embeddings: PPMI 1. Problem: Highly dominated by words that occur a lot 2. Solution: Pointwise Mutual Information 3. Problem with PMI: Negative values often noisy 4. Solution: Set to zero
  • 34. Sentiance Confidential 2017 Word embeddings: PPMI 1. If fij are the co-occurrence counts, we have W (words) rows, and C (contexts) columns: P(w,c) = f_wc / sum(f), P(w) = sum_c f_wc / sum(f), P(c) = sum_w f_wc / sum(f), and PPMI(w,c) = max( log( P(w,c) / (P(w)P(c)) ), 0 )
  • 35. Sentiance Confidential 2017 Word embeddings: PPMI 1. Define context window (e.g. skip-gram) 2. Count how many times each word co-occurs in the context window 3. Vector length |v| ~ corpus size shop_supermarket travel_aerodrome drinks_pub travel_campsite sport_golf shop_marketplace leisure_theatre travel_station shop_supermarket 0.24 0.0 0.0 0.29 0.0 0.16 0.0 0.0 travel_aerodrome 0.0 5.51 0.0 0.0 0.0 0.07 0.0 0.15 drinks_pub 0.0 0.0 2.64 0.89 0.0 0.0 0.41 0.0 travel_campsite 0.27 0.0 0.89 8.19 0.0 0.63 0.14 0.0 sport_golf 0.0 0.0 0.0 0.0 8.86 0.63 0.0 0.0 (Sparse, high dimensional) word embeddings!
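To make the PPMI step concrete, a minimal sketch (our own helper names) that turns the raw co-occurrence counts from the previous sketch into a PPMI matrix:

```python
# Minimal sketch: turn raw co-occurrence counts into a PPMI matrix.
import numpy as np

def ppmi(counts, eps=1e-12):
    p_wc = counts / counts.sum()                 # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)        # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)        # marginal P(c)
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                  # PPMI: clip negative PMI to zero

P = ppmi(C)   # C: the count matrix from the previous sketch
```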
  • 37. Sentiance Confidential 2017 Word embeddings: PPMI 1. Problem 1: Huge dimensionality (e.g. a 50,000-dimensional embedding vector) • Distances in high dimensions become meaningless • Everything is almost equally far away from everything else
  • 39. Sentiance Confidential 2017 Word embeddings: PPMI 1. Problem 1: Huge dimensionality (e.g. a 50,000-dimensional embedding vector) • Distances in high dimensions become meaningless • Everything is almost equally far away from everything else 2. Problem 2: Very sparse matrix • Inefficient. Can we compress it (turn it into a dense matrix)? • Aggravates the high-dimensional distance issue 3. Problem 3: Counts are very noisy • Can we denoise the matrix? • Can we transfer knowledge from other co-occurrence counts?
  • 40. Sentiance Confidential 2017 Word embeddings: PPMI 1. Solution: • Find a low-rank approximation of the PPMI matrix - Uncorrelated columns -> Orthogonal basis - Dense - De-noised - Low-dimensional (when decomposed using matrix factorization) • => Compression + noise reduction
  • 41. Sentiance Confidential 2017 Word embeddings: PCA 1. Idea: PCA! 2. New columns: linear combinations of old columns 3. De-correlated dimensions => Low rank! • X’ = VX gives reduced dimensionality (X is input data) • X’’ = VTX’ gives reconstructed low-rank approximation 4. Keeps dimensions of largest variance => Noise reduction! • Assuming that the noise has smaller variance than the information
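For a quick baseline, the PCA idea above is a few lines with scikit-learn; this is only a sketch (it assumes the PPMI matrix P from the earlier sketch), and note that scikit-learn centers the data internally, which is exactly the issue the next slides discuss:

```python
# Minimal sketch: PCA on the PPMI matrix P (rows = words).
from sklearn.decomposition import PCA

pca = PCA(n_components=2)               # 2 components just for the tiny toy matrix
X_low = pca.fit_transform(P)            # ~ X' = VX: low-dimensional embeddings
X_hat = pca.inverse_transform(X_low)    # ~ X'' = V^T X': low-rank reconstruction
```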
  • 43. Sentiance Confidential 2017 Word embeddings: PCA 1. Problems: • Problem 1: PCA breaks down if the data is not normally distributed - Word co-occurrences are all positive, so not Gaussian => No guarantee that the eigenvectors capture the directions of largest variance
  • 44. Sentiance Confidential 2017 Word embeddings: PCA 1. Problems: • Problem 2: PCA requires centering the data (subtracting the mean) - Sparseness is lost => Computationally more complex - Angles between vectors are not preserved => What happens to cosine distance? Example: Θ=90°. Cosine distance = 1 - cos(Θ), with u·v = ||u||2 ||v||2 cos(Θ), and cos(90°) = 0.0 => Cosine distance = 1.0
  • 45. Sentiance Confidential 2017 Word embeddings: PCA 1. Problems: • Problem 2: PCA requires centering the data (subtracting the mean) - Sparseness is lost => Computationally more complex - Angles between vectors are not preserved => What happens to cosine distance? Example: Θ=180°. Cosine distance = 1 - cos(Θ), with u·v = ||u||2 ||v||2 cos(Θ), and cos(180°) = -1.0 => Cosine distance = 2.0
  • 46. Sentiance Confidential 2017 Word embeddings: SVD 1. Forget about capturing largest variance 2. Given the co-occurrence matrix A • Find a low-rank approximation Ak • That minimizes ||A – Ak||F Note: the Frobenius norm ||A||F of a matrix A is simply the Euclidean norm of the vector containing its elements: ||A||F = sqrt( sum_ij A_ij^2 )
  • 47. Sentiance Confidential 2017 Word embeddings: SVD 1. Forget about capturing largest variance 2. Given co-occurrence matrix A • Find low-rank approximation Ak • That minimizes ||A – Ak||F 3. Optimal solution: • Singular Value Decomposition (SVD)
  • 48. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: A = U S VT, with A an m x n matrix of rank r, U an m x m matrix, S an m x n diagonal matrix (only the first r singular values are non-zero), and VT an n x n matrix
  • 49. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: A = U S VT (as above). For skip-gram PMI, m = n and A is symmetric. But let’s look at the general case here
  • 50. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: A = U S VT. The columns of U are eigenvectors of AAT, the columns of V are eigenvectors of ATA, and S holds the singular values of A (the square roots of the eigenvalues of ATA). If the data is centered, remember: cov(A) ~ ATA => cov(AT) ~ AAT
  • 51. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: A = U S VT. If the data is centered: US = PCA(A), i.e. US are the PCA scores of A
  • 52. Sentiance Confidential 2017 Word embeddings: SVD S is diagonal => A = (US)VT and => A = U(SVT) 1. First case: Express rows of A as linear combination of rows of VT, with (US) as coefficients 2. Second case: Express columns of A as linear combinations of columns of U, with (SVT) as coefficients
  • 53. Sentiance Confidential 2017 Word embeddings: SVD S is diagonal, let D=S1/2 => A = UDDVT => A = (UD)DTVT and => A = UD(DTVT) 1. First case: Word embeddings: UD 2. Second case: Context embeddings: VD - Equal in case of skip-gram word-contexts - Because in that case A is symmetric => U==V - Not equal in other cases Remember: • (AB)T = BTAT
  • 54. Sentiance Confidential 2017 Word embeddings: SVD 1. We now have low-rank word embeddings -> Word ~ linear combination of decorrelated (orthogonal) factors 2. Still high dimension (~ vocabulary size) and noisy -> Next step: dimensionality reduction
  • 55. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: keep only the k largest singular values: A' = Uk Sk VkT (A: words x contexts, m x n; Uk: m x k; Sk: k x k; VkT: k x n)
  • 56. Sentiance Confidential 2017 Word embeddings: SVD 1. SVD: keep only the k largest singular values: A' = Uk Sk VkT. The rows of Uk (one per word) give k-dimensional word embeddings, the rows of Vk (one per context) give k-dimensional context embeddings, and A' is a denoised, low-rank approximation of A
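A minimal sketch of this truncated SVD step (our own helper, assuming the PPMI matrix P from earlier); it uses the symmetric weighting W = Uk · Sk^0.5, one of the "eigenvalue weighting" choices discussed later in the deck:

```python
# Minimal sketch: truncated SVD of the PPMI matrix P.
# Word vectors = U_k * S_k^p; p = 0.5 is the symmetric weighting (W = UD, D = S^1/2).
import numpy as np

def svd_embeddings(ppmi_matrix, k=100, p=0.5):
    U, S, Vt = np.linalg.svd(ppmi_matrix, full_matrices=False)
    U_k, S_k, V_k = U[:, :k], S[:k], Vt[:k, :].T
    word_vecs = U_k * (S_k ** p)        # k-dimensional word embeddings
    context_vecs = V_k * (S_k ** p)     # k-dimensional context embeddings
    return word_vecs, context_vecs

W_vecs, C_vecs = svd_embeddings(P, k=2)   # k=2 just for the tiny toy matrix
```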
  • 58. Sentiance Confidential 2017 (2D visualization of the resulting word embeddings; labeled clusters: home-related shopping activities, public transport, health, sport, restaurants, holiday, office)
  • 59. Sentiance Confidential 2017 Word embeddings: SVD 1. Words as linear combinations of topics: Uk Sk is an m x k matrix whose rows are the word vectors and whose columns are the topics (linear weights over all words). Let’s have a look at some of these topics…
  • 60. Sentiance Confidential 2017 Word embeddings: SVD 1. Topics as linear combinations of words Topic 1: (tourism?) 1.59 * leisure_garden 0.95 * leisure_themepark 0.77 * leisure_playground 0.75 * leisure_museum 0.64 * leisure Topic 2: (home-shops?) 0.74 * shop_wholesale 0.73 * shop_pet 0.60 * shop_furniture 0.60 * shop_varietystore 0.57 * shop_hardware Topic 3: (holidays?) 0.94 * travel_terminal 0.88 * travel_aerodrome 0.84 * leisure_playground 0.77 * sport_swimmingpool 0.57 * travel_caravansite Topic 4: (holiday-shops?) 1.59 * shop_hobby 0.79 * shop_food 0.75 * travel_ferryterminal 0.65 * shop_retail 0.62 * leisure_marina Topic 5: (office/health?) 0.63 * restaurant_french 0.58 * office_library 0.56 * health_doctors 0.56 * health_dentist 0.56 * office_townhall Topic 6: (sport-ish?) ¯_(ツ)_/¯ 0.83 * shop_alcohol 0.63 * leisure_naturereserve 0.56 * shop_hairdresser 0.55 * sport_pitch 0.53 * sport_stadium
  • 61. Sentiance Confidential 2017 Word embeddings: SOTA 1. How does this differ from: • Word2vec • Glove Must read: Levy et al. “Neural word embedding as implicit matrix factorization.” https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf Levy et al. "Improving distributional similarity with lessons learned from word embeddings." https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf Great summary blog posts: • http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/ • http://ruder.io/secret-word2vec/ • https://rare-technologies.com/making-sense-of-word2vec/
  • 62. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • Learns word embedding matrix W • Learns auxiliary context embedding matrix C
  • 63. Sentiance Confidential 2017 Word embeddings: Word2Vec (source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
  • 64. Sentiance Confidential 2017 Word embeddings: Word2Vec (diagram: word embedding matrix W is 10000 x 300, context matrix C is 300 x 10000)
  • 65. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • Only 1 hidden layer, no activation function => Linear! • Learns a word embedding matrix W • Learns an auxiliary context embedding matrix C (both d-dimensional)
  • 66. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • What if you multiply them together: what is the product of W and C?
  • 67. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • Levy et al. proved that, for large enough d and enough iterations, the product of W and C -> a shifted version of the word-context PPMI matrix!
  • 68. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • Opt(w, c) = max(0.0, PMI(w, c) - log(k)) • k = number of negative samples • log(k) acts as a prior/regularizer and is 0 if k = 1
  • 69. Sentiance Confidential 2017 Word embeddings: Word2Vec 1. Word2Vec (skip-gram with negative sampling) • Thus: Word2Vec factorizes the word-context co-occurrence matrix • So does SVD! - Although the matrices here are unconstrained (not orthogonal) • What about Glove?
  • 70. Sentiance Confidential 2017 Word embeddings: Glove 1. Glove • Explicitly start by calculating the co-occurrence matrix X • Define a cost function - Predict the (log) co-occurrence of two words from their embedding vectors • Optimize a weighted least-squares objective: J = sum_ij f(X_ij) * (w_i·c_j + b_i + b_j - log(X_ij))^2, where the weighting function f decreases the importance of very high co-occurrence counts
  • 71. Sentiance Confidential 2017 Word embeddings: Glove 1. Glove • Thus: Glove factorizes the (logarithm of the) word-context co-occurrence matrix • So does SVD!
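For reference, a sketch of the published GloVe weighting function and per-cell loss (the actual training loop, with AdaGrad updates over all non-zero cells of X, is omitted; w_i, c_j, b_i, b_j are the learned parameters):

```python
# Minimal sketch of GloVe's weighted least-squares objective for one
# (word i, context j) cell of the co-occurrence matrix X.
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0   # damp very frequent pairs

def glove_cell_loss(w_i, c_j, b_i, b_j, x_ij):
    err = w_i @ c_j + b_i + b_j - np.log(x_ij)          # predict log co-occurrence
    return glove_weight(x_ij) * err ** 2
```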
  • 72. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Do Word2Vec and Glove perform better than SVD? • They all do matrix factorization • But literature seems to suggest ‘yes’? • Levy et al. showed: - Reformulating prediction based (Word2Vec, Glove) methods into count based (SVD, PMI) framework reveals a bunch of new hyper parameters - Performance difference is all in the hyper parameters!
  • 73. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Preprocessing (word2vec) 1. Dynamic context windows 2. Subsampling 3. Deleting rare words 2. Postprocessing (Glove) 1. Adding word and context vectors 2. Eigenvalue weighting 3. Vector normalization 3. Association Metric (skip-gram with negative sampling) 1. Shifted PPMI 2. Context distribution smoothing
  • 74. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Preprocessing: Dynamic context windows 1. Assign a higher weight to contexts closer to the target word 2. Glove: Harmonic function -> Weight = 1/d, where d is the distance between the two words 3. Word2Vec: -> Weight ~ (S - d + 1)/S, where S is the maximum window size 4. Why is it called a ‘dynamic context window’? -> Word2Vec implements this by randomly sampling the window size between 1 and S
  • 75. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Preprocessing: Subsampling 1. With probability p, remove words with frequency f higher than threshold t 2. Do this *before* creating the context windows -> This implicitly increases the window size!
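A sketch of this subsampling step, using the drop probability 1 - sqrt(t/f) from the word2vec paper (the released C implementation uses a slightly different variant):

```python
# Minimal sketch: drop frequent words before building context windows.
import random
from collections import Counter

def subsample(tokens, t=1e-4, seed=0):
    rng = random.Random(seed)
    freq = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = freq[w] / total                        # relative frequency of the word
        p_drop = max(0.0, 1.0 - (t / f) ** 0.5)    # frequent words dropped more often
        if rng.random() >= p_drop:
            kept.append(w)
    return kept
```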
  • 76. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Preprocessing: Deleting rare words 1. Remove words that occur very infrequently in the training data 2. Do this *before* creating the context windows -> This implicitly increases the window size!
  • 77. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Postprocessing: Adding word and context vectors (e.g. Glove) • Resulting cosine distance: - combines first-order similarities (w.c) and second-order similarities (wx.wy, cx.cy) - (Proof: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
  • 78. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Postprocessing: Eigenvalue weighting 1. Different ways to get word vectors W and context vectors C: - W = U, C = V - W = US, C = V - W = US1/2, C = VS1/2 2. In general: - W = USp, C = VSp
  • 79. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Postprocessing: Vector normalization • Different ways to normalize W 1. L2-normalize the rows => word vectors are unit length 2. L2-normalize the columns 3. Both, or other norms In practice, first option usually works significantly better than others
  • 80. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Association metric: Shifted PPMI • SPPMI = max(0.0, PMI(w, c)-log(k)) • k acts as a prior on the probability of observing a positive co-occurrence • This is what skip-gram Word2Vec with negative sampling (k negative samples) optimizes implicitly
  • 81. Sentiance Confidential 2017 Word embeddings: Hyperparameters 1. Association metric: Context distribution smoothing • PMI is biased towards rare words • To alleviate this bias: - Raise context counts to the power of α≤1 (e.g. α = 0.75 seems to work well) - This increases Pα (c) for rare words => Reduces PMI • Implicitly done in Word2Vec during negative sampling
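Putting the last two tricks together, a minimal sketch (our own helper) of shifted PPMI with context-distribution smoothing, the count-based counterpart of SGNS with k negative samples as described by Levy et al.:

```python
# Minimal sketch: shifted PPMI with context-distribution smoothing.
import numpy as np

def sppmi(counts, k=5, alpha=0.75, eps=1e-12):
    p_wc = counts / counts.sum()                     # joint P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)            # marginal P(w)
    c_alpha = counts.sum(axis=0) ** alpha            # context counts raised to alpha
    p_c = (c_alpha / c_alpha.sum())[None, :]         # smoothed P(c)
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi - np.log(k), 0.0)          # shift by log(k), clip at 0
```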
  • 82. Sentiance Confidential 2017 Word embeddings: Comparison 1. Levy et al. compared: • PPMI • SVD • Word2Vec (skip-gram with negative sampling: SGNS) • Glove • Use the same hyper parameters everywhere (if transferable) • Evaluate different hyper parameter values
  • 83. Sentiance Confidential 2017 Word embeddings: Comparison 1. Levy et al. compared: • 8 benchmarks - 6 word similarity tasks - 2 analogy tasks
  • 84. Sentiance Confidential 2017 Word embeddings: Comparison 1. Levy et al. conclusions: • Word2Vec outperforms Glove on every task - If used with the same hyper parameters! • SVD slightly outperforms Word2Vec sometimes - If used with the same hyper parameters! - On word similarity tasks • Word2Vec slightly outperforms SVD sometimes - If used with the same hyper parameters! - On word analogy tasks
  • 85. Sentiance Confidential 2017 Word embeddings: Comparison 1. Nevertheless, Glove/Word2Vec: • Can be easier to train sometimes • Hyperparameters are implicitly part of the algorithms • Outperform SVD out-of-the-box Main take-away: • Mind the hyperparameters • Quick experiments/results with SVD (or even PCA) can give good indications! • If we had subtracted the means from all columns before SVD… this would just be PCA!
  • 86. Sentiance Confidential 2017 Latent Semantic Analysis 1. What if we have: • Multiple documents (or e.g. users) • A whole set of words per document (or e.g. user specific attributes) 2. And we want: • Word embeddings • User embeddings Can we re-use the same SVD approach described earlier? -> Remember: Input matrix A does not have to be square!
  • 87. Sentiance Confidential 2017 Latent Semantic Analysis 1. First step: Create some kind of co-occurrence matrix 2. But for term-documents this time 3. For each word, count how many times it occurs in each document (rows: words, columns: documents — or, in our case, venue visit counts per user)
  • 88. Sentiance Confidential 2017 Latent Semantic Analysis 1. Normalize counts to decrease the importance of words that occur in all docs 2. Typical way: Term Frequency Inverse Document Frequency (TF-IDF) matrix 3. Term frequency (augmented variant): tf(t, d) = K + (1 - K) * f(t, d) / max_t’ f(t’, d) -> K often set to 0.4 or 0.5 to make sure small changes to counts don’t cause big changes in value
  • 89. Sentiance Confidential 2017 Latent Semantic Analysis 1. Term frequency: tf(t, d) = K + (1 - K) * f(t, d) / max_t’ f(t’, d) 2. Inverse document frequency: idf(t) = log( N / n_t ), with N the number of documents and n_t the number of documents containing t 3. tf-idf = tf · idf (matrix rows: words, columns: documents; many different variants exist)
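A minimal sketch of this tf-idf variant for a (words x documents) count matrix, matching the layout on these slides (again, one of many possible variants):

```python
# Minimal sketch: tf-idf for a (words x documents) count matrix
# (rows = words/events, columns = documents/users).
import numpy as np

def tf_idf(counts, K=0.5):
    max_per_doc = counts.max(axis=0, keepdims=True).clip(min=1)
    tf = (K + (1.0 - K) * counts / max_per_doc) * (counts > 0)  # augmented tf,
                                                                # zero if word absent
    n_docs = counts.shape[1]
    df = (counts > 0).sum(axis=1, keepdims=True)   # nr. of docs containing each word
    idf = np.log(n_docs / (1.0 + df))              # +1: simple smoothing
    return tf * idf
```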
  • 90. Sentiance Confidential 2017 Latent Semantic Analysis 1. SVD: keep k dimensions: A' = Uk Sk VkT, with A the (words x documents) tf-idf matrix. The rows of Uk give k-dimensional word embeddings, the rows of Vk give k-dimensional document embeddings, and A' is a denoised, low-rank approximation of A
  • 91. Sentiance Confidential 2017 Latent Semantic Analysis 1. Word embeddings: W = US 2. Document embeddings: D = VS 3. W and D are in the same space
  • 92. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • 5000 users • For each user: count events • Calculate tf-idf and apply SVD • Goal: get user embeddings for look-alike modeling
  • 93. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • Add info on time-of-day, duration and distance - log(duration_in_minutes)/log(1000) - Consider 1 digit precision
  • 94. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • Add info on time-of-day, duration and distance - log(duration_in_minutes)/log(1000) - Consider 1 digit precision Buckets: • 1: 1-2 minutes • 2: 2-4 minutes • 3: 4-8 minutes • 4: 8-16 minutes • 5: 16-31 minutes • 6: 31-63 minutes • 7: 63-126 minutes • 8: 126-251 minutes • 9: 251-501 minutes • 10: 501-1000 minutes
  • 95. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • Add info on duration - log(duration_in_minutes)/log(1000) - Consider 1 digit precision Buckets: • 1: 1-2 minutes • 2: 2-4 minutes • 3: 4-8 minutes • 4: 8-16 minutes • 5: 16-31 minutes • 6: 31-63 minutes • 7: 63-126 minutes • 8: 126-251 minutes • 9: 251-501 minutes • 10: 501-1000 minutes • shop_4 => Shop visit of 8-16 minutes • car_7 => Car trip of 63-126 minutes • …
  • 96. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • Similar for distance and time-of-day • Append augmentations to suffix list • Add all possible suffix combinations to vocabulary and increase count • E.g. Car trip of 10 km in 8 minutes at 20h00: - Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’] - Increase term-frequency counts for: – Car – Car_dur:4 – Car_dur:4_dist:4 – Car_dur:4_dist:4_tod:20 – Car_dist:4 – Car_dist:4_tod:20 – Car_tod:20 – Car_tod:20_dur:4
  • 97. Sentiance Confidential 2017 Latent Semantic Analysis 1. Our toy example: • Similar for distance and time-of-day • Append augmentations to suffix list • Add all possible suffix combinations to vocabulary and increase count • E.g. Car trip of 10 km in 8 minutes at 20h00: - Suffix list: [‘dur: 4’, ‘dist: 4’, ‘tod: 20’] - Increase term-frequency counts for: – Car – Car_dur:4 – Car_dur:4_dist:4 – Car_dur:4_dist:4_tod:20 – Car_dist:4 – Car_dist:4_tod:20 – Car_tod:20 – Car_tod:20_dur:4  Allows LSA to learn relations between these different features
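A sketch of this suffix-combination expansion (hypothetical helper name; it generates all unordered combinations, whereas the slide enumerates a similar but not identical set):

```python
# Minimal sketch: expand one event into all suffix combinations of its augmentations.
from itertools import combinations

def augmented_tokens(event, suffixes):
    tokens = []
    for r in range(len(suffixes) + 1):
        for combo in combinations(suffixes, r):
            tokens.append('_'.join((event,) + combo))
    return tokens

print(augmented_tokens('car', ['dur:4', 'dist:4', 'tod:20']))
# ['car', 'car_dur:4', 'car_dist:4', 'car_tod:20', 'car_dur:4_dist:4', ...]
```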
  • 99. Sentiance Confidential 2017 Latent Semantic Analysis ¯_(ツ)_/¯ ?
  • 100. Sentiance Confidential 2017 Latent Semantic Analysis To visualize our toy example: 1. Check which segments we currently assign to these users - Rule-based segments 2. Randomly select two segments 3. Find a linear transformation of our feature space - That maximizes the distance between samples which: – Have segment 1 – Have segment 2 – Have both segments – Have neither segment -> Linear Discriminant Analysis
  • 101. Sentiance Confidential 2017 Latent Semantic Analysis To quickly estimate the capabilities: 1. Balance classes (resample majority class) 2. Calculate user embeddings 3. Train a logistic regressor - Treat current segment detections as ground truth - 10-fold cross validation - Binary classification
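A sketch of this evaluation protocol with scikit-learn; X and y are placeholders here (random data for illustration), and the majority-class resampling mentioned in step 1 is omitted:

```python
# Minimal sketch: 10-fold cross-validated logistic regression on user embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # placeholder embeddings: 200 users x 50 dims
y = rng.integers(0, 2, size=200)     # placeholder binary segment labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
print(round(scores.mean(), 3))
```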
  • 102. Sentiance Confidential 2017 1. Long commuter: • Accuracy: 81% 2. Short commuter: • Accuracy: 81%
  • 103. Sentiance Confidential 2017 1. Uber parent: • Accuracy: 77% 2. Social activity: • Accuracy: 88%
  • 104. Sentiance Confidential 2017 1. Shopaholic: • Accuracy: 83% 2. Student: • Accuracy: 63%
  • 105. Sentiance Confidential 2017 1. Sportive: • Accuracy: 90% 2. Fastfood: • Accuracy: 89%
  • 106. Sentiance Confidential 2017 1. Green commuter: • Accuracy: 89% 2. Shopaholic: • Accuracy: 83%
  • 107. Sentiance Confidential 2017 1. Home worker: • Accuracy: 88% 2. Fulltime worker: • Accuracy: 90%
  • 108. Sentiance Confidential 2017 1. Work traveler: • Accuracy: 76% 2. Frequent Flyer: • Accuracy: 85%
  • 109. Sentiance Confidential 2017 Latent Semantic Analysis 1. What do these feature spaces capture? 2. Fun experiment: 1. We have ‘age’ data for 500 users. 2. Let’s see if different ages behave differently 3. If so: We have an age classifier, built in a few minutes time!
  • 110. Sentiance Confidential 2017 • Simple logistic regression • 200D feature space • 10-fold cross-validation • Accuracy: 67% • Recall: 67% Immediate use-case: Split up our student/teacher segment
  • 111. Sentiance Confidential 2017 • Simple logistic regression • 200D feature space • 10-fold cross-validation • Accuracy: 56% • Recall: 55% Immediate use-case: Split up our parent segment (mom/dad) • If we only trust predictions with prob > 0.7 • Then we only classify 70% of the users • Accuracy: 70% • Recall: 70%
  • 112. Sentiance Confidential 2017 Latent Semantic Analysis 1. Project a new document into that space? A = USVT => VT = S-1U-1A => VT = S-1UTA (U is orthogonal, so U-1 = UT) => V = (S-1UTA)T => V = ATUS-1 => New document q: q' = qUS-1 So: we can project new users into an existing, pre-trained feature space!
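A minimal sketch of this fold-in projection (our own helper name), assuming Uk and Sk come from the training-time truncated SVD:

```python
# Minimal sketch: fold a new user into an existing SVD space, q' = q U_k S_k^-1.
import numpy as np

def fold_in(q, U_k, S_k):
    # q: tf-idf vector of the new user/document (length = vocabulary size)
    # U_k: (vocab_size x k) left singular vectors, S_k: the k singular values
    return (q @ U_k) / S_k      # divide element-wise by the singular values

# new_embedding = fold_in(q_new, U_k, S_k)
```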
  • 113. Sentiance Confidential 2017 Latent Semantic Analysis 1. Word embeddings: W = US 2. Document embeddings: D = VS 3. W and D are in the same space 4. W represent words as a linear combination of ‘topics’ 5. D represents documents as a linear combination of ‘topics’ 6. S represents the importance of each topic
  • 114. Sentiance Confidential 2017 Latent Semantic Analysis 1. Topics are ‘semantic dimensions’ • Uncorrelated (orthogonal) factors that can be linearly combined to obtain all terms 2. Topic analysis? 1. Each column of W is a topic 2. The value inside the cell shows how important a word is to that topic 3. Looking at the most important ones can help interpreting topic meanings
  • 115. Sentiance Confidential 2017 Latent Semantic Analysis 1. E.g. (diagram: a topic vector, a word vector and a document vector plotted in the same space)
  • 116. Sentiance Confidential 2017 Latent Semantic Analysis 1. LSA basically finds the latent factors underlying the terms and documents: each document is described by a combination of topics, and each topic contains (weights) a set of words
  • 117. Sentiance Confidential 2017 Latent Semantic Analysis 1. Topics as linear combinations of words Topic 1: (green commute?) 16.60 * biking (very short) 13.75 * biking (very short) (noon) 13.16 * biking 11.63 * train (very short) 11.28 * biking (very short) (morning) Topic 2: (drunk students?) 12.90 * biking (very long) 7.81 * drinks_pub 7.56 * shop_supermarket (medium) 7.16 * biking (very long) (noon) 5.77 * biking (medium) Topic 3: (sales people?) 14.76 * travel_expensive 13.15 * travel_expensive (very short) 9.44 * drinks_pub 9.03 * travel_expensive (very short) (noon) 7.91 * home (long) (morning) Topic 4: (students?) 11.26 * travel_expensive 10.80 * travel_expensive (very short) 8.18 * walking (very short) (noon) (medium dist) 7.67 * travel_expensive (very short) (noon) 6.20 * education_school
  • 118. Sentiance Confidential 2017 Latent Semantic Analysis 1. Remember: 1. We simply applied SVD 2. Exactly in the same way as for word embeddings 3. But now we also have document embeddings 2. Also remember: • If we had subtracted the mean from all our columns beforehand… • Then SVD would just be PCA! (you would get the same word embeddings)
  • 119. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. We have a user-item matrix with attributes Items Users
  • 120. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. We have a user-item matrix with attributes 2. Some entries might be missing or noisy -> Replace the few missing entries with their average and consider them noise 3. We now want to: 1. Calculate user-similarity. Which users have similar behavior? 2. Remove the noise from our original matrix (and fill/correct missing values)
  • 121. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. Solution: • Treat users as ‘Documents’ • Treat user attributes as ‘term counts’ • Attributes have different meaning/units - So we have to standardize them: use z-scores • Apply SVD (LSA) - To find user embeddings - To get the most similar users • Exactly the same as what we did for LSA!
  • 122. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. SVD: keep k dimensions: A' = Uk Sk VkT, with A the (attributes x users) matrix. The rows of Uk give k-dimensional item (attribute) embeddings, the rows of Vk give k-dimensional user embeddings, and A' is a denoised, low-rank approximation of A
  • 123. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. Note: • If the ratings or attributes have different meanings and scales: - Need to standardize (subtract mean, divide by std) - In this case, SVD is exactly the same as PCA (to get the item embeddings)! - But we get the user embeddings for free with SVD
  • 124. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. SVD: keep k dimensions: A' = Uk Sk VkT, as above, giving k-dimensional item embeddings and k-dimensional user embeddings from a denoised, low-rank approximation A' — on standardized data, the item embeddings are the same as PCA!
  • 125. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. Reduce noise without reducing the dimensionality? 1. S contains the singular values on its diagonal 2. Set the k smallest singular values to 0 3. Now calculate A’ = USVT 2. We now have cleaned estimates of the user’s attributes
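A minimal sketch of this denoising step (our own helper): zero the smallest singular values and reconstruct, keeping the original dimensionality:

```python
# Minimal sketch: denoise the standardized user-attribute matrix by zeroing the
# smallest singular values and reconstructing A' = U S V^T (same shape as A).
import numpy as np

def denoise(A, keep=20):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    S_kept = np.where(np.arange(len(S)) < keep, S, 0.0)   # S is sorted descending
    return (U * S_kept) @ Vt

# A: z-scored user-attribute matrix; denoise(A) keeps only the top `keep`
# singular values but does not change the dimensionality.
```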
  • 126. Sentiance Confidential 2017 Recommender systems: look-alike modeling 1. Input: user attributes (‘Sentiance attributes table’) 1. Filtered out non-numeric columns 2. Kept all numeric columns named *.value 3. Except for ‘has_home.value’ and ‘is_stable.value’ 2. To evaluate the performance: 1. Similar exercise as before 2. Treat current segment detections as ground truth 3. This time we use 90k users (versus 5k users before)
  • 127. Sentiance Confidential 2017 1. Long commuter: • Accuracy: 96% (81%) 2. Short commuter: • Accuracy: 98% (81%)
  • 128. Sentiance Confidential 2017 1. Uber parent: • Accuracy: 94% (77%) 2. Social activity: • Accuracy: 96% (88%)
  • 129. Sentiance Confidential 2017 1. Shopaholic: • Accuracy: 92% (83%) 2. Student: • Accuracy: 96% (63%)
  • 130. Sentiance Confidential 2017 1. Sportive: • Accuracy: 96% (90%) 2. Fastfood: • Accuracy: 94% (89%)
  • 131. Sentiance Confidential 2017 1. Green commuter: • Accuracy: 96% (89%) 2. Shopaholic: • Accuracy: 92% (83%)
  • 132. Sentiance Confidential 2017 1. Home worker: • Accuracy: 93% (88%) 2. Fulltime worker: • Accuracy: 96% (90%)
  • 133. Sentiance Confidential 2017 1. Work traveler: • Accuracy: 93% (76%) 2. Frequent Flyer: • Accuracy: 94% (85%)
  • 134. Sentiance Confidential 2017 This is just the beginning... 1. Non-linear layers • E.g. LSTM encoders -> Careful with non-linearities if you want a Euclidean space! 2. Self-supervised learning • E.g. triplet-loss optimization -> Explicitly optimizes to find a Euclidean metric space! 3. Graph embeddings • E.g. ‘deep city’ -> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector
  • 135. Sentiance Confidential 2017 This is just the beginning... 1. Non-linear layers • E.g. LSTM encoders -> Careful with non-linearities if you want a Euclidean space! 2. Self-supervised learning • E.g. triplet-loss optimization -> Explicitly optimizes to find a Euclidean metric space! 3. Graph embeddings • E.g. ‘deep city’ -> Avoids the bottleneck of explicitly encoding a complete user timeline into a single vector Let’s keep that for another talk 
  • 136. Sentiance Confidential 2017 Conclusion 1. Embeddings: 1. Are extremely powerful 2. Allow us to cope with noisy, unlabeled input data 3. Are not magic - You might need to fine-tune them for specific tasks (e.g. word embeddings for sentiment analysis versus analogy inference) - They may not capture differences caused by rare events 2. Often you can just start with: 1. PCA 2. SVD