Matrix Decomposition Techniques
Matrix Decomposition
● Last week we examined the idea of latent spaces and how
we could use Latent Dirichlet Allocation to create a “topic
space.”
● LDA is not the only method to create latent spaces, so
today we’ll investigate some more “mathematically
rigorous” ways to accomplish the same task.
● Let’s start by reviewing latent spaces…
(Figure: a 3D space with axes labeled “Rabbit,” “Dish,” and “Pet,” reduced to a 2D space made of
a bit of each of the original dimensions.)
Latent Spaces – A Reminder
● Latent spaces are a “hidden” set of axes that we use to look at the information
in a different way.
● For NLP, we typically want to go from a “word space,” which may be 1000s of
dimensions, to a smaller dimensionality “combinations of words space.” We
called this a “topic space” last week.
● We used LDA to decide on how the topics were defined last week – it’s not our
only option.
● Clustering can work MUCH better in these lower dimensional spaces due to
the curse of dimensionality.
● Let’s look at some other ways to get to latent spaces.
Dimensionality Reduction without NLP
● Dimensionality reduction is an “old” technique in terms of mathematical
problems. The most famous version is Principal Component Analysis
(PCA).
Dimensionality Reduction without NLP
● A principal component analysis shifts the axes around to find a set of
axes that capture most of the variance in the data. In this example,
with just this one axis, we’re capturing the majority of how the data
changes. Principal Component 1
Dimensionality Reduction without NLP
● If we add back the same number of components, we capture the data
fully. However, we don’t HAVE to add that last component back. We
could just convert to a lower dimensionality representation of the data.
Principal Component 1
Principal Component 2
Dimensionality Reduction without NLP
● If we stick with one dimension in this example, we’ve captured MOST
of what makes each data point special, and we’re only storing half as
much data (PC1 vs [x,y]).
Principal Component 1
Simple PCA Example in Python
Output:
Input:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_data = pd.DataFrame(iris.data)
iris_data.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
print(iris_data.head())
Simple PCA Example in Python
Output:
Input:
from sklearn.decomposition import PCA

pca_1 = PCA(n_components=1)
iris_1_component = pca_1.fit_transform(iris_data)
print(pca_1.components_[0])
array([ 0.36158968, -0.08226889, 0.85657211, 0.35884393])
First principal component is:
0.36 * Sepal_length + (-0.08 * Sepal_width) + 0.86 * Petal_length + 0.36 * Petal_width
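As a quick check (my own, not from the slides), these weights should reproduce pca_1.fit_transform if we apply them to the mean-centered features, since scikit-learn’s PCA centers the data before projecting. A minimal sketch, assuming the pca_1, iris_data, and iris_1_component objects above:

import numpy as np

# PCA subtracts the column means before projecting onto the component
centered = iris_data - iris_data.mean(axis=0)
manual_projection = centered.values @ pca_1.components_[0]

print(np.allclose(manual_projection, iris_1_component.ravel()))  # True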
Simple PCA Example in Python
Output:
Input:
iris_1_component_df = pd.DataFrame(iris_1_component,
                                   columns=["new_component"])
print(iris_1_component_df.head())
Values of observations on
this new component
Principal Component Analysis: The Math
● In PCA, we start by finding the covariance matrix. Then we can find
eigenvalues and eigenvectors of that matrix.
● Let’s imagine we have some covariance matrix we’re calling A. We look
for vectors x where applying A is equivalent to applying a scalar:
A x = λ x (1), which can be rewritten as (A − λ I) x = 0.
● This last line is a set of equations that we can solve for the λ’s – the
eigenvalues. Substituting each eigenvalue back into equation (1) then
gives us the corresponding eigenvector.
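A minimal numpy sketch of this eigendecomposition route, using the same iris data as in the earlier example (my own illustration, not code from the slides):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Covariance matrix A of the centered features
A = np.cov(X_centered, rowvar=False)

# Solve A v = lambda v for all eigenpairs (eigh works on symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Sort by descending eigenvalue: the top eigenvector is the first principal component
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])
print(eigenvectors[:, order[0]])  # matches pca_1.components_[0] up to sign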
Principal Component Analysis
● The eigenvectors will define the new axis set, since the eigen-
decomposition is asking “what axes can we find that make the
covariance matrix diagonal, with no covariance left between the new axes?”
● The new space is a latent space. Great!
● So, can we just use this with text?
Principal Component Analysis
● Not really.
● One way to calculate principal components uses the covariance matrix
of the data.
● A covariance matrix requires that the variance of each feature be
truly meaningful. For example, does the appearance of x1 precipitate a
change in x2? For something like CountVectorizer, we’re
approximating those relationships in a multinomial way, but we’re not
really set up to truly capture the covariance.
Enter: Singular Value Decomposition (SVD)
● Singular value decomposition is another method of finding the principal
components, but without relying on the covariance matrix. We can
interact with the data directly to get our resulting latent space. We also
don’t need to form a square (covariance) matrix like in PCA.
● SVD takes one matrix – our data – and converts it into a
product of three matrices.
● This separation is called a “matrix factorization.” We’re taking one
matrix and breaking it into sub-matrices or “factors.”
Enter: Singular Value Decomposition (SVD)
DATA = U 𝛴 V*
Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● The columns of U are the eigenvectors of MM*, meaning those
eigenvectors are informed about the data. They know something about
our data set.
U
Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● The columns of V (the rows of V*) are the eigenvectors of M*M, so they
also know something about our data.
V*
Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● 𝛴 is made up of the square roots of the eigenvalues from both MM*
and M*M, placed along the diagonal. These are known as the
“Singular Values.”
𝛴
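A small numpy sketch (my own, not from the slides) that checks this relationship between the singular values and the eigenvalues of M*M:

import numpy as np

M = np.random.rand(6, 4)  # any rectangular matrix
U, sigma, Vt = np.linalg.svd(M)

# Eigenvalues of M*M (M.T @ M for a real matrix)
eigvals = np.linalg.eigvalsh(M.T @ M)

# The squared singular values match those eigenvalues (up to floating point error)
print(np.sort(sigma ** 2))
print(np.sort(eigvals))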
Singular Value Decomposition (SVD)
● If all the vectors and singular values are aligned properly, these three
matrices multiply together to give us exactly our data back.
● However, what if we chopped off the vectors and singular values that
are small, and only contribute a little bit to giving us our data back?
We’d then be approximating our data and getting close to the right
answer… but we’d have fewer dimensions!
● The remaining eigenvectors can help us determine our latent space
axes as well.
Truncated SVD
APPROXIMATE DATA ≈ U 𝛴 V* (keeping only the largest singular values and their vectors)
SVD in Python
Output:
Input:
import numpy as np

U, sigma, V = np.linalg.svd(iris_data)
iris_reconst = U[:, :4] @ np.diag(sigma) @ V  # rebuild from all 4 singular values
print(np.array(iris_reconst[:5, :])) # first 5 rows of reconstructed data
print(np.array(iris_data)[:5, :]) # first 5 rows of original data
array([[ 5.1, 3.5, 1.4, 0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
[ 4.6, 3.1, 1.5, 0.2],
[ 5. , 3.6, 1.4, 0.2]])
array([[ 5.1, 3.5, 1.4, 0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
[ 4.6, 3.1, 1.5, 0.2],
[ 5. , 3.6, 1.4, 0.2]])
Identical
SVD in Python (reducing dimensions)
Output:
Input:
import numpy as np

U, sigma, V = np.linalg.svd(iris_data)
iris_reconst = U[:, :2] @ np.diag(sigma[:2]) @ V[:2]  # keep only the 2 largest singular values
print(np.array(iris_reconst[:5, :])) # first 5 rows of reconstructed data
print(np.array(iris_data)[:5, :]) # first 5 rows of original data
array([[ 5.1, 3.5, 1.4, 0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
[ 4.6, 3.1, 1.5, 0.2],
[ 5. , 3.6, 1.4, 0.2]])
array([[ 5.1, 3.5, 1.4, 0.2],
[ 4.7, 3.2, 1.5, 0.3],
[ 4.7, 3.2, 1.3, 0.2],
[ 4.6, 3.1, 1.5, 0.3],
[ 5.1, 3.5, 1.4, 0.2]])
Close, with only two of
the four dimensions
Latent Semantic Analysis (LSA)
● If we apply SVD, and keep track of how the words (features) are
combined in the latent space, it’s called Latent Semantic Analysis
(LSA).
● Essentially, we do SVD while reducing the dimensionality (“Truncated
SVD”). Then we keep track that the new 1st dimension is made up
mostly of old dimensions that were called “bat,” “glove,” “base,” and “outfield.”
● That sounds an awful lot like topic modeling.
                                          Pet Rabbits   Eating Rabbit
“I love my pet rabbit.”                        87%            13%
“That’s my pet rabbit’s favorite dish.”        42%            58%
“She cooks the best rabbit dish.”              14%            84%
Topic Modeling
● Documents are not required to be one topic or another. They will be an overlap
of many topics.
● For each document, we will have a ‘percentage breakdown’ of how much of
each topic is involved.
● The topics will not be defined by the machine – we must interpret the word
groupings.
Why does LSA work?
DATA (words × documents) = U (words × new dimensions) · 𝛴 (new dimensions × new dimensions) · V* (new dimensions × documents)
We end up creating a system where the words and documents are both understood in our
latent space individually. We can then recombine them to get our original data back.
Why does LSA work?
DATA (words × documents) = U (words × new dimensions) · 𝛴 (weights, new dimensions × new dimensions) · V* (new dimensions × documents)
The 𝛴 matrix acts as a set of weights to determine how the words and documents combine
together in the latent space to correctly represent the real data.
Why does LSA work?
DATA (words × documents) = U (words × new dimensions) · 𝛴 (new dimensions × new dimensions) · V* (new dimensions × documents)
If we truncate our analysis to a certain number of topics, we’re essentially saying let’s only
take the most USEFUL hidden dimensions. That’s akin to setting everything in the red region
of 𝛴 to 0. That means the word*document combinations no longer contribute.
Why does LSA work?
DATA (words × documents) = U (words × new dimensions) · 𝛴 (new dimensions × new dimensions) · V* (new dimensions × documents)
If these dimensions don’t contribute, we’re now limited to a smaller number of “topics” to
represent our data. That means we won’t be able to reproduce the data exactly, but as long
as we don’t cut too many topics, we should still get a good representation of the data.
Latent Semantic Analysis (LSA)
So an LSA pipeline for topic modeling would look like:
1. Preprocess the text
2. Vectorize the text (TF-IDF, Binomial, Multinomial…)
3. Reduce the dimensionality with SVD, keeping track of the feature
composition of the new latent space
4. Investigate the output ”topics”
LSA with Python - Dataset Creation
Input:
# Dataset creation same as in prior weeks
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
ng_train = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes'))
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                   stop_words='english',
                                   token_pattern=r"\b[a-z][a-z]+\b",
                                   lowercase=True,
                                   max_df=0.6)
tfidf_data = tfidf_vectorizer.fit_transform(ng_train.data)
LSA with Python - Modeling
Output:
Input:
from sklearn.decomposition import TruncatedSVD
lsa_tfidf = TruncatedSVD(n_components=10)
lsa_tfidf_data = lsa_tfidf.fit_transform(tfidf_data)
show_topics(lsa_tfidf, tfidf_vectorizer.get_feature_names()) # a helper function I wrote (see the sketch below)
Topic 0: graphics, image, files, file, thanks, program, format, ftp, windows, gif
Topic 1: god, atheism, people, atheists, say, exist, belief, religion, believe,
bible
Topic 2: just, aspects graphics, aspects, group, graphics, groups, split, posts
week, week group, think
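The show_topics helper isn’t shown on the slide; here is a hypothetical sketch of what such a function might look like (the name, signature, and n_words default are my own assumptions). It works for any fitted decomposition that exposes components_, such as TruncatedSVD or NMF:

import numpy as np

def show_topics(model, feature_names, n_words=10):
    # For each component, print the n_words features with the largest weights
    for topic_idx, component in enumerate(model.components_):
        top_idx = np.argsort(component)[::-1][:n_words]
        print(f"Topic {topic_idx}: " + ", ".join(feature_names[i] for i in top_idx))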
Non-Negative Matrix Factorization (NMF)
● SVD isn’t the only way to do matrix factorization. That means LSA isn’t
the only way to do topic modeling with matrix factorization.
● Non-negative Matrix Factorization is another mathematical technique
to decompose a matrix into sub-matrices.
● NMF assumes there are no negative numbers in your original matrix.
● In NMF, we break the data matrix M down into two component
matrices.
● W is the features matrix. It will be of the shape (number of words,
number of topics). So a system with 5000 words and 10 topics will
have W be (5000,10).
● H is the “coefficients matrix.” It’s of the shape (number of topics,
number of documents).
● The combination of these two tells us how each word contributes to
each document, via some assumed middle dimension of size 10 (in
this example). That hidden dimension is our topic space!
Non-Negative Matrix Factorization (NMF)
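A minimal scikit-learn sketch of this factorization on a toy corpus (my own example; note that scikit-learn factors the (documents × words) matrix, so its W and H are transposed relative to the (words × documents) convention used on this slide):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "stocks rose on earnings news"]
M = TfidfVectorizer().fit_transform(docs)   # shape: (n_documents, n_words)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(M)                    # (n_documents, n_topics): topic mix per document
H = nmf.components_                         # (n_topics, n_words): word weights per topic

print(W.shape, H.shape)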
● Since NMF assumes no negative values, we have to be careful
about any specialized document vectorization we do. For instance,
depending on how you normalize or post-process TF-IDF values, it may
be possible to end up with a negative value.
● NMF and LSA are very similar in technique, but can end up with
different results due to different underlying assumptions.
● Always try both, to see which is working better for your dataset!
Non-Negative Matrix Factorization (NMF)
NMF with Python
Output:
Input:
from sklearn.decomposition import NMF
nmf_tfidf = NMF(n_components=10)
nmf_tfidf_data = nmf_tfidf.fit_transform(tfidf_data)
show_topics(nmf_tfidf, tfidf_vectorizer.get_feature_names()) # a function I wrote
Topic 0: jpeg, image, gif, file, color, images, format, quality, version, files
Topic 1: edu, graphics, pub, mail, ray, send, ftp, com, objects, server
Topic 2: jesus, matthew, prophecy, people, said, messiah, isaiah, psalm, david, king
Some rules of thumb that you will break:
● In practice, LSA/NMF tends to outperform LDA on small documents.
Things like tweets, forum posts, etc. That’s not always true, but worth
keeping in mind.
● In contrast, doing LSA/NMF on HUGE documents can lead to needing
100s of topics to make sense of things. LDA usually does a better job
on big documents with many topics.
● LSA and NMF are similar techniques and one doesn’t systematically
outperform the other.
● LSA and NMF both perform differently with TF-IDF vs Count Vectorizer
vs Binomial Vectorizers. Try them all.
Some rules of thumb that you will break:
● If you’re getting topics that make
sense, and your code isn’t buggy, your
process is good. There’s a lot of art in
NLP, and especially in topic modeling.
Similarity
● Two similar documents should be near each other in the latent space,
since they should have overlapping topics.
● We can’t rely on Euclidean distance in this space, because we want to
count documents as similar whether they mention a certain topic 900
times or only 600 times.
● Cosine distance measures this similarity well.
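A tiny illustration of that point (my own numbers, not from the slides): two documents with the same topic mix but very different lengths are far apart by Euclidean distance, yet identical by cosine distance.

import numpy as np
from scipy.spatial.distance import cosine, euclidean

long_doc = np.array([900.0, 100.0])   # long document, mostly topic 1
short_doc = np.array([9.0, 1.0])      # short document with the same topic mix
mixed_doc = np.array([500.0, 500.0])  # long document, evenly split between topics

# Euclidean distance is dominated by document length...
print(euclidean(long_doc, short_doc))  # ~896: the short doc looks "far away"
print(euclidean(long_doc, mixed_doc))  # ~566: the mixed doc looks "closer"

# ...while cosine distance only cares about the topic proportions
print(cosine(long_doc, short_doc))     # 0.0: identical mix
print(cosine(long_doc, mixed_doc))     # ~0.22: genuinely different mix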
Similarity
(Figures: documents plotted in a 2D topic space with axes “Cars” and “Dogs”; points are labeled
“No dogs, only cars,” “No cars, only dogs,” and “Mostly dogs.” A series of slides highlights which
pair of documents is “most similar by Euclidean distance” versus “most similar by cosine distance.”)
Similarity
● By using cosine distance in the topic space, the length of the
documents doesn’t matter – either relative or absolute.
● Comparisons in the latent space can allow us to make
recommendations: “If you like this, you’ll probably like this!”
● If we know each document’s breakdown in the latent space, we can
find documents about the same topics.
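A minimal sketch of such a recommendation lookup (my own, reusing the lsa_tfidf_data matrix from the earlier LSA example):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all documents in the LSA topic space
similarities = cosine_similarity(lsa_tfidf_data)

# The 5 documents most similar to document 0 (excluding document 0 itself)
most_similar = np.argsort(similarities[0])[::-1][1:6]
print(most_similar)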
Clustering
● In our latent spaces, if we’ve done our dimensionality reduction, we
should get much more meaningful clusters.
● LSA/NMF allow us to use clustering, just like LDA does.
● We also know that distances are more meaningful. So two
documents that are very similar will have very similar places in the topic
space. Example: two articles about baseball should be very high on
the “bat, glove, outfield” topic and very low on the “gif, jpg, png” topic.
● We can use this for document recommendation too!
Clustering with Python
Input:
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=25,random_state=42)
# Note: similarity in KMeans determined by Euclidean distance
labels = clus.fit_predict(lsa_tfidf_data)
Clustering with Python
● 2D visualization of resulting 25 clusters:
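The editor’s notes mention this plot was made with T-SNE; a rough sketch of how such a view could be reproduced (assuming the lsa_tfidf_data and labels objects from the previous slide):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the 10-dimensional LSA vectors down to 2D for plotting
coords = TSNE(n_components=2, random_state=42).fit_transform(lsa_tfidf_data)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
plt.show()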
Topic Modeling Summary – A Reminder
● Choose some algorithm to figure out how words are related
(LDA, LSA, NMF)
● Create a latent space that makes use of those in-document correlations to
transition from a “word space” to a latent “topic space”
● Each axis in the latent space represents a topic
● We, as the humans, must understand the meaning of the topics. The machine
doesn’t understand the meanings, just the correlations.
Appendix
LSA with Python (GenSim)
Output:
Input:
from gensim import corpora, models, utils
lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200)
lsi.print_topics(num_topics=2, num_words=5)
topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response"
topic #1(2.542): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system"
Editor's Notes
  1. Recall that a latent space is a space defined by latent features: features that are linear combinations of our original features. A classic example is “size” being a latent feature for our original features of “height” and “weight”. In NLP, the most common context for latent features is “topics” being the latent features, as linear combinations of “words”, which are our original features.
  2. Finding latent features also often allows us to do dimensionality reduction: for example, if we start with describing each document as made up of 1000 words from our corpus, we could reduce that down to each document being made up of 10 topics from our corpus. Note: Dimensionality reduction is a technique used to reduce the number of features/dimensions in a high-dimensional dataset (i.e., it transforms the data in the high-dimensional space to a space of fewer dimensions).
  3. Clustering is an unsupervised learning technique of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
  4. For context: the next few slides after this one will walk through how we can use singular value decomposition for dimensionality reduction.
  5. Recall that M is our original data. Thus, the eigenvectors of MM* can be related back to our original data.
  6. Similarly, the eigenvectors of M*M can also be related back to our original data.
  7. Approximating our data with fewer dimensions is exactly what we want to do with dimensionality reduction!
  8. The key point of this slide is just: SVD - what we just walked through, a technique that can be applied to any matrix - applied to text data specifically, is called “Latent Semantic Analysis”.
  9. The result ends up being very similar to topic modeling! Each document is broken down into “latent features”, and each latent feature is a linear combination of individual words.
  10. Even though it’s mostly clear from the slide, we’ll reiterate: this implements the “LSA” technique that splits topics into words, and documents into topics, shown before at a conceptual level. “Under the hood”, this is just using the truncated SVD matrix decomposition procedure.
  11. Note that this code uses the same “lsa_tfidf” and “tfidf_data” objects from before.
  12. Motivation for this slide: once we have converted our documents into the latent space - recall that this means the “topic” space - how can we cluster our documents, or more generally, tell if the documents are similar to each other? We could just use Euclidean distance. Recall, though, that if our original data is represented by the output of a “CountVectorizer” function, the resulting topic values will be a weighted sum of the number of times words from that topic appear in that document. So, longer documents will be counted differently than shorter documents. What we want, however, is for documents to be clustered based on meaning - not document length! Using Cosine distance instead of Euclidean distance can be a good option in this scenario.
  13. Here, “Cars” and “Dogs” are topics, and the dots are documents.
  14. Key point of these next few slides is to illustrate concretely: our conclusions about which documents are “close to each other” can be different depending on whether we use Euclidean distance or Cosine distance.
  15. A real world application: websites use these techniques to give you recommend “articles you might like”.
  16. Note: lsa_tfidf_data the same as before.
  17. Visualization made using T-SNE, a technique for visualizing high dimensional representations of data in two dimensions. Read more about T-SNE here: https://lvdmaaten.github.io/tsne/#implementations To see the specific implementation used to make this visualization, see here: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
  18. Code requires a corpus and a topic-ID-to-word mapping (id2word).
  19. Code requires a corpus and a topic-ID-to-word mapping (id2word).