This document provides an overview of matrix decomposition techniques for dimensionality reduction and topic modeling, specifically principal component analysis (PCA), singular value decomposition (SVD), latent semantic analysis (LSA), and non-negative matrix factorization (NMF). PCA and SVD are introduced as mathematical techniques to reduce dimensions while preserving variance/information. LSA and NMF are described as applying SVD and NMF respectively to text data to derive topic models from the latent semantic space. Examples of implementing these techniques in Python are also provided.
2. Matrix Decomposition
● Last week we examined the idea of latent spaces and how we could use Latent Dirichlet Allocation to create a "topic space."
● LDA is not the only method to create latent spaces, so
today we’ll investigate some more “mathematically
rigorous” ways to accomplish the same task.
● Let’s start by reviewing latent spaces…
5. Latent Spaces – A Reminder
● Latent spaces are a “hidden” set of axes that we use to look at the information
in a different way.
● For NLP, we typically want to go from a “word space,” which may be 1000s of
dimensions, to a smaller dimensionality “combinations of words space.” We
called this a “topic space” last week.
● We used LDA to decide on how the topics were defined last week – it’s not our
only option.
● Clustering can work MUCH better in these lower dimensional spaces due to
the curse of dimensionality.
● Let’s look at some other ways to get to latent spaces.
6. Dimensionality Reduction without NLP
● Dimensionality reduction is an “old” technique in terms of mathematical
problems. The most famous version is Principal Component Analysis
(PCA).
7. Dimensionality Reduction without NLP
● A principal component analysis shifts the axes around to find a set of axes that capture most of the variance in the data. In this example, with just this one axis, we're getting the majority of how the data changes.
[Figure: Principal Component 1]
8. Dimensionality Reduction without NLP
● If we add back the same number of components we capture the data fully. However, we don't HAVE to add that last component back. We could just convert to a lower dimensionality representation of the data.
[Figure: Principal Component 1 and Principal Component 2]
9. Dimensionality Reduction without NLP
● If we stick with one dimension in this example, we've captured MOST of what makes each data point special, and we're only storing half as much data (PC1 vs [x, y]).
[Figure: Principal Component 1]
10. Simple PCA Example in Python
Input:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = pd.DataFrame(iris.data)
iris_data.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
print(iris_data.head())
Output:
The first five rows of the iris DataFrame.
11. Simple PCA Example in Python
Input:
from sklearn.decomposition import PCA
pca_1 = PCA(n_components=1)
iris_1_component = pca_1.fit_transform(iris_data)
print(pca_1.components_[0])
Output:
array([ 0.36158968, -0.08226889,  0.85657211,  0.35884393])
First principal component is:
0.36 * Sepal_Length + (-0.08 * Sepal_Width) + 0.86 * Petal_Length + 0.36 * Petal_Width
12. Simple PCA Example in Python
Input:
import pandas as pd
iris_1_component_df = pd.DataFrame(iris_1_component,
                                   columns=['new_component'])
print(iris_1_component_df.head())
Output:
Values of the observations on this new component.
13. Principal Component Analysis: The Math
● In PCA, we start by finding the covariance matrix. Then we can find eigenvalues and eigenvectors of that matrix.
● Let's imagine that we have some covariance matrix we're calling A. We can imagine there are some vectors x for which applying A is equivalent to multiplying by a scalar λ:
A x = λ x          (1)
(A − λ I) x = 0
det(A − λ I) = 0
● This last line gives us an equation we can solve for the λ's – the eigenvalues. Plugging each eigenvalue back into equation (1) then gives us the eigenvectors. (A short numeric sketch follows below.)
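To make this concrete, here is a minimal numpy sketch (not from the slides) of the covariance-matrix route. It assumes the iris_data DataFrame from slide 10; up to sign, the leading eigenvector should match the pca_1.components_[0] output on slide 11.

import numpy as np

# Minimal sketch (assumes iris_data from slide 10): PCA via the covariance matrix.
X = iris_data.to_numpy()
X_centered = X - X.mean(axis=0)            # center each feature first

A = np.cov(X_centered, rowvar=False)       # the covariance matrix (4 x 4)
eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues / eigenvectors of a symmetric matrix

order = np.argsort(eigvals)[::-1]          # largest eigenvalue = most variance captured
components = eigvecs[:, order]

iris_1d = X_centered @ components[:, :1]   # project onto the first principal component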
14. Principal Component Analysis
● The eigenvectors will define the new axis set, since the eigen-
decomposition is asking “what axes can we find that make the
covariance matrix as small (and diagonal) as possible.”
● The new space is a latent space. Great!
● So, can we just use this with text?
15. Principal Component Analysis
● Not really.
● One way to calculate principal components uses the covariance matrix of the data.
● A covariance matrix requires that the variance of each data point be truly meaningful. For example, does the appearance of x1 precipitate a change in x2? For something like CountVectorizer, we're approximating those relationships in a multinomial way, but we're not really set up to truly understand the covariance.
16. Enter: Singular Value Decomposition (SVD)
● Singular value decomposition is another method of finding the principal components, but without relying on the covariance matrix. We can interact with the data directly to get our resulting latent space. We also don't need a square matrix like in PCA (see the quick check below).
● SVD takes one matrix – our data M – and converts it into a product of three matrices: M = U 𝛴 V*.
● This separation is called a "matrix factorization." We're taking one matrix and breaking it into sub-matrices or "factors."
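A quick illustration (not from the slides): the three factors multiply back to the original matrix exactly, and the matrix doesn't need to be square.

import numpy as np

# Illustration only: SVD of a non-square matrix and exact reconstruction.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                        # e.g. 6 documents x 4 words

U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, sigma.shape, Vt.shape)              # (6, 4) (4,) (4, 4)
print(np.allclose(M, U @ np.diag(sigma) @ Vt))     # True: U @ diag(sigma) @ V* rebuilds M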
18. Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● The columns of U are the eigenvectors of MM*, meaning those
eigenvectors are informed about the data. They know something about
our data set.
19. Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● The rows of V are the eigenvectors of M*M, so they also know
something about our data.
20. Singular Value Decomposition (SVD)
Why does converting to 3 matrices help?
● 𝛴 is made up of the square roots of the eigenvalues from both MM*
and M*M, placed along the diagonal. These are known as the
“Singular Values.”
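This relationship between singular values and eigenvalues is easy to check numerically; a small sketch (not from the slides):

import numpy as np

# Check: the singular values of M are the square roots of the eigenvalues of M*M.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))

singular_values = np.linalg.svd(M, compute_uv=False)
eigenvalues = np.linalg.eigvalsh(M.T @ M)[::-1]      # eigenvalues of M*M, largest first
print(np.allclose(singular_values, np.sqrt(np.clip(eigenvalues, 0, None))))   # True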
21. Singular Value Decomposition (SVD)
● If all the vectors and singular values are aligned properly, these three
matrices multiply together to give us exactly our data back.
● However, what if we chopped off the vectors and singular values that
are small, and only contribute a little bit to giving us our data back?
We’d then be approximating our data and getting close to the right
answer… but we’d have fewer dimensions!
● The remaining eigenvectors can help us determine our latent space
axes as well.
23. SVD in Python
Input:
import numpy as np
U, sigma, Vt = np.linalg.svd(iris_data)
iris_reconst = U[:, :4] @ np.diag(sigma) @ Vt
print(np.array(iris_reconst[:5, :]))  # first 5 rows of reconstructed data
print(np.array(iris_data)[:5, :])     # first 5 rows of original data
Output:
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
Identical
24. SVD in Python (reducing dimensions)
Input:
import numpy as np
U, sigma, Vt = np.linalg.svd(iris_data)
iris_reconst = U[:, :2] @ np.diag(sigma[:2]) @ Vt[:2]
print(np.array(iris_reconst[:5, :]))  # first 5 rows of reconstructed data
print(np.array(iris_data)[:5, :])     # first 5 rows of original data
Output:
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.7, 3.2, 1.5, 0.3],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.3],
       [ 5.1, 3.5, 1.4, 0.2]])
Close, with only two components kept
25. Latent Semantic Analysis (LSA)
● If we apply SVD, and keep track of how the words (features) are combined in the latent space, it's called Latent Semantic Analysis (LSA).
● Essentially, we do SVD while reducing the dimensionality ("Truncated SVD"). Then we keep track of the fact that the new 1st dimension is made up mostly of old dimensions that were called "bat, glove, base, and outfield."
● That sounds an awful lot like topic modeling.
26. Topic Modeling
                                          Pet Rabbits   Eating Rabbit
"I love my pet rabbit."                        87%            13%
"That's my pet rabbit's favorite dish."        42%            58%
"She cooks the best rabbit dish."              14%            84%
● Documents are not required to be one topic or another. They will be an overlap
of many topics.
● For each document, we will have a ‘percentage breakdown’ of how much of
each topic is involved.
● The topics will not be defined by the machine; we must interpret the word groupings.
27. Why does LSA work?
[Diagram: DATA (Words × Documents) = U (Words × New Dimensions) · 𝛴 (New Dimensions × New Dimensions) · V* (New Dimensions × Documents)]
We end up creating a system where the words and documents are both understood in our latent space individually. We can then recombine them to get our original data back.
28. Why does LSA work?
[Diagram: the same factorization, with 𝛴 highlighted as the weights]
The 𝛴 matrix acts as a set of weights to determine how the words and documents combine together in the latent space to correctly represent the real data.
29. Why does LSA work?
[Diagram: the same factorization, with the portion of 𝛴 to be truncated shown in red]
If we truncate our analysis to a certain number of topics, we're essentially saying let's only take the most USEFUL hidden dimensions. That's akin to setting everything in the red region of 𝛴 to 0. That means those word*document combinations no longer contribute.
30. Why does LSA work?
[Diagram: the truncated factorization]
If these dimensions don't contribute, we're now limited to a smaller number of "topics" to represent our data. Which means we won't be able to reproduce the data exactly, but as long as we don't cut too many topics, we should still get a good representation of the data.
31. Latent Semantic Analysis (LSA)
So an LSA pipeline for topic modeling would look like:
1. Preprocess the text
2. Vectorize the text (TF-IDF, Binomial, Multinomial…)
3. Reduce the dimensionality with SVD, keeping track of the feature
composition of the new latent space
4. Investigate the output "topics"
32. LSA with Python - Dataset Creation
Input:
# Dataset creation same as in prior weeks
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

ng_train = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes'))

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                   stop_words='english',
                                   token_pattern=r"\b[a-z][a-z]+\b",
                                   lowercase=True,
                                   max_df=0.6)

tfidf_data = tfidf_vectorizer.fit_transform(ng_train.data)
33. LSA with Python - Modeling
Input:
from sklearn.decomposition import TruncatedSVD
lsa_tfidf = TruncatedSVD(n_components=10)
lsa_tfidf_data = lsa_tfidf.fit_transform(tfidf_data)
show_topics(lsa_tfidf, tfidf_vectorizer.get_feature_names())  # a function I wrote
Output:
Topic 0: graphics, image, files, file, thanks, program, format, ftp, windows, gif
Topic 1: god, atheism, people, atheists, say, exist, belief, religion, believe, bible
Topic 2: just, aspects graphics, aspects, group, graphics, groups, split, posts week, week group, think
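The show_topics helper isn't shown on the slides ("a function I wrote"); a minimal sketch of what such a helper might look like (this implementation is an assumption, not the author's code):

import numpy as np

def show_topics(model, feature_names, n_words=10):
    # Hypothetical stand-in for the author's helper: print the top-weighted words per topic.
    for topic_idx, weights in enumerate(model.components_):
        top = np.argsort(weights)[::-1][:n_words]
        print(f"Topic {topic_idx}: " + ", ".join(feature_names[i] for i in top))

A helper like this works for both TruncatedSVD and NMF, since both expose a components_ array of shape (number of topics, number of features).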
34. Non-Negative Matrix Factorization (NMF)
● SVD isn’t the only way to do matrix factorization. That means LSA isn’t
the only way to do topic modeling with matrix factorization.
● Non-negative Matrix Factorization is another mathematical technique
to decompose a matrix into sub-matrices.
● NMF assumes there are no negative numbers in your original matrix.
35. Non-Negative Matrix Factorization (NMF)
● In NMF, we break the data matrix M down into two component matrices: M ≈ W H.
● W is the features matrix. It will be of the shape (number of words, number of topics). So a system with 5000 words and 10 topics will have W be (5000, 10).
● H is the "coefficients matrix." It's of the shape (number of topics, number of documents).
● The combination of these two tells us how each word contributes to each document, via some assumed middle dimension of size 10 (in this example). That hidden dimension is our topic space! (A quick shape check follows below.)
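As a quick shape check (a sketch, not from the slides), assuming M is a words × documents matrix as described above:

import numpy as np
from sklearn.decomposition import NMF

# Shape check only: 5000 "words" x 200 "documents", 10 topics, random non-negative data.
rng = np.random.default_rng(42)
M = rng.random((5000, 200))

nmf = NMF(n_components=10, init='nndsvd', max_iter=300)
W = nmf.fit_transform(M)          # features matrix:     (5000 words, 10 topics)
H = nmf.components_               # coefficients matrix: (10 topics, 200 documents)

print(W.shape, H.shape)           # (5000, 10) (10, 200)
print(np.abs(M - W @ H).mean())   # W @ H only approximates M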
36. Non-Negative Matrix Factorization (NMF)
● Since NMF assumes no negative values, we have to be careful about any specialized document-vectorization we do. For instance, depending on how you normalize TF-IDF, it may be possible to end up with a negative value.
● NMF and LSA are very similar in technique, but can end up with different results due to different underlying assumptions.
● Always try both, to see which is working better for your dataset!
37. NMF with Python
Input:
from sklearn.decomposition import NMF
nmf_tfidf = NMF(n_components=10)
nmf_tfidf_data = nmf_tfidf.fit_transform(tfidf_data)
show_topics(nmf_tfidf, tfidf_vectorizer.get_feature_names())  # a function I wrote
Output:
Topic 0: jpeg, image, gif, file, color, images, format, quality, version, files
Topic 1: edu, graphics, pub, mail, ray, send, ftp, com, objects, server
Topic 2: jesus, matthew, prophecy, people, said, messiah, isaiah, psalm, david, king
38. Some rules of thumb, that you will break:
● In practice, LSA/NMF tends to outperform LDA on small documents.
Things like tweets, forum posts, etc. That’s not always true, but worth
keeping in mind.
● In contrast, doing LSA/NMF on HUGE documents can lead to needing
100s of topics to make sense of things. LDA usually does a better job
on big documents with many topics.
● LSA and NMF are similar techniques and one doesn’t systematically
outperform the other.
● LSA and NMF both perform differently with TF-IDF vs Count Vectorizer
vs Binomial Vectorizers. Try them all.
39. Some rules of thumb, that you will break:
● If you’re getting topics that make
sense, and your code isn’t buggy, your
process is good. There’s a lot of art in
NLP, and especially in topic modeling.
40. Similarity
● Two similar documents should be near each other in the latent space,
since they should have overlapping topics.
● We can’t rely on Euclidean distance in this space, because we want to
count documents as similar whether they have 900 mentions of a
certain topic, or only 600 mentions of a certain topic.
● Cosine distance measures this similarity well.
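To see this concretely, here is a tiny sketch (not from the slides) with two hypothetical topic vectors that have the same mix of topics but very different lengths:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two documents with the same proportions of two topics, but different lengths.
doc_a = np.array([[900.0, 90.0]])   # 900 mentions of topic 1, 90 of topic 2
doc_b = np.array([[600.0, 60.0]])   # same mix, shorter document

print(np.linalg.norm(doc_a - doc_b))     # Euclidean distance is large (about 301)
print(cosine_similarity(doc_a, doc_b))   # cosine similarity is 1.0 - same direction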
47. Similarity
● By using cosine distance in the topic space, the length of the
documents doesn’t matter – either relative or absolute.
● Comparisons in the latent space can allow us to make
recommendations: “If you like this, you’ll probably like this!”
● If we know each document's breakdown in the latent space, we can find documents about the same topics.
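For example, a small sketch of a "documents like this one" lookup (assumes the lsa_tfidf_data array from slide 33; the exact recommendation code isn't on the slides):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Find the 5 documents most similar to document 0 in the topic space.
sims = cosine_similarity(lsa_tfidf_data[0:1], lsa_tfidf_data)[0]
most_similar = np.argsort(sims)[::-1][1:6]   # skip index 0, the document itself
print(most_similar)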
48. Clustering
● In our latent spaces, if we’ve done our dimensionality reduction, we
should get much more meaningful clusters.
● LSA/NMF allow us to use clustering, just like LDA does.
● We also know that distances are more meaningful. So two documents that are very similar will sit in very similar places in the topic space. Example: two articles about baseball should be very high on the "bat, glove, outfield" topic and very low on the "gif, jpg, png" topic.
● We can use this for document recommendation too!
49. Clustering with Python
Input:
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=25, random_state=42)
# Note: similarity in KMeans is determined by Euclidean distance
labels = clus.fit_predict(lsa_tfidf_data)
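Since KMeans uses Euclidean distance but the earlier slides argue for cosine distance in the topic space, one common workaround (an assumption on my part, not shown on the slides) is to L2-normalize the LSA vectors first, so that Euclidean distances behave like cosine distances:

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# L2-normalize each document's topic vector, then cluster as before.
lsa_normalized = normalize(lsa_tfidf_data)   # each row scaled to unit length
clus = KMeans(n_clusters=25, random_state=42)
labels = clus.fit_predict(lsa_normalized)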
51. Topic Modeling Summary – A Reminder
● Choose some algorithm to figure out how words are related
(LDA, LSA, NMF)
● Create a latent space that makes use of those in-document correlations to
transition from a “word space” to a latent “topic space”
● Each axis in the latent space represents a topic
● We, as the humans, must understand the meaning of the topics. The machine
doesn’t understand the meanings, just the correlations.
Recall that a latent space is a space defined by latent features: features that are linear combinations of our original features. A classic example is “size” being a latent feature for our original features of “height” and “weight”.
In NLP, the most common context for latent features is “topics” being the latent features, as linear combinations of “words”, which are our original features.
Finding latent features also often allows us to do dimensionality reduction: for example, if we start with describing each document as made up of 1000 words from our corpus, we could reduce that down to each document being made up of 10 topics from our corpus.
Note: Dimensionality reduction is a technique used to reduce the number of features/dimensions in a high-dimensional dataset (i.e., it transforms the data from the high-dimensional space to a space of fewer dimensions).
Clustering is an unsupervised learning technique for grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
For context: the next few slides after this one will walk through how we can use singular value decomposition for dimensionality reduction.
Recall that M is our original data. Thus, the eigenvectors of MM* can be related back to our original data.
Similarly, the eigenvectors of M*M can also be related back to our original data.
Approximating our data with fewer dimensions is exactly what we want to do with dimensionality reduction!
The key point of this slide is just: SVD - what we just walked through, a technique that can be applied to any matrix - applied to text data specifically, is called “Latent Semantic Analysis”.
The result ends up being very similar to topic modeling! Each document is broken down into “latent features”, and each latent feature is a linear combination of individual words.
Even though it’s mostly clear from the slide, we’ll reiterate: this implements the “LSA” technique that splits topics into words, and documents into topics, shown before at a conceptual level. “Under the hood”, this is just using the truncated SVD matrix decomposition procedure.
Note that this code uses the same “lsa_tfidf” and “tfidf_data” objects from before.
Motivation for this slide: once we have converted our documents into the latent space - recall that this means the “topic” space - how can we cluster our documents, or more generally, tell if the documents are similar to each other?
We could just use Euclidean distance. Recall, though, that if our original data is represented by the output of a “CountVectorizer” function, the resulting topic values will be a weighted sum of the number of times words from that topic appear in that document. So, longer documents will be counted differently than shorter documents. What we want, however, is for documents to be clustered based on meaning - not document length!
Using Cosine distance instead of Euclidean distance can be a good option in this scenario.
Here, “Cars” and “Dogs” are topics, and the dots are documents.
Key point of these next few slides is to illustrate concretely: our conclusions about which documents are “close to each other” can be different depending on whether we use Euclidean distance or Cosine distance.
A real world application: websites use these techniques to recommend "articles you might like".
Note: lsa_tfidf_data is the same as before.
Visualization made using T-SNE, a technique for visualizing high dimensional representations of data in two dimensions. Read more about T-SNE here: https://lvdmaaten.github.io/tsne/#implementations
To see the specific implementation used to make this visualization, see here: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
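The exact plotting code isn't included; a minimal sketch of the projection step, assuming the lsa_tfidf_data array from before:

from sklearn.manifold import TSNE

# Project the 10-dimensional LSA topic vectors down to 2-D for plotting.
tsne = TSNE(n_components=2, random_state=42)
lsa_2d = tsne.fit_transform(lsa_tfidf_data)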
Code requires a corpus and a topic-ID-to-word mapping (id2word).