Latent Dirichlet Allocation
Soojung Hong
Feb 6. 2017
Contents
❏ Introduction : Text corpora modeling
❏ Background and Terminology
❏ Latent Dirichlet Allocation
❏ Comparison with other latent variable models
❏ Inference and Parameter Estimation
❏ Applications and Empirical Results
❏ Summary
Text corpora modeling
❏ Goal
Find short descriptions of the members of a collection
(e.g. find the topics of the documents in a corpus),
where the descriptions preserve the essential statistical relationships
❏ Application areas
classification, novelty detection, summarization and collaborative filtering
❏ Relevant approaches for text modeling
tf-idf scheme, LSI, pLSI and latent Dirichlet allocation (LDA)
Summary of other approaches

tf-idf
Advantage : reduces documents of arbitrary length to fixed-length lists of numbers
Disadvantage : provides only a relatively small amount of reduction and reveals little about
the inter- and intra-document statistical structure

LSI
Advantage : reduces the dimensionality and learns latent topics by performing a matrix
decomposition (SVD) on the term-document matrix
Disadvantage : it is not clear why one should use LSI as a generative model of text

pLSI
Advantage : a significant step forward in probabilistic modeling of text
Disadvantages : no probabilistic model at the level of documents; the number of parameters
grows linearly with the size of the corpus, which leads to overfitting; it is not clear how to
assign probability to a document outside of the training set
Details of tf-idf
● Term Frequency - Inverse Document Frequency : tf-idf(t, d) = tf(t, d) × log(N / df(t)),
where N is the number of documents and df(t) is the number of documents containing term t
● Term Frequency : how frequently a word occurs in a document
● Inverse Document Frequency : how rare the term is across the whole corpus
http://www.joyofdata.de/blog/tf-idf-statistic-keyword-extraction/
Words with a high tf-idf score occur frequently within a document but rarely across the
corpus, and therefore carry more information about that document.
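For concreteness, here is a minimal, hedged sketch of tf-idf in practice, using a toy corpus and illustrative variable names (note that scikit-learn uses a smoothed idf variant, so its weights differ slightly from the plain formula above):

```python
# Minimal tf-idf sketch on a toy corpus (illustrative only).
# scikit-learn applies a smoothed idf, so exact weights differ slightly
# from the plain tf(t, d) * log(N / df(t)) formula above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)             # each document becomes a fixed-length tf-idf vector
print(X.shape)                                 # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])  # a few vocabulary terms
```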
Details of LSI (also called LSA)
❏ Analysis of latent semantics in a corpus of text (by using Singular Value Decomposition)
❏ A collection of documents can be represented as a term-document matrix
❏ Similarity of doc-doc, term-term and term-doc pairs can be measured by cosine similarity
❏ Raw term matching copes poorly with polysemy and synonymy
❏ LSI transforms the original data into a different space so that two documents/words about
the same concept are mapped close together
https://simonpaarlberg.com/post/latent-semantic-analyses/
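As a small sketch of the LSI idea (toy corpus; 2 latent dimensions chosen arbitrarily), truncated SVD of the document-term count matrix maps documents into a low-dimensional concept space where cosine similarity can be computed:

```python
# LSI / LSA sketch: truncated SVD on the document-term count matrix (toy example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stocks fell as markets slid",
]

counts = CountVectorizer().fit_transform(docs)      # document-term matrix
lsi = TruncatedSVD(n_components=2, random_state=0)  # keep 2 latent "concept" dimensions
doc_vecs = lsi.fit_transform(counts)                # documents mapped into the latent space

print(cosine_similarity(doc_vecs))                  # doc-doc similarity in concept space
```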
Details of pLSI
❏ pLSI models each word in a document as a sample from a mixture model
❏ The mixture components of the mixture model are multinomial random variables that can be
viewed as representations of “topics”. Thus each word is generated from a single topic
❏ Each document is represented as a list of mixing proportions for these mixture components
(“topics”)
d : the document index variable
c : a word's topic drawn from the document's topic distribution P(c|d)
w : a word drawn from the word distribution of this word's topic P(w|c)
d and w : observable variables
topic c : a latent variable (represented as z in the above)
Plate notation representing the pLSA model
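In symbols, the plate diagram above corresponds to the usual pLSI factorization (same d, c, w as in the bullets above):

```latex
P(d, w) \;=\; P(d) \sum_{c} P(c \mid d)\, P(w \mid c)
```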
Two assumptions
❏ Bag-of-words (the fundamental probabilistic assumption behind the dimensionality reduction)
The order of the words in a document can be neglected.
This is an assumption of exchangeability for the words in a document.
❏ Exchangeability of documents
The specific ordering of the documents in a corpus can also be neglected.
❏ Exchangeability is not equivalent to assuming that the random variables are independent
and identically distributed; rather, it means they are conditionally independent and
identically distributed, where the conditioning is with respect to an underlying latent
parameter of a probability distribution.
Notation and Terminology
❏ Latent variables aim to capture abstract notions such as topics
❏ A word is the basic unit of discrete data, defined as an item from a vocabulary indexed by
{1, ..., V}. Words are represented as unit-basis vectors with a single component equal to one
and all other components equal to zero: the v-th word in the vocabulary is the V-vector w
such that w^v = 1 and w^u = 0 for u ≠ v
❏ A document is a sequence of N words denoted by w = (w1, w2, ..., wN)
❏ A corpus is a collection of M documents denoted by D = {w1, w2, ..., wM}
Latent Dirichlet Allocation
❏ Generative probabilistic model of a corpus
The basic idea is that documents are represented as random mixtures over latent topics,
where each topic is characterized by a distribution over words.
❏ LDA generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ)
2. Choose θ ∼ Dir(α)
3. For each of the N words wn:
(a) Choose a topic zn ∼ Multinomial(θ).
(b) Choose a word wn from p(wn |zn,β), a multinomial probability conditioned on the topic zn.
❏ Several simplifying assumptions : [1] The dimensionality k of the Dirichlet distribution (and
thus the dimensionality of the topic variable z) is assumed known and fixed. [2] The word
probabilities are parameterized by a k × V matrix β with βij = p(w^j = 1 | z^i = 1), which for
now we treat as a fixed quantity to be estimated. [3] The Poisson assumption is not critical to
anything that follows. [4] N is independent of all the other data-generating variables (θ and z).
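The generative process above can be written down almost verbatim as code. The following is a hedged numpy sketch with toy values for k, V, α, β and ξ (none of these values come from the paper):

```python
# Sketch of the LDA generative process for one document (toy parameter values).
import numpy as np

rng = np.random.default_rng(0)
k, V = 4, 10                                 # number of topics, vocabulary size (toy)
alpha = np.full(k, 0.5)                      # Dirichlet parameter alpha (toy)
beta = rng.dirichlet(np.ones(V), size=k)     # k x V topic-word matrix beta (toy)
xi = 8                                       # Poisson mean xi (toy)

N = rng.poisson(xi)                          # 1. choose N ~ Poisson(xi)
theta = rng.dirichlet(alpha)                 # 2. choose theta ~ Dir(alpha)
doc = []
for _ in range(N):                           # 3. for each of the N words
    z = rng.choice(k, p=theta)               #    (a) choose topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])             #    (b) choose word w_n from p(w_n | z_n, beta)
    doc.append(w)

print(doc)                                   # the document as a list of word indices
```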
LDA (continued)
● A k-dimensional Dirichlet random variable θ takes values in the (k−1)-simplex
(a k-vector θ lies in the (k−1)-simplex if θi ≥ 0 and Σi θi = 1)
● The probability density on this simplex is
p(θ|α) = ( Γ(α1 + ··· + αk) / (Γ(α1) ··· Γ(αk)) ) θ1^(α1−1) ··· θk^(αk−1)
● The parameter α is a k-vector with components αi > 0, and Γ(x) is the Gamma function
● Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics
z, and a set of N words w is given by
p(θ, z, w | α, β) = p(θ|α) ∏n p(zn|θ) p(wn|zn,β)
LDA (continued)
● Here p(zn|θ) is simply θi for the unique i such that zn^i = 1.
● Integrating over θ and summing over z, we obtain the marginal distribution of a document:
p(w|α,β) = ∫ p(θ|α) ( ∏n Σzn p(zn|θ) p(wn|zn,β) ) dθ
● The probability of a corpus is obtained by taking the product of the marginal probabilities
of the single documents:
p(D|α,β) = ∏d ∫ p(θd|α) ( ∏n Σzdn p(zdn|θd) p(wdn|zdn,β) ) dθd
Graphical Model Representation of LDA
Three levels to the LDA representation
1. The parameters α and β are corpus-level parameters, assumed to be sampled once in the
process of generating a corpus
2. The variables θd are document-level variables, sampled once per document
3. The variables zdn and wdn are word-level variables, sampled once for each word in each
document
LDA and Exchangeability
● A finite set of random variables {z1, ..., zN} is said to be exchangeable if the joint
distribution is invariant under permutation: if π is a permutation of the integers from 1 to N,
then p(z1, ..., zN) = p(zπ(1), ..., zπ(N)).
● An infinite sequence of random variables is infinitely exchangeable if every finite
subsequence is exchangeable.
● De Finetti’s representation theorem states that “the joint distribution of an infinitely
exchangeable sequence of random variables is as if a random parameter were drawn from some
distribution and then the random variables in question were independent and identically distributed,
conditioned on that parameter”.
● By de Finetti’s theorem, the probability of a sequence of words and topics must have the
form p(w, z) = ∫ p(θ) ( ∏n p(zn|θ) p(wn|zn) ) dθ,
where θ is the random parameter of a multinomial over topics.
Continuous mixture of unigrams
● The LDA model is more elaborate than the two-level models often studied in the classical
hierarchical Bayesian literature
● By marginalizing over the hidden topic variable z, however, LDA can be understood as a
two-level model
● The resulting word distribution is p(w|θ,β) = Σz p(w|z,β) p(z|θ)
Continuous mixture of unigrams (continued)
Given this word distribution, the generative process for a document w is:
1. Choose θ ~ Dir(α)
2. For each of the N words wn :
Choose a word wn from p(wn | θ, β)
This process defines the marginal distribution of a document as a continuous mixture
distribution,
p(w|α,β) = ∫ p(θ|α) ( ∏n p(wn|θ,β) ) dθ,
where p(θ|α) are the mixture weights and p(wn|θ,β) are the mixture components,
marginalized over θ.
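To make the continuous-mixture view concrete, here is a rough Monte Carlo illustration (toy α, β and document; not a practical inference method): averaging ∏n p(wn|θ,β) over samples θ ~ Dir(α) approximates the marginal p(w|α,β).

```python
# Monte Carlo illustration of the continuous mixture of unigrams (toy values only).
import numpy as np

rng = np.random.default_rng(0)
k, V = 4, 10
alpha = np.full(k, 0.5)                           # toy Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)          # toy k x V topic-word matrix
doc = [1, 3, 3, 7]                                # a document as word indices (toy)

thetas = rng.dirichlet(alpha, size=5000)          # samples theta ~ Dir(alpha)
word_probs = thetas @ beta                        # p(w | theta, beta) for every sample and word
doc_likelihoods = word_probs[:, doc].prod(axis=1) # prod_n p(w_n | theta, beta) per sample

print(doc_likelihoods.mean())                     # estimate of p(w | alpha, beta)
```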
Interpretation of LDA
Example density on unigram distributions p(w|θ,β) under LDA for three words and four topics.
The triangle embedded in the x-y plane is the 2-D simplex representing all possible
multinomial distributions over three words. Each of the vertices corresponds to a
deterministic distribution that assigns probability one to one of the words; the midpoint of
an edge gives probability 0.5 to two of the words; and the centroid of the triangle is the
uniform distribution over all three words. Also shown are the locations of the multinomial
distributions p(w|z) for each of the four topics, and the surface shown on top of the simplex
is an example of a density over the (V−1)-simplex (multinomial distributions over words)
given by LDA.
Comparison : LDA and other latent variable models
(a) Unigram model : the words of every document are drawn independently from a single
multinomial distribution.
(b) Mixture of unigrams : augment the unigram model with a discrete random topic variable z
to obtain a mixture of unigrams model. Each document is generated by first choosing a topic z
and then generating N words independently from the conditional multinomial p(w|z).
(c) pLSI model : posits that a document label d and a word wn are conditionally independent
given an unobserved topic z.
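For reference, the document probabilities assumed by these three baselines can be written as follows (same notation as the LDA formulas above):

```latex
\text{Unigram:} \qquad p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

\text{Mixture of unigrams:} \qquad p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

\text{pLSI:} \qquad p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```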
Drawback of pLSI
❏ pLSI does capture the possibility that a document may contain multiple topics
- p(z|d) serves as the mixture weights of the topics for a particular document d
- However, d is a dummy index into the list of documents in the training set. Thus d is a
multinomial random variable restricted to the training documents
❏ pLSI is not a well-defined generative model of documents
- There is no natural way to assign probability to a previously unseen document.
- The parameters for a k-topic pLSI model are k multinomial distributions of size V and
M mixtures over the k hidden topics. This gives kV + kM parameters and therefore
linear growth in M. The linear growth in parameters suggests that the model is prone
to overfitting.
LDA overcomes the pLSI problems
❏ By treating the topic mixture weights as a k-parameter hidden random variable rather
than a large set of individual parameters that are explicitly linked to the training set
❏ LDA is a well-defined generative model and generalizes easily to new documents.
Furthermore, the k + kV parameters in a k-topic LDA model do not grow with the size of the
training corpus, so LDA does not suffer from the same overfitting issue.
(Graphical models of pLSI and LDA shown side by side)
Geometric Interpretation of latent space
Difference between LDA and the other latent topic models (unigram, mixture of unigrams, pLSI):
● The unigram model finds a single point on the word simplex and posits that all words in the
corpus come from the corresponding distribution.
● The mixture of unigrams model posits that for each document, one of the k points on the
word simplex (that is, one of the corners of the topic simplex) is chosen randomly and all the
words of the document are drawn from the distribution corresponding to that point.
● The pLSI model posits that each word of a training document comes from a randomly chosen
topic. The topics are themselves drawn from a document-specific distribution over topics.
● LDA posits that each word of both observed and unseen documents is generated by a randomly
chosen topic which is drawn from a distribution with a randomly chosen parameter.
http://parkcu.com/blog/
Inference and Parameter Estimation
● The key inferential problem that must be solved in order to use LDA is computing the
posterior distribution of the hidden variables given a document. This distribution is
intractable to compute in general. The conditional density can be written as
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
● To normalize the distribution, we marginalize over the hidden variables and write the
denominator in terms of the model parameters as
p(w|α,β) = ( Γ(Σi αi) / ∏i Γ(αi) ) ∫ ( ∏i θi^(αi−1) ) ( ∏n Σi ∏j (θi βij)^(wn^j) ) dθ    (*)
● Problem : the denominator contains the marginal density of the observations (the evidence);
this evidence integral is unavailable in closed form and requires exponential time to compute.
Therefore, we need an approximate inference algorithm.
Inference and Parameter Estimation (continued)
● Goal : find the best candidate approximation q, the one closest in KL divergence to the
exact conditional posterior. Inference thus reduces to solving an optimization problem over a
family of densities on the latent variables, where each member of the family is a candidate
approximation to the exact conditional.
● The problem in (*) is the coupling between θ and β that arises due to the edges between θ,
z and w. By dropping these edges and the w nodes, and endowing the resulting simplified
graphical model with free variational parameters, we obtain a family of distributions on the
latent variables (this is the graphical model of the variational distribution used to
approximate the posterior in LDA). This family is characterized by the following variational
distribution:
q(θ, z | γ, φ) = q(θ|γ) ∏n q(zn|φn)
where the Dirichlet parameter γ and the multinomial parameters (φ1, ..., φN) are the free
variational parameters.
● The next step is to set up an optimization problem that determines the values of the
variational parameters γ and φ.
Parameter Estimation
● Given a corpus of documents D = {w1, w2, ..., wM}
● Find parameters α and β that maximize the (marginal) log likelihood of the data:
ℓ(α, β) = Σd log p(wd | α, β)
● The quantity p(w|α,β) cannot be computed tractably
● Variational inference provides a tractable lower bound on the log likelihood, which we can
maximize with respect to α and β
● We find approximate empirical Bayes estimates for the LDA model via an alternating
variational EM procedure that maximizes a lower bound with respect to the variational
parameters γ and φ, and then, for fixed values of the variational parameters, maximizes the
lower bound with respect to the model parameters α and β.
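In practice, a variational-Bayes LDA very close to this procedure is available off the shelf. The sketch below uses scikit-learn's LatentDirichletAllocation (which implements batch/online variational Bayes, a relative of the variational EM described above, not the paper's exact code); the toy corpus and k are illustrative:

```python
# Hedged usage sketch: variational-Bayes LDA with scikit-learn (toy corpus, toy k).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "the dog chased the cat across the yard",
    "stocks fell as markets slid on rate fears",
    "the central bank raised interest rates",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,            # k topics (the paper's AP example uses k = 100)
    learning_method="batch",   # batch variational Bayes
    random_state=0,
)
doc_topic = lda.fit_transform(X)   # per-document topic proportions from the variational posterior
topic_word = lda.components_       # variational topic-word parameters (related to beta)
print(doc_topic.round(2))
```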
Smoothing
● A large vocabulary size is characteristic of document corpora, which often leads to
problems with sparsity
● Maximum likelihood estimates of the multinomial parameters β assign zero probability to
unseen words, and thus zero probability to new documents
● To avoid this problem, we “smooth” the multinomial parameters β, assigning positive
probability to all vocabulary items whether or not they are observed in the training set
⇒ The proposed solution is to apply variational inference methods to the extended model
that places a Dirichlet smoothing prior on the multinomial parameters β
(graphical model representation of the smoothed LDA model).
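In variational implementations this Dirichlet smoothing on β typically surfaces as a single hyperparameter (η). For example, in scikit-learn it corresponds, roughly, to topic_word_prior; the exact parameterization is implementation-specific, so treat the mapping as an assumption:

```python
# Dirichlet smoothing on beta as a hyperparameter (hedged: parameterizations vary by library).
from sklearn.decomposition import LatentDirichletAllocation

smoothed_lda = LatentDirichletAllocation(
    n_components=100,
    topic_word_prior=0.01,   # eta: puts positive prior mass on every vocabulary item
)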
Example Data
❏ Data : 16,000 documents from a subset of the TREC AP corpus
❏ Preparation : remove a standard list of stop words, then use the EM algorithm to find the
Dirichlet and conditional multinomial parameters for a 100-topic LDA model
❏ The top words from some of the resulting multinomial distributions p(w|z)
(hopefully) capture some of the underlying topics in the corpus
Applications and Empirical Results : (1) Document Modeling
❏ Trained a number of latent variable models on two corpora
❏ Compared the generalization performance (perplexity) of each model
❏ The goal is density estimation, i.e. achieving high likelihood on a held-out test set
❏ Perplexity, used by convention in language modeling, is monotonically decreasing
in the likelihood of the test data
❏ A lower perplexity score indicates better generalization performance
❏ Formally, for a test set of M documents, the perplexity is
perplexity(Dtest) = exp{ − Σd log p(wd) / Σd Nd }
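As a tiny worked example of the formula (the log-likelihoods and document lengths below are made-up numbers, purely for illustration):

```python
# Perplexity from per-document log-likelihoods, following the formula above:
# perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
import numpy as np

log_likelihoods = np.array([-350.2, -410.7, -298.5])   # toy values for log p(w_d)
doc_lengths = np.array([120, 150, 100])                # toy values for N_d

perplexity = np.exp(-log_likelihoods.sum() / doc_lengths.sum())
print(round(perplexity, 1))
```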
Perplexity Results
The mixture of unigrams model and pLSI both suffer from serious overfitting;
LDA can easily assign probability to a new document without overfitting.
Corpora used:
● 5,225 abstracts with 28,414 unique terms
● 6,333 newswire articles with 23,075 unique terms
Applications and Empirical Results : Document classification
❏ Goal : classify a document into two or more mutually exclusive classes. In particular, by
using one LDA module for each class, we obtain a generative model for classification
❏ A challenging aspect of the document classification problem is the choice of features.
Treating individual words as features yields a rich but very large feature set
❏ One way to reduce the feature set is to use an LDA model for dimensionality reduction. In
particular, LDA reduces any document to a fixed set of real-valued features - the posterior
Dirichlet parameters γ*(w) associated with the document
❏ Two binary classification experiments using the Reuters-21578 dataset
❏ The dataset contains 8,000 documents and 15,818 words
❏ The parameters of an LDA model were estimated on all the documents, without reference to
their true class labels
❏ SVM classification : one SVM is trained on the low-dimensional representations provided by
LDA and the other on all the word features (i.e. low-dimensional LDA features vs. all word
features, each followed by SVM classification)
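A hedged sketch of that comparison (a toy corpus and labels, generic scikit-learn components rather than the paper's exact SVM setup):

```python
# Toy comparison: SVM on all word features vs. SVM on LDA-reduced features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

docs = [
    "grain prices rose after the harvest report",
    "the central bank cut interest rates again",
    "wheat and corn exports fell this quarter",
    "bond yields climbed as inflation data surprised",
]
labels = [0, 1, 0, 1]                                       # toy binary class labels

X_words = CountVectorizer().fit_transform(docs)             # all word features
X_lda = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(X_words)                                    # low-dimensional LDA features

print(cross_val_score(LinearSVC(), X_words, labels, cv=2))  # SVM on word features
print(cross_val_score(LinearSVC(), X_lda, labels, cv=2))    # SVM on LDA features
```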
Applications and Empirical Results : Collaborative filtering
❏ Final experiment on the EachMovie collaborative filtering data
❏ A collection of users indicate their preferred movie choices. A user and the movies chosen
are analogous to a document and the words in the document (respectively)
❏ The collaborative filtering task is as follows :
1. Train a model on a fully observed set of users
2. For each unobserved user, we are shown all but one of the movies
preferred by that user and are asked to predict what the held-out
movie is. The different algorithms are evaluated according to the
likelihood they assign to the held-out movie
3. The predictive perplexity on M test users is defined as
predictive-perplexity(Dtest) = exp{ − Σd log p(wd,Nd | wd,1:Nd−1) / M }
Result for collaborative filtering on the EachMovie data
(3,300 training users and 390 test users; predictive perplexity reported for the mixture of
unigrams model and the LDA model).
Summary
❏ Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data.
❏ LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a
finite mixture over an underlying set of topics.
❏ Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
❏ The paper presents efficient approximate inference techniques based on variational methods and an EM
algorithm for empirical Bayes parameter estimation.
❏ The paper reports results in document modeling, text classification and collaborative filtering,
compared to the mixture of unigrams model and the pLSI model.

  • 35. Summary ❏ Latent Dirichlet Allocation is a generative probabilistic model for collections of data. ❏ LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. ❏ Each topic is in turn, modeled as an infinite mixture over an underlying set of topic probability. ❏ This paper presents efficient approximate inference techniques based on variational methods and EM algorithm for empirical Bayes parameter estimation. ❏ The paper reports results in document modeling, text classification and collaborative filtering, compared to mixture of unigrams model and pLSI model.