Probabilistic Topic Models
Surveying a suite of algorithms that offer a solution to managing large document archives
by David M. Blei
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
Presented by Steve Follmer
Bay Area NLP Reading Group
Origins
Today we are reading a survey paper from 2012 by David Blei, the leading exponent of probabilistic topic
modeling.
Blei kicked things off in a 2003 paper co-authored with Michael Jordan and Andrew Ng, Latent Dirichlet
Allocation.
As Blei himself notes, the same model was first published in 2000 in a population-genetics paper (Pritchard, Stephens, and Donnelly).
Further, probabilistic topic modeling does not require LDA, as we will see later in the presentation.
DOI:10.1145/2133806.2133826
Surveying a suite of algorithms that offer a solution to managing large document archives.
BY DAVID M. BLEI
Probabilistic Topic Models
AS OUR COLLECTIVE knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that …

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper—foreign policy, national affairs, sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it—Chinese foreign policy, the conflict in the Middle East, the U.S.’s relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.

But we do not interact with electronic archives in this way. While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected …
Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03
Latent Dirichlet Allocation
David M. Blei BLEI@CS.BERKELEY.EDU
Computer Science Division
University of California
Berkeley, CA 94720, USA
Andrew Y. Ng ANG@CS.STANFORD.EDU
Computer Science Department
Stanford University
Stanford, CA 94305, USA
Michael I. Jordan JORDAN@CS.BERKELEY.EDU
Computer Science Division and Department of Statistics
University of California
Berkeley, CA 94720, USA
Editor: John Lafferty
Abstract
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
text modeling, the topic probabilities provide an explicit representation of a document. We present
efficient approximate inference techniques based on variational methods and an EM algorithm for
Copyright © 2000 by the Genetics Society of America
Inference of Population Structure Using Multilocus Genotype Data
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly
Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Manuscript received September 23, 1999
Accepted for publication February 18, 2000
ABSTRACT
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
IN applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (e.g., Cavalli-Sforza et al. 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A standard approach involves sampling DNA from members of a number of potential source populations and …

… populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic” population structure—i.e., population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the …
Who
David Blei
video
classes
Mimno
Hoffmann
Topic Modeling mailing list
www.cs.columbia.edu/~blei/ (retrieved 9/2017)
I am a professor of Statistics and Computer Science at
Columbia University. I am also a member of the Columbia Data
Science Institute. I work in the fields of machine learning and
Bayesian statistics.
See my CV and publications.
My research interests include:
Topic models
Probabilistic modeling
Approximate Bayesian inference
Here are two recent talks:
Probabilistic Topic Models and User Behavior
Variational Inference: Foundations and Innovations
Most of our publications are attached to open-source software.
See our GitHub page.
We released Edward: A library for probabilistic modeling,
inference, and criticism.
Machine Learning at Columbia
Columbia has a thriving machine learning community, with
many faculty and researchers across departments. The
MLColumbia Google group is a good source of information
about talks and other events on campus.
Teaching
In Spring 2017, I am teaching Applied Causality.
My previous courses are here.
David M. Blei
Columbia University
david.blei@columbia.edu
David Blei - Google Scholar (retrieved 4/18/2017)
David Blei
Professor of Statistics and Computer Science, Columbia University
Verified email at columbia.edu
Cited by 45111

Latent Dirichlet allocation
DM Blei, AY Ng, MI Jordan - Journal of Machine Learning Research, 2003 - jmlr.org
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying … Cited by 18352

Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
YW Teh, MI Jordan, MJ Beal, DM Blei - NIPS, 2004 - papers.nips.cc
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred … Cited by 2896

Supervised topic models
JD Mcauliffe, DM Blei - Advances in Neural Information Processing Systems, 2008 - papers.nips.cc
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational … Cited by 1739

Matching words and pictures
…, P Duygulu, D Forsyth, N Freitas, DM Blei… - Journal of Machine Learning Research, 2003 - jmlr.org
Abstract: We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words … Cited by 1694

Probabilistic topic models
DM Blei - Communications of the ACM, 2012 - dl.acm.org
As our collective knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it …
Topic modeling bibliography (retrieved 4/18/2017)
Bibliometrics
Cross-language
Evaluation
Implementations
Inference
NLP
Networks
Non-parametric
Scalability
Social media
Temporal
Theory
User interface
Vision
Where to start
Topic Modeling Bibliography
Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing. Mixed Membership Stochastic Blockmodels. JMLR (9) 2008, pp. 1981-2014. [Networks]
Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. Topic Significance Ranking of LDA Generative Models. ECML (2009). [Evaluation]
David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu. Statistical Debugging using Latent Topic Models. ECML (2007).
David Andrzejewski, Xiaojin Zhu, Mark Craven. Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. ICML (2009).
David Andrzejewski, Xiaojin Zhu, Mark Craven, Ben Recht. A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation using First-Order Logic. IJCAI …
What
"Topic modeling algorithms are statistical
methods that analyze the words of the
original texts to discover the themes that run
through them, how those themes are
connected to each other, and how they
change over time."
Why
Algorithms for managing large document archives
There's lots of data, and it's growing exponentially. A lot of it is unstructured.
We can better understand, access, search, and use this data if we can organize it. And given the growing scale, we need to do that algorithmically, on computers.
Topic models can find themes and organize documents automatically, with results that can be superior to keyword search. They can also organize these results across time.

Topic Modeling in Action
Let's look at an example of topic modeling, from "A Correlated Topic Model of Science" (Annals of Applied Statistics, 2007).
The goal is to take an unstructured corpus of text documents and infer topics that group the documents together.
Similar models can be applied to genetics, images, social networks, even dance steps.

Example topic from the online Science browser at www.cs.cmu.edu/~lemur/science/2.html:

WORDS: gene, dna, mutations, mutation, mutant, yeast, cells, genes, mutants, type, wild, recombination, fig, protein, growth, telomerase, strains, strain, repair, phenotype, plasmid, allele, human, cell, two, cerevisiae

RELATED TOPICS (each listed by its top words):
gene dna mutations mutation mutant
plants gene plant genes expression
protein cell kinase activity cycle
cells leukemia cell abl patients
dna replication cell chromosome chromosomes
protein sequence amino cdna fig
gene disease human chromosome cancer
rna dna site structure polymerase
cells cell fig expression human

RELATED DOCUMENTS:
"Requirement of the Yeast RTH1 5' to 3' Exonuclease for the Stability
of Simple Repetitive DNA" (1995)
"Adaptive Mutation by Deletions in Small Mononucleotide Repeats"
(1994)
"Recombination in Adaptive Mutation" (1994)
"Evidence That F Plasmid Transfer Replication Underlies Apparent
Adaptive Mutation" (1995)
"Cdc13p: A Single‐Strand Telomeric DNA‐Binding Protein with a Dual
Role in Yeast Telomere Maintenance" (1996)
"Adaptive Mutation in Escherichia coli: A Role for Conjugation" (1995)
"Two Modes of Survival of Fission Yeast Without Telomerase" (1998)
"Association of Increased Spontaneous Mutation Rates with High Levels
of Transcription in Yeast" (1995)
"Removal of Nonhomologous DNA Ends in Double‐Strand Break
Recombination: The Role of the Yeast Ultraviolet Repair Gene RAD1"
(1992)
"Adaptive Mutation: Who's Really in the Garden?" (1995)
"Evidence that Gene Amplification Underlies Adaptive Mutability of the
Bacterial lac Operon" (1998)
"Involvement of the Silencer and UAS Binding Protein RAP1 in
Regulation of Telomere Length" (1990)
"Abnormal Chromosome Behavior in Neurospora Mutants Defective in
DNA Methylation" (1993)
"Telomeres, Telomerase, and Cancer" (1995)
"HSP104 Required for Induced Thermotolerance" (1990)
"Est1 and Cdc13 as Comediators of Telomerase Access" (1999)
"Conditional Mutator Phenotypes in hMSH2‐Deficient Tumor Cell Lines"
(1997)
"Appropriate Partners Make Good Matches" (1995)
© Institute of Mathematical Statistics, 2007
A CORRELATED TOPIC MODEL OF SCIENCE1
BY DAVID M. BLEI AND JOHN D. LAFFERTY
Princeton University and Carnegie Mellon University
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.

1. Introduction. Large collections of documents are readily available on-line and widely accessed by diverse communities. As a notable example, scholarly articles are increasingly published in electronic form, and historical archives are being scanned and made accessible. The not-for-profit organization JSTOR (www.jstor.org) is currently one of the leading providers of journals to the scholarly community. These archives are created by scanning old journals and running an optical character recognizer over the pages. JSTOR provides the original scans on-line, and uses their noisy version of the text to support keyword search. Since the data are largely unstructured and comprise millions of articles spanning centuries of scholarly work, automated analysis is essential. The development of new tools for browsing, searching and allowing the productive use of such archives is thus an important technological challenge, and provides new opportunities for statistical modeling.

Received March 2007; revised April 2007. Supported in part by NSF Grants IIS-0312814 and IIS-0427206, the DARPA CALO project and a grant from Google. Supplementary material and code are available at http://imstat.org/aoas/supplements
Key words and phrases: Hierarchical models, approximate posterior inference, variational methods, text analysis.
TOPIC MODELS
DAVID M. BLEI
PRINCETON UNIVERSITY
JOHN D. LAFFERTY
CARNEGIE MELLON UNIVERSITY
1. INTRODUCTION
Scientists need new tools to explore and browse large collections of scholarly literature. Thanks to organizations such as JSTOR, which scan and index the original bound archives of many journals, modern scientists can search digital libraries spanning hundreds of years. A scientist, suddenly faced with access to millions of articles in her field, is not satisfied with simple search. Effectively using such collections requires interacting with them in a more structured way: finding articles similar to those of interest, and exploring the collection through the underlying topics that run through it.

The central problem is that this structure—the index of ideas contained in the articles and which other articles are about the same kinds of ideas—is not readily available in most modern collections, and the size and growth rate of these collections preclude us from building it by hand. To develop the necessary tools for exploring and browsing modern digital libraries, we require automated methods of organizing, managing, and delivering their contents.

In this chapter, we describe topic models, probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei et al. 2003; Griffiths and Steyvers 2004; Buntine and Jakulin 2004; Hofmann 1999; …)
LDA
LDA was introduced in the Promethean paper Latent Dirichlet Allocation, by Blei, Ng, and Jordan, in the Journal of Machine Learning Research, 2003.
[Figure 1: Graphical model representation of LDA, with variables α, β, θ, z, w and plates M (documents) and N (words per document). The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.]

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Biggl( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Biggr) d\theta. \qquad (3)

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Biggl( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Biggr) d\theta_d.

The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θ_d are document-level variables, sampled once per document. Finally, the variables z_{dn} and w_{dn} are word-level variables and are sampled once for each word in each document.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics.

Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). Indeed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as α and β in simple implementations of LDA, but we also consider fuller Bayesian approaches as well.
“…most implementations of LDA assume the distribution is symmetric. For the symmetric distribution, a high alpha-value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain a mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words. If, on the other hand, the distribution is asymmetric, a high alpha-value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, high beta-values mean each topic is more likely to contain a specific word mix defined by the base measure. In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.

More generally, these are concentration parameters for the Dirichlet distributions used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general: http://people.cs.umass.edu/~wallach/talks/priors.pdf”
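The intuition in that quote is easy to check numerically. Below is a small illustration (not from the slides or the paper; just a NumPy sketch with made-up values) that draws document-topic proportions from symmetric Dirichlet priors with a high and a low concentration value:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics (arbitrary for the illustration)

# Symmetric Dirichlet: the same concentration value repeated for every topic.
for alpha in (10.0, 0.1):
    # Each row is one document's topic proportions theta_d ~ Dir(alpha, ..., alpha).
    theta = rng.dirichlet(alpha * np.ones(K), size=3)
    print(f"alpha = {alpha}")
    print(np.round(theta, 2))

# With alpha = 10 every document mixes most topics fairly evenly;
# with alpha = 0.1 each document concentrates its mass on one or two topics.
# The same reasoning applies to beta and the topic-word distributions.
```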
Table 2.1: Symbols used in the derivation of the LDA Gibbs sampling rule.
N: the number of words in the corpus
1 ≤ i, j ≤ N: index of words in the corpus
W = {w_i}: the corpus; w_i denotes a word
Z = {z_i}: latent topics assigned to the words in W
W_¬i = W \ w_i: the corpus excluding w_i
Z_¬i = Z \ z_i: latent topics excluding z_i
K: the number of topics, specified as a parameter
V: the number of unique words in the vocabulary
α: the parameters of the topic Dirichlet prior
β: the parameters of the word Dirichlet prior
Ω_{d,k}: count of words in document d assigned topic k; Ω_d denotes the d-th row of the matrix Ω
Λ_{k,v}: count of word v in the corpus assigned topic k; Λ_k denotes the k-th row of the matrix Λ
Ω^{¬i}_{d,k}: like Ω_{d,k} but excluding w_i and z_i
Λ^{¬i}_{k,v}: like Λ_{k,v} but excluding w_i and z_i
Θ = {θ_{d,k}}: θ_{d,k} = P(z = k | d), θ_d = P(z | d)
Φ = {φ_{k,v}}: φ_{k,v} = P(w = v | z = k), φ_k = P(w | z = k)

The Joint Distribution of LDA. We start from deriving the joint distribution

p(Z, W \mid \alpha, \beta) = p(W \mid Z, \beta)\, p(Z \mid \alpha), \qquad (2.12)

which is the basis of the derivation of the Gibbs updating rule and the parameter estimation rule. As p(W | Z, β) and p(Z | α) depend on Φ and Θ respectively, we derive them separately.

According to the definition of LDA, we have

p(W \mid Z, \beta) = \int p(W \mid Z, \Phi)\, p(\Phi \mid \beta)\, d\Phi, \qquad (2.13)

where p(Φ | β) has a Dirichlet distribution:

p(\Phi \mid \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1}, \qquad (2.14)

and p(W | Z, Φ) has a multinomial distribution:

p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Lambda_{k,v}}, \qquad (2.15)

where Λ is a K × V count matrix and Λ_{k,v} is the number of times that topic k is assigned to word v. With W and Z defined by (2.1) and (2.3) respectively, we can represent Λ_{k,v} mathematically by

\Lambda_{k,v} = \sum_{i=1}^{N} \mathbb{I}\{w_i = v \wedge z_i = k\}. \qquad (2.16)
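The excerpt above stops at the joint distribution. Integrating out Θ and Φ yields the standard collapsed Gibbs update, p(z_i = k | Z_¬i, W) ∝ (Ω^¬i_{d,k} + α) · (Λ^¬i_{k,v} + β) / (Σ_v Λ^¬i_{k,v} + Vβ) for symmetric priors. The sketch below is a plain single-machine NumPy illustration of that sampler, not the distributed implementation from Wang's note; the original symbol for the word-count matrix did not survive extraction, so Λ is used here as in the reconstructed table.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (symmetric scalar priors).

    docs: list of documents, each a list of word ids in [0, V).
    Update rule (notation of Table 2.1):
        p(z_i = k | Z_-i, W) is proportional to
            (Omega_-i[d,k] + alpha) * (Lambda_-i[k,v] + beta) / (sum_v Lambda_-i[k,v] + V*beta)
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    Omega = np.zeros((D, K))      # document-topic counts
    Lambda = np.zeros((K, V))     # topic-word counts
    Lambda_k = np.zeros(K)        # row sums of Lambda
    Z = []                        # topic assignment for every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = rng.integers(K, size=len(doc))
        Z.append(zs)
        for v, k in zip(doc, zs):
            Omega[d, k] += 1
            Lambda[k, v] += 1
            Lambda_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = Z[d][i]
                # Remove the current token from the counts (the "not i" quantities).
                Omega[d, k_old] -= 1
                Lambda[k_old, v] -= 1
                Lambda_k[k_old] -= 1
                # Sample a new topic from the collapsed conditional.
                p = (Omega[d] + alpha) * (Lambda[:, v] + beta) / (Lambda_k + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                Z[d][i] = k_new
                Omega[d, k_new] += 1
                Lambda[k_new, v] += 1
                Lambda_k[k_new] += 1

    # Point estimates of theta (doc-topic) and phi (topic-word) from the final counts.
    theta = (Omega + alpha) / (Omega + alpha).sum(axis=1, keepdims=True)
    phi = (Lambda + beta) / (Lambda + beta).sum(axis=1, keepdims=True)
    return theta, phi


# Toy usage: four tiny "documents" over a vocabulary of six word ids.
docs = [[0, 1, 0, 2], [1, 0, 0, 1], [3, 4, 5, 3], [4, 5, 5, 3]]
theta, phi = lda_gibbs(docs, K=2, V=6)
print(np.round(theta, 2))
```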
LDA Implementation
MCMC
VI - Variational Inference
Variational Inference: A Review for Statisticians (Blei,
Kucukelbir, McAuliffe)
Variational inference for Dirichlet process mixtures
(Blei, Jordan)
LDA is typically part of a pipeline: pre-LDA, strip stop words, etc.; post-LDA, the document-topic proportions can be treated as a dimensionality reduction (see the sketch below).
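A concrete version of that pipeline, sketched with scikit-learn (the three documents are placeholders; stop-word stripping happens in the vectorizer, and the fitted document-topic matrix serves as the reduced representation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the gene mutation was observed in yeast cells",
    "the telescope observed the galaxy and its stars",
    "dna repair genes in mutant yeast strains",
]  # placeholder documents

# Pre-LDA: strip English stop words and build a term-count matrix.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Fit LDA; doc_topic_prior and topic_word_prior play the roles of alpha and beta.
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.1,
                                topic_word_prior=0.01, random_state=0)

# Post-LDA: the document-topic proportions are a K-dimensional representation
# of each document, i.e. LDA used as dimensionality reduction.
doc_topics = lda.fit_transform(X)
print(doc_topics.shape)  # (n_documents, n_topics)
```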
Variational Inference: A Review for Statisticians
David M. Blei
Department of Computer Science and Statistics
Columbia University
Alp Kucukelbir
Department of Computer Science
Columbia University
Jon D. McAuliffe
Department of Statistics
University of California, Berkeley
November 3, 2016
Abstract
One of the core problems of modern statistics is to approximate difficult-to-compute
probability densities. This problem is especially important in Bayesian statistics, which
frames all inference about unknown quantities as a calculation involving the posterior
density. In this paper, we review variational inference (VI), a method from machine
learning that approximates probability densities through optimization. VI has been used
in many applications and tends to be faster than classical methods, such as Markov
chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities
and then to find the member of that family which is close to the target. Closeness
is measured by Kullback-Leibler divergence. We review the ideas behind mean-field
variational inference, discuss the special case of VI applied to exponential family models,
present a full example with a Bayesian mixture of Gaussians, and derive a variant that
uses stochastic optimization to scale up to massive data. We discuss modern research
in VI and highlight important open problems. VI is powerful, but it is not yet well
understood. Our hope in writing this paper is to catalyze statistical research on this class
of algorithms.
Keywords: Algorithms; Statistical Computing; Computationally Intensive Methods.
arXiv:1601.00670v4 [stat.CO] 2 Nov 2016
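For reference, the optimization described in the abstract is usually written as follows; minimizing the KL divergence to the posterior is equivalent to maximizing the evidence lower bound (ELBO). This restatement is standard material rather than a quote from the paper:

```latex
% Variational inference as optimization: choose the member of the family Q
% closest in KL divergence to the exact posterior p(z | x).
q^{*}(\mathbf{z}) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\bigr)

% Equivalently, maximize the evidence lower bound (ELBO):
\mathrm{ELBO}(q) \;=\; \mathbb{E}_{q}\bigl[\log p(\mathbf{x}, \mathbf{z})\bigr]
  - \mathbb{E}_{q}\bigl[\log q(\mathbf{z})\bigr]
  \;=\; \log p(\mathbf{x}) - \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\bigr)
```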
Bayesian Analysis (2006) 1, Number 1, pp. 121–144
Variational Inference for Dirichlet Process
Mixtures
David M. Blei∗
Michael I. Jordan†
Abstract. Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.

Keywords: Dirichlet processes, hierarchical models, variational inference, image processing, Bayesian computation

1 Introduction

The methodology of Monte Carlo Markov chain (MCMC) sampling has energized Bayesian statistics for more than a decade, providing a systematic approach to the computation of likelihoods and posterior distributions, and permitting the deployment of Bayesian methods in a rapidly growing number of applied problems. However, while an unquestioned success story, MCMC is not an unqualified one—MCMC methods can be slow to converge and their convergence can be difficult to diagnose. While further research on sampling is needed, it is also important to explore alternatives, particularly in the context of large-scale problems.

One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001; Wainwright and Jordan …
Topic modeling for the newbie
LDA assumes a probabilistic generative model for documents. For our purposes, that means the following (a short simulation follows the source link below):
There is some fixed number K of topics.
There is a random variable that assigns each topic an associated probability
distribution over words. You should think of this distribution as the probability of
seeing word w given topic k.
There is another random variable that assigns each document a probability
distribution over topics. You should think of this distribution as the mixture of
topics in document d.
Each word in a document was generated by first randomly picking a topic (from
the document’s distribution of topics) and then randomly picking a word (from the
topic’s distribution of words).
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
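Here is a minimal simulation of that generative story (a sketch with made-up sizes and hyperparameters; in practice the topics are unknown and must be inferred from the observed words):

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, n_docs, doc_len = 3, 8, 4, 10        # assumed sizes for the sketch
alpha, beta = 0.5, 0.1                     # Dirichlet hyperparameters

# Each topic k is a distribution over the V vocabulary words: p(w | z = k).
topics = rng.dirichlet(beta * np.ones(V), size=K)

documents = []
for _ in range(n_docs):
    # Each document gets its own mixture of topics: p(z | d).
    theta = rng.dirichlet(alpha * np.ones(K))
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)         # pick a topic from the document's mixture
        w = rng.choice(V, p=topics[z])     # pick a word from that topic's distribution
        words.append(w)
    documents.append(words)

print(documents[0])  # a document is just a bag of word ids under this model
```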
Libraries
Spark
https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
R
Package ‘lda’ implements a fast collapsed Gibbs sampler
written in C
scikit-learn
Large Scale Topic Modeling: Improvements to LDA on Apache Spark (The Databricks Engineering Blog)
September 22, 2015, by Feynman Liang, Yuhao Yang and Joseph Bradley
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
This blog was written by Feynman Liang and Joseph Bradley from Databricks, and Yuhao Yang from Intel.
What are people discussing on Twitter? To catch up on distributed computing, what news articles should I read? These are questions that can be
answered by topic models, a technique for analyzing the topics present in collections of documents. This blog post discusses improvements in Apache
Spark 1.4 and 1.5 for topic modeling using the powerful Latent Dirichlet Allocation (LDA) algorithm.
Spark 1.4 and 1.5 introduced an online algorithm for running LDA incrementally, support for more queries on trained LDA models, and performance
metrics such as likelihood and perplexity. We give an example here of training a topic model over a dataset of 4.5 million Wikipedia articles.
Topic models and LDA
Topic models take a collection of documents and automatically infer the topics being discussed. For example, when we run Spark’s LDA on a dataset of
4.5 million Wikipedia articles, we can obtain topics like those in the table below.
Table 1: Example LDA topics learned from Wikipedia articles dataset
In addition, LDA tells us which topics each document is about; document X might be 30% about Topic 1 (“politics”) and 70% about Topic 5 (“airlines”).
Latent Dirichlet Allocation (LDA) has been one of the most successful topic models in practice. See our previous blog post on LDA
(https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html) to learn more.
A new online variational learning algorithm
Online variational inference is a technique for learning an LDA model by processing the data incrementally in small batches. By processing in small
batches, we are able to easily scale to very large datasets. MLlib implements an algorithm for performing online variational inference originally described
by Hoffman et al. (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf).
Performance comparison
The topics shown previously were learned using the newly developed online variational learning algorithm. Comparing timing results, we see a significant speedup from the new online algorithm over the older EM algorithm.
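Spark's MLlib API is not reproduced here, but the mini-batch idea behind online variational inference can be sketched with scikit-learn's online LDA, which performs one variational update per batch via partial_fit (the documents below are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder stream of documents; in practice these would arrive in batches
# from a much larger corpus (e.g. a Wikipedia dump).
batches = [
    ["politics election vote party", "party leader election campaign"],
    ["airline flight airport pilot", "flight delay airline passenger"],
]

# A fixed vocabulary is needed across batches; a real streaming pipeline might
# use a hashing vectorizer instead of fitting on the whole stream up front.
vectorizer = CountVectorizer()
vectorizer.fit([doc for batch in batches for doc in batch])

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                learning_offset=10.0, random_state=0)

for batch in batches:
    X = vectorizer.transform(batch)
    lda.partial_fit(X)   # one online variational update per mini-batch

print(lda.components_.shape)  # (n_topics, vocabulary size)
```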
Eight to Late
Sensemaking and Analytics for Organizations
A gentle introduction to topic modeling using R
Introduction
The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty
much what Google and other search engines do routinely…and they do it well.  However, as useful as
this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large
collection of documents but have no idea what they are about. One of the first things you might want to
do is to classify these documents into topics or themes. Among other things this would help you figure
out if there’s anything interest while also directing you to the relevant subset(s) of the corpus. For small
collections, one could do this by simply going through each document but this is clearly infeasible for
corpuses containing thousands of documents.
Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of
documents into themes
Aneesha Bakharia
Data Science, Learning Analytics, Electronics — Brisbane, Australia
Sep 1, 2016
Conclusions
LDA is part of an evolving family of topic modeling algorithms, which has advanced from tf-idf through NMF, LSI, and pLSI to LDA.
LDA is also being elaborated, for example into supervised models and into variants that weigh in the index, abstract, and bibliography.
Questions
The paper linked below argues for deep learning in NLP. In some intuitive sense, are we leaving information on the table? Can we use deep learning for topic modeling? How beneficial, and how practical, would it be to fold it into our NLP pipelines? "For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors. Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs."
http://u.cs.biu.ac.il/~yogo/nnlp.pdf
Would LDA benefit from adding 2-gram features to each document?
Google is not just keyword searching. Can we use probabilistic topic modeling as part of a web search algorithm?
In the introduction, Blei seems to suggest that we could index the web using topic models.
I wonder what the practical constraints to this are.
Where have you used topic modeling? Where could you imagine it used?
References
D. Blei (2012), Probabilistic Topic Models. http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
D. Blei, A. Ng, M. Jordan (2003), Latent Dirichlet Allocation. http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf
J. Pritchard, M. Stephens, P. Donnelly (2000), Inference of Population Structure Using Multilocus Genotype Data. http://www.genetics.org/content/genetics/155/2/945.full.pdf
D. Blei (2017), home page at Columbia University. http://www.cs.columbia.edu/~blei/
D. Blei (2009), Topic Models, videos from the Machine Learning Summer School, Cambridge. http://videolectures.net/mlss09uk_cambridge/
D. Blei (2017), Google Scholar profile. https://scholar.google.com/scholar?hl=en&q=dm+blei
D. Mimno (2017), Topic Modeling Bibliography. https://mimno.infosci.cornell.edu/topics.html
D. Blei, J. Lafferty (2007), A Correlated Topic Model of Science. http://www.cs.columbia.edu/~blei/papers/BleiLafferty2007.pdf
D. Blei, J. Lafferty (2009), Topic Models. http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
Y. Wang (2008), Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details. https://cxwangyi.files.wordpress.com/2012/01/llt.pdf
D. Blei, A. Kucukelbir, J. McAuliffe (2016), Variational Inference: A Review for Statisticians. https://arxiv.org/abs/1601.00670
D. Blei, M. Jordan (2004), Variational Inference for Dirichlet Process Mixtures. http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf
M. Beaugureau (2015), Topic Modeling for the Newbie. https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
F. Liang, Y. Yang, J. Bradley (2015), Large Scale Topic Modeling: Improvements to LDA on Apache Spark. https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
K. Awati (2015), A Gentle Introduction to Topic Modeling Using R. https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
A. Bakharia (2016), Topic Modeling with Scikit Learn. https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730

Probabilistic Topic Models

  • 1.
    Probabilistic Topic Models Surveyinga suite of algorithms that offer a solution to managing large document archives by David M. Blei http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf Presented by Steve Follmer Bay Area NLP Reading Group presents
  • 2.
    Origins Today we arereading a survey paper from 2012 by David Blei, the leading exponent of probabilistic topic modeling. Blei kicked things off in a 2003 paper co-authored with Michael Jordan and Andrew Ng, Latent Dirichlet Allocation. As Blei himself notes, LDA was first published in 2000 in a paper on population genetics. Further, probabilistic topic modeling does not require LDA, as we will see later in the presentation.
  • 3.
    DOI:10.1145/2133806.2133826 Surveying a suiteof algorithms that offer a solution to managing large document archives. BY DAVID M. BLEI Probabilistic Topic Models AS OUR COLLECTIVE knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper—for- eign policy, national affairs, sports. We could zoom in on a theme of in- terest, such as foreign policy, to reveal various aspects of it—Chinese foreign policy, the conflict in the Middle East, the U.S.’s relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would beanewkindofwindowthroughwhich to explore and digest the collection. But we do not interact with elec- tronic archives in this way. While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilis- tic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algo- rithms are statistical methods that ana- lyze the words of the original texts to discover the themes that run through them, how those themes are connected
  • 4.
    Journal of MachineLearning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03 Latent Dirichlet Allocation David M. Blei BLEI@CS.BERKELEY.EDU Computer Science Division University of California Berkeley, CA 94720, USA Andrew Y. Ng ANG@CS.STANFORD.EDU Computer Science Department Stanford University Stanford, CA 94305, USA Michael I. Jordan JORDAN@CS.BERKELEY.EDU Computer Science Division and Department of Statistics University of California Berkeley, CA 94720, USA Editor: John Lafferty Abstract We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for
  • 5.
    Copyright  2000by the Genetics Society of America Inference of Population Structure Using Multilocus Genotype Data Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom Manuscript received September 23, 1999 Accepted for publication February 18, 2000 ABSTRACT We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more popula- tions if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individu- als. We showthat the method can produce highlyaccurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http:// www.stats.ox.ac.uk/ zpritch/ home.html. IN applications of population genetics, it is often use- populationsbased on these subjective criteria represents a natural assignment in genetic terms, and it would beful to classify individuals in a sample into popula- tions. In one scenario, the investigator begins with a useful to be able to confirm that subjective classifications are consistent with genetic information and hence ap-sample of individuals and wants to say something about the properties of populations. For example, in studies propriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic”of human evolution, the population is often considered to be the unit of interest, and a great deal of work has population structure—i.e., population structure that is difficult to detect using visible characters, but may befocused on learning about the evolutionary relation- ships of modern populations (e.g., Caval l i et al. 1994). significant in genetic terms. For example, when associa- tion mapping is used to find disease genes, the presenceIn a second scenario, the investigator begins with a set of predefined populations and wishes to classifyindivid- of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewensuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A and Spiel man 1995). The problem of cryptic population structure also arises in the context of DNA fingerprint-standard approach involves sampling DNA from mem- bers of a number of potential source populations and ing for forensics, where it is important to assess the
  • 6.
  • 7.
    9/2017 www.cs.columbia.edu/~blei/ I ama professor of Statistics and Computer Science at Columbia University. I am also a member of the Columbia Data Science Institute. I work in the fields of machine learning and Bayesian statistics. See my CV and publications . My research interests include: Topic models Probabilistic modeling Approximate Bayesian inference Here are two recent talks: Probabilistic Topic Models and User Behavior Variational Inference: Foundations and Innovations Most of our publications are attached to open-source software. See our GitHub page. We released Edward: A library for probabilistic modeling, inference, and criticism. Machine Learning at Columbia Columbia has a thriving machine learning community, with many faculty and researchers across departments. The MLColumbia Google group is a good source of information about talks and other events on campus. Teaching In Spring 2017, I am teaching Applied Causality. My previous courses are here. David M. Blei Columbia University david.blei@columbia.edu About Topic modeling Courses Publications
  • 8.
    4/18/2017 david blei- Google Scholar David Blei Professor of Statistics and Computer Science, Columbia University Verified email at columbia.edu Cited by 45111 DM Blei, AY Ng, MI Jordan ­ Journal of machine Learning research, 2003 ­ jmlr.org Abstract We describe latent Dirichlet allocation (LDA), a generative probabilistic model for   collections of discrete data such as text corpora. LDA is a three­level hierarchical Bayesian   model, in which each item of a collection is modeled as a finite mixture over an underlying   Cited by 18352  Related articles  All 123 versions  Cite  Save YW Teh, MI Jordan, MJ Beal, DM Blei ­ NIPS, 2004 ­ papers.nips.cc Abstract We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian   model for clustering problems involving multiple groups of data. Each group of data is   modeled with a mixture, with the number of components being open­ended and inferred   Cited by 2896  Related articles  All 94 versions  Cite  Save  More JD Mcauliffe, DM Blei ­ Advances in neural information processing …, 2008 ­ papers.nips.cc Abstract We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of   labelled documents. The model accommodates a variety of response types. We derive a   maximum­likelihood procedure for parameter estimation, which relies on variational   Cited by 1739  Related articles  All 44 versions  Cite  Save …, P Duygulu, D Forsyth, N Freitas, DM Blei… ­ Journal of machine …, 2003 ­ jmlr.org Abstract We present a new approach for modeling multi­modal data sets, focusing on the   specific case of segmented images with associated text. Learning the joint distribution of   image regions and words has many applications. We consider in detail predicting words   Cited by 1694  Related articles  All 45 versions  Cite  Save DM Blei ­ Communications of the ACM, 2012 ­ dl.acm.org as OUr COLLeCTive knowledge continues to be digitized and stored—in the form of news,   blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it   User profiles for david blei Latent dirichlet allocation Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. Supervised topic models Matching words and pictures Probabilistic topic models
  • 9.
    4/18/2017 Topic modelingbibliography Bibliometrics Cross-language Evaluation Implementations Inference NLP Networks Non-parametric Scalability Social media Temporal Theory User interface Vision Where to start Topic Modeling Bibliography Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing. Mixed Membership Stochastic Blockmodels. JMLR (9) 2008 pp. 1981-2014. Networks [BibTeX] Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. Topic Significance Ranking of LDA Generative Models. ECML (2009). Evaluation [BibTeX] David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu. Statistical Debugging using Latent Topic Models. ECML (2007). [BibTeX] David Andrzejewski, Xiaojin Zhu, Mark Craven. Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. ICML (2009). [BibTeX] David Andrzejewski, Xiaojin Zhu, Mark Craven, Ben Recht. A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation using First-Order Logic. IJCAI
  • 10.
    What "Topic modeling algorithmsare statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time."
  • 11.
    Why Algorithms for managinglarge document archives There's lots of data. And its growing exponentially. A lot of it is unstructured. We can better understand, access, search, and use this data if we can organize it. And given the growing scale, we need to do that algorithmically on computers. Topic Models can find themes and organize documents automatically, with results that can be superior to keyword searching. Also, can organize these results across time.

  • 12.
    Topic Modeling inAction Lets look at an example of topic modeling. From"A Correlated Topic Model of Science" (Annals of Applied Statistics, 2007) The goal is to take an unstructured corpora of text documents, and infer topics to group the documents together. Can similarly model genetics, images, social networks. dance steps.

  • 13.
    7 www.cs.cmu.edu/~lemur/science/2.html WORDS gene dna mutations mutation mutant yeast cells genes mutants type wild recombination fig protein growth telomerase strains strain repair phenotype plasmid allele human cell two cerevisiae RELATED TOPICS genedna mutations mutation mutant plants gene plant genes expression protein cell kinase activity cycle cells leukemia cell abl patients dna replication cell chromosome chromosomes protein sequence amino cdna fig gene disease human chromosome cancer rna dna site structure polymerase cells cell fig expression human RELATED DOCUMENTS "Requirement of the Yeast RTH1 5' to 3' Exonuclease for the Stability of Simple Repetitive DNA" (1995) "Adaptive Mutation by Deletions in Small Mononucleotide Repeats" (1994) "Recombination in Adaptive Mutation" (1994) "Evidence That F Plasmid Transfer Replication Underlies Apparent Adaptive Mutation" (1995) "Cdc13p: A Single‐Strand Telomeric DNA‐Binding Protein with a Dual Role in Yeast Telomere Maintenance" (1996) "Adaptive Mutation in Escherichia coli: A Role for Conjugation" (1995) "Two Modes of Survival of Fission Yeast Without Telomerase" (1998) "Association of Increased Spontaneous Mutation Rates with High Levels of Transcription in Yeast" (1995) "Removal of Nonhomologous DNA Ends in Double‐Strand Break Recombination: The Role of the Yeast Ultraviolet Repair Gene RAD1" (1992) "Adaptive Mutation: Who's Really in the Garden?" (1995) "Evidence that Gene Amplification Underlies Adaptive Mutability of the Bacterial lac Operon" (1998) "Involvement of the Silencer and UAS Binding Protein RAP1 in Regulation of Telomere Length" (1990) "Abnormal Chromosome Behavior in Neurospora Mutants Defective in DNA Methylation" (1993) "Telomeres, Telomerase, and Cancer" (1995) "HSP104 Required for Induced Thermotolerance" (1990) "Est1 and Cdc13 as Comediators of Telomerase Access" (1999) "Conditional Mutator Phenotypes in hMSH2‐Deficient Tumor Cell Lines" (1997) "Appropriate Partners Make Good Matches" (1995)
  • 14.
    © Institute ofMathematical Statistics, 2007 A CORRELATED TOPIC MODEL OF SCIENCE1 BY DAVID M. BLEI AND JOHN D. LAFFERTY Princeton University and Carnegie Mellon University Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A lim- itation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet dis- tribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We ap- ply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections. 1. Introduction. Large collections of documents are readily available on- line and widely accessed by diverse communities. As a notable example, schol- arly articles are increasingly published in electronic form, and historical archives are being scanned and made accessible. The not-for-profit organization JSTOR (www.jstor.org) is currently one of the leading providers of journals to the schol- arly community. These archives are created by scanning old journals and running an optical character recognizer over the pages. JSTOR provides the original scans on-line, and uses their noisy version of the text to support keyword search. Since the data are largely unstructured and comprise millions of articles spanning cen- turies of scholarly work, automated analysis is essential. The development of new tools for browsing, searching and allowing the productive use of such archives is thus an important technological challenge, and provides new opportunities for statistical modeling. Received March 2007; revised April 2007. 1Supported in part by NSF Grants IIS-0312814 and IIS-0427206, the DARPA CALO project and a grant from Google. Supplementary material and code are available at http://imstat.org/aoas/supplements Key words and phrases. Hierarchical models, approximate posterior inference, variational meth- ods, text analysis.
  • 15.
    TOPIC MODELS DAVID M.BLEI PRINCETON UNIVERSITY JOHN D. LAFFERTY CARNEGIE MELLON UNIVERSITY 1. INTRODUCTION Scientists need new tools to explore and browse large collections of schol- arly literature. Thanks to organizations such as JSTOR, which scan and index the original bound archives of many journals, modern scientists can search digital libraries spanning hundreds of years. A scientist, suddenly faced with access to millions of articles in her field, is not satisfied with simple search. Effectively using such collections requires interacting with them in a more structured way: finding articles similar to those of interest, and exploring the collection through the underlying topics that run through it. The central problem is that this structure—the index of ideas contained in the articles and which other articles are about the same kinds of ideas—is not readily available in most modern collections, and the size and growth rate of these collections preclude us from building it by hand. To develop the necessary tools for exploring and browsing modern digital libraries, we require automated methods of organizing, managing, and delivering their contents. In this chapter, we describe topic models, probabilistic models for uncov- ering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts Blei et al. (2003); Grif- fiths and Steyvers (2004); Buntine and Jakulin (2004); Hofmann (1999);
  • 16.
    LDA LDA introduced inthe Promethean paper Latent Dirichlet Allocation, by Blei, Ng, and Jordan, in Journal of Machine Learning 2003
  • 17.
    α z wθ β M N Figure1: Graphical model representation of LDA. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. where p(zn |θ) is simply θi for the unique i such that zi n = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document: p(w|α,β) = Z p(θ|α) N ∏ n=1 ∑ zn p(zn |θ)p(wn |zn,β) ! dθ. (3) Finally, taking the product of the marginal probabilities of single documents, we obtain the proba- bility of a corpus: p(D|α,β) = M ∏ d=1 Z p(θd |α) Nd ∏ n=1 ∑ zdn p(zdn |θd)p(wdn |zdn,β) ! dθd. The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus- level parameters, assumed to be sampled once in the process of generating a corpus. The variables θd are document-level variables, sampled once per document. Finally, the variables zdn and wdn are word-level variables and are sampled once for each word in each document. It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics. Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as con- ditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). In- deed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as α and β in simple implementations of LDA, but we also consider fuller Bayesian approaches as well.
  • 18.
“…most implementations of LDA assume the distribution is symmetric. For the symmetric distribution, a high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts fewer such constraints on documents and means that it is more likely that a document may contain a mixture of just a few, or even only one, of the topics. Likewise, a high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words. If, on the other hand, the distribution is asymmetric, a high alpha value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, a high beta value means each topic is more likely to contain a specific word mix defined by the base measure. In practice, a high alpha value will lead to documents being more similar in terms of what topics they contain. A high beta value will similarly lead to topics being more similar in terms of what words they contain. More generally, these are concentration parameters for the Dirichlet distributions used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general: http://people.cs.umass.edu/~wallach/talks/priors.pdf”
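To build intuition for these concentration parameters, a quick sketch (illustrative values only, not tied to any particular corpus) is to sample document-topic distributions from a symmetric Dirichlet at several alpha values: small alpha yields sparse, peaked topic mixtures, while large alpha yields near-uniform ones.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5  # hypothetical number of topics

for alpha in (0.1, 1.0, 10.0):
    # Draw a few document-topic distributions theta ~ Dir(alpha, ..., alpha).
    thetas = rng.dirichlet(np.full(K, alpha), size=3)
    print(f"alpha = {alpha}")
    for theta in thetas:
        print("  ", np.round(theta, 2))
# Small alpha: most of the mass lands on one or two topics (sparse mixtures).
# Large alpha: mass is spread roughly evenly across all K topics.
```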
  • 19.
Table 2.1: Symbols used in the derivation of the LDA Gibbs sampling rule.
N: the number of words in the corpus
1 ≤ i, j ≤ N: index of words in the corpus
W = {w_i}: the corpus, where w_i denotes a word
Z = {z_i}: latent topics assigned to words in W
W_¬i = W \ {w_i}: the corpus excluding w_i
Z_¬i = Z \ {z_i}: latent topics excluding z_i
K: the number of topics, specified as a parameter
V: the number of unique words in the vocabulary
α: the parameters of the topic Dirichlet prior
β: the parameters of the word Dirichlet prior
Ω_{d,k}: count of words in document d assigned topic k; Ω_d denotes the d-th row of matrix Ω
Ψ_{k,v}: count of word v in the corpus assigned topic k; Ψ_k denotes the k-th row of matrix Ψ
Ω^{¬i}_{d,k}: like Ω_{d,k} but excluding w_i and z_i
Ψ^{¬i}_{k,v}: like Ψ_{k,v} but excluding w_i and z_i
Θ = {θ_{d,k}}: θ_{d,k} = P(z = k | d), θ_d = P(z | d)
Φ = {φ_{k,v}}: φ_{k,v} = P(w = v | z = k), φ_k = P(w | z = k)

The Joint Distribution of LDA. We start from deriving the joint distribution

p(Z, W \mid \alpha, \beta) = p(W \mid Z, \beta)\, p(Z \mid \alpha),   (2.12)

which is the basis of the derivation of the Gibbs updating rule and the parameter estimation rule. As p(W | Z, β) and p(Z | α) depend on Φ and Θ respectively, we derive them separately. According to the definition of LDA, we have

p(W \mid Z, \beta) = \int p(W \mid Z, \Phi)\, p(\Phi \mid \beta)\, d\Phi,   (2.13)

where p(Φ | β) has a Dirichlet distribution:

p(\Phi \mid \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1},   (2.14)

and p(W | Z, Φ) has a multinomial distribution:

p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Psi_{k,v}},   (2.15)

where Ψ is a K × V count matrix and Ψ_{k,v} is the number of times that topic k is assigned to word v. With W and Z defined by (2.1) and (2.3) respectively, we can represent Ψ_{k,v} mathematically by

\Psi_{k,v} = \sum_{i=1}^{N} I\{w_i = v \wedge z_i = k\}.   (2.16)
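The count matrices above are exactly the state a collapsed Gibbs sampler maintains. The following is a minimal, unoptimized sketch of the standard collapsed Gibbs update for LDA, assuming symmetric scalar priors alpha and beta and a corpus given as lists of word ids; it illustrates the update rule that this derivation leads to rather than reproducing Wang's derivation verbatim.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (minimal sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Maintains the count matrices described above:
      ndk[d, k] - words in document d assigned topic k   (Omega)
      nkv[k, v] - word v assigned topic k over the corpus (Psi)
      nk[k]     - total words assigned topic k
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))
    nkv = np.zeros((K, V))
    nk = np.zeros(K)
    z = []  # topic assignment for every word token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkv[k, w] += 1
            nk[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment (the "¬i" counts).
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                # Standard collapsed Gibbs conditional:
                # p(z_i = k | ...) ∝ (ndk + alpha) * (nkv + beta) / (nk + V*beta)
                p = (ndk[d] + alpha) * (nkv[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1

    # Point estimates of theta and phi from the final counts.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkv + beta) / (nkv + beta).sum(axis=1, keepdims=True)
    return theta, phi
```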
  • 20.
LDA Implementation. Posterior inference for LDA is done either with MCMC (e.g., Gibbs sampling) or with VI (variational inference). Two references: Variational Inference: A Review for Statisticians (Blei, Kucukelbir, McAuliffe) and Variational Inference for Dirichlet Process Mixtures (Blei, Jordan). In practice LDA is part of a pipeline: pre-LDA, strip stop words and do other text cleanup; post-LDA, the fitted document-topic proportions can be treated as a dimensionality reduction of the corpus. A minimal sketch of such a pipeline follows.
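As a concrete, hedged illustration of such a pipeline, the sketch below uses scikit-learn's CountVectorizer for pre-processing (stop-word removal) and LatentDirichletAllocation for the model, then treats the resulting document-topic matrix as a reduced representation. The miniature corpus and parameter values are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up miniature corpus for illustration.
docs = [
    "the senate debated the foreign policy bill",
    "the team won the championship game last night",
    "genome sequencing reveals population structure",
]

# Pre-LDA: tokenize and strip English stop words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA (scikit-learn uses variational inference under the hood).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)    # post-LDA: documents as topic mixtures

print(doc_topics)                    # shape (n_docs, n_topics)

# Inspect top words per topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```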
  • 21.
Variational Inference: A Review for Statisticians. David M. Blei (Department of Computer Science and Statistics, Columbia University), Alp Kucukelbir (Department of Computer Science, Columbia University), Jon D. McAuliffe (Department of Statistics, University of California, Berkeley). November 3, 2016. Abstract: One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this class of algorithms. Keywords: Algorithms; Statistical Computing; Computationally Intensive Methods. arXiv:1601.00670v4 [stat.CO], 2 Nov 2016
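As a toy illustration of the core idea (posit a family of densities, then pick the member closest in KL divergence to the target): below, the "family" is the set of factorized (mean-field) distributions over two binary variables, and the closest member to a made-up joint p is found by grid search over KL(q ‖ p). This is only a cartoon; real variational inference maximizes an evidence lower bound, because the target is usually known only up to a normalizing constant.

```python
import numpy as np
from itertools import product

# A made-up target joint distribution p(x, y) over two binary variables.
p = np.array([[0.30, 0.10],
              [0.05, 0.55]])

def kl(q, p):
    """KL(q || p) for two distributions on the same finite grid."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Mean-field family: q(x, y) = q1(x) * q2(y), parameterized by
# a = q1(x=1) and b = q2(y=1). Grid-search for the member closest to p.
best = None
for a, b in product(np.linspace(0.01, 0.99, 99), repeat=2):
    q = np.outer([1 - a, a], [1 - b, b])
    d = kl(q, p)
    if best is None or d < best[0]:
        best = (d, a, b)

print("closest factorized q: q1(x=1)=%.2f, q2(y=1)=%.2f, KL=%.4f"
      % (best[1], best[2], best[0]))
```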
  • 22.
Bayesian Analysis (2006) 1, Number 1, pp. 121–144. Variational Inference for Dirichlet Process Mixtures. David M. Blei, Michael I. Jordan. Abstract: Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem. Keywords: Dirichlet processes, hierarchical models, variational inference, image processing, Bayesian computation. 1 Introduction. The methodology of Monte Carlo Markov chain (MCMC) sampling has energized Bayesian statistics for more than a decade, providing a systematic approach to the computation of likelihoods and posterior distributions, and permitting the deployment of Bayesian methods in a rapidly growing number of applied problems. However, while an unquestioned success story, MCMC is not an unqualified one—MCMC methods can be slow to converge and their convergence can be difficult to diagnose. While further research on sampling is needed, it is also important to explore alternatives, particularly in the context of large-scale problems. One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001; Wainwright and Jordan 2003).
  • 23.
Topic modeling for the newbie. LDA assumes a probabilistic model for documents. For our purposes, that means the following:
- There is some fixed number K of topics.
- There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
- There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
- Each word in a document was generated by first randomly picking a topic (from the document's distribution of topics) and then randomly picking a word (from the topic's distribution of words).
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
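That generative story translates almost line for line into code. The sketch below uses made-up toy numbers (two hand-written topics over a six-word vocabulary) rather than a fitted model: draw the document's topic mixture, then for each word draw a topic and then a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(2)

K, V = 2, 6                      # made-up: 2 topics, 6-word vocabulary
vocab = ["ball", "game", "team", "gene", "dna", "cell"]
alpha = np.full(K, 0.5)          # prior over each document's topic mixture
topics = np.array([
    [0.4, 0.3, 0.25, 0.02, 0.02, 0.01],   # a "sports"-like topic
    [0.01, 0.02, 0.02, 0.35, 0.3, 0.3],   # a "genetics"-like topic
])

def generate_document(n_words):
    theta = rng.dirichlet(alpha)             # the document's mixture of topics
    words = []
    for _ in range(n_words):
        k = rng.choice(K, p=theta)           # pick a topic from the mixture
        w = rng.choice(V, p=topics[k])       # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 2), words)
```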
  • 24.
  • 25.
Large Scale Topic Modeling: Improvements to LDA on Apache Spark (The Databricks Blog, September 22, 2015), by Feynman Liang and Joseph Bradley from Databricks, and Yuhao Yang from Intel.
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
What are people discussing on Twitter? To catch up on distributed computing, what news articles should I read? These are questions that can be answered by topic models, a technique for analyzing the topics present in collections of documents. This blog post discusses improvements in Apache Spark 1.4 and 1.5 for topic modeling using the powerful Latent Dirichlet Allocation (LDA) algorithm. Spark 1.4 and 1.5 introduced an online algorithm for running LDA incrementally, support for more queries on trained LDA models, and performance metrics such as likelihood and perplexity. We give an example here of training a topic model over a dataset of 4.5 million Wikipedia articles.
Topic models and LDA. Topic models take a collection of documents and automatically infer the topics being discussed. For example, when we run Spark's LDA on a dataset of 4.5 million Wikipedia articles, we can obtain topics like those in the table below. [Table 1: Example LDA topics learned from the Wikipedia articles dataset.] In addition, LDA tells us which topics each document is about; document X might be 30% about Topic 1 ("politics") and 70% about Topic 5 ("airlines"). Latent Dirichlet Allocation (LDA) has been one of the most successful topic models in practice. See the earlier Databricks post on LDA (https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html) to learn more.
A new online variational learning algorithm. Online variational inference is a technique for learning an LDA model by processing the data incrementally in small batches. By processing in small batches, we are able to easily scale to very large datasets. MLlib implements an algorithm for performing online variational inference originally described by Hoffman et al. (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf).
Performance comparison. The table of topics shown previously was learned using the newly developed online variational learning algorithm. If we compare timing results, we can see a significant speedup in using the new online algorithm over the old EM algorithm.
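As a hedged sketch of what this looks like in code, the snippet below uses Spark's DataFrame-based pyspark.ml API with the online optimizer; the input path, vocabulary size, and other parameter values are placeholders, not the blog post's actual settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

# Placeholder input: a text file with one document per line.
docs = spark.read.text("/path/to/wikipedia_articles.txt").withColumnRenamed("value", "text")

# Pre-processing: tokenize and remove stop words, then build term-count vectors.
tokens = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+").transform(docs)
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10_000).fit(filtered)
vectors = cv_model.transform(filtered)

# Online variational inference, the incremental algorithm described in the post.
lda = LDA(k=10, maxIter=20, optimizer="online")
model = lda.fit(vectors)

# Topics are reported as indices into the CountVectorizer vocabulary.
model.describeTopics(maxTermsPerTopic=5).show(truncate=False)
```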
  • 26.
Eight to Late: Sensemaking and Analytics for Organizations. A gentle introduction to topic modeling using R. Introduction: The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well. However, as useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is to classify these documents into topics or themes. Among other things this would help you figure out if there's anything of interest while also directing you to the relevant subset(s) of the corpus. For small collections, one could do this by simply going through each document, but this is clearly infeasible for corpuses containing thousands of documents. Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes.
  • 27.
Topic Modeling with Scikit Learn. Aneesha Bakharia (Data Science, Learning Analytics, Electronics; Brisbane, Australia), Sep 1, 2016.
  • 28.
Conclusions. LDA is part of an evolving family of topic modeling algorithms, which has advanced from tf-idf through NMF, LSI, and pLSI to LDA. LDA itself is also being elaborated, for example into supervised models, and into variants that weight words differently depending on whether they appear in the index, abstract, or bibliography.
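For a feel of that lineage, the sketch below fits two of the earlier alternatives, LSI (via truncated SVD) and NMF, on a tf-idf matrix with scikit-learn; the tiny corpus and component counts are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD

docs = [
    "stocks fell as markets reacted to the rate decision",
    "the striker scored twice in the final match",
    "researchers sequenced the genome of the bacterium",
]  # placeholder corpus

# tf-idf weighting of the term-document matrix.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

for name, model in [("LSI", TruncatedSVD(n_components=2, random_state=0)),
                    ("NMF", NMF(n_components=2, random_state=0))]:
    doc_factors = model.fit_transform(X)        # documents in the reduced space
    print(name, "document factors shape:", doc_factors.shape)
    for k, comp in enumerate(model.components_):
        top = comp.argsort()[::-1][:4]
        print(f"  component {k}:", [vocab[i] for i in top])
```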
  • 29.
Questions. The paper linked below proposes deep learning for NLP; in some intuitive sense, are we leaving information on the table? Can we use deep learning for topic modeling? How beneficial, and how practical, would it be to fold it into our NLP pipelines? "For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors. Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs." http://u.cs.biu.ac.il/~yogo/nnlp.pdf Would LDA benefit from adding 2-gram features to each document? (A small experiment along these lines is sketched below.) Google is not just doing keyword searching: can we use probabilistic topic modeling as part of a web search algorithm? In the introduction, Blei seems to suggest that we could index the web using topic models; I wonder what the practical constraints to this are. Where have you used topic modeling? Where could you imagine it being used?
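On the 2-gram question, one quick experiment is to let the vectorizer emit unigrams and bigrams before fitting LDA, as in the hedged sketch below (made-up corpus; whether the extra features help is an empirical question, since LDA treats every feature as just another exchangeable token).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the middle east conflict dominated foreign policy debate",
    "foreign policy experts discussed the middle east",
    "the home team won the championship game",
]  # placeholder corpus

# ngram_range=(1, 2) adds bigram features such as "middle east" and "foreign policy".
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```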
  • 30.
D. Blei (2012), Probabilistic Topic Models
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
D. Blei, A. Ng, M. Jordan (2003), Latent Dirichlet Allocation
http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf
J. Pritchard, M. Stephens, P. Donnelly (2000), Inference of Population Structure Using Multilocus Genotype Data
http://www.genetics.org/content/genetics/155/2/945.full.pdf
D. Blei (2017), home page at Columbia University
http://www.cs.columbia.edu/~blei/
D. Blei (2009), Topic Models, videos from the Machine Learning Summer School, Cambridge
http://videolectures.net/mlss09uk_cambridge/
D. Blei (2017), Google Scholar
https://scholar.google.com/scholar?hl=en&q=dm+blei
D. Mimno (2017), Topic Modeling Bibliography
https://mimno.infosci.cornell.edu/topics.html
D. Blei, J. Lafferty (2007), A Correlated Topic Model of Science
http://www.cs.columbia.edu/~blei/papers/BleiLafferty2007.pdf
  • 31.
D. Blei, J. Lafferty (2009), Topic Models
http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
Y. Wang (2008), Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details
https://cxwangyi.files.wordpress.com/2012/01/llt.pdf
D. Blei, A. Kucukelbir, J. McAuliffe (2016), Variational Inference: A Review for Statisticians
https://arxiv.org/abs/1601.00670
D. Blei, M. Jordan (2004), Variational Inference for Dirichlet Process Mixtures
http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf
M. Beaugureau (2015), Topic Modeling for the Newbie
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
F. Liang, Y. Yang, J. Bradley (2015), Large Scale Topic Modeling: Improvements to LDA on Apache Spark
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
K. Awati (2015), A gentle introduction to topic modeling using R
https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
A. Bakharia (2016), Topic Modeling with Scikit Learn
https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730