Probabilistic Topic Models
Surveying a suite of algorithms that offer a solution to managing large document archives
by David M. Blei
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
Presented by Steve Follmer
Bay Area NLP Reading Group
Origins
Today we are reading a survey paper from 2012 by David Blei, the leading exponent of probabilistic topic
modeling.
Blei kicked things off in a 2003 paper co-authored with Michael Jordan and Andrew Ng, Latent Dirichlet
Allocation.
As Blei himself notes, the same model was first published in 2000 in a population-genetics paper (Pritchard, Stephens, and Donnelly).
Further, probabilistic topic modeling does not require LDA, as we will see later in the presentation.
DOI:10.1145/2133806.2133826
Surveying a suite of algorithms that offer a solution to managing large document archives.
BY DAVID M. BLEI
Probabilistic Topic Models
AS OUR COLLECTIVE knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that …

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper—foreign policy, national affairs, sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it—Chinese foreign policy, the conflict in the Middle East, the U.S.’s relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.

But we do not interact with electronic archives in this way. While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected …
Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03
Latent Dirichlet Allocation
David M. Blei BLEI@CS.BERKELEY.EDU
Computer Science Division
University of California
Berkeley, CA 94720, USA
Andrew Y. Ng ANG@CS.STANFORD.EDU
Computer Science Department
Stanford University
Stanford, CA 94305, USA
Michael I. Jordan JORDAN@CS.BERKELEY.EDU
Computer Science Division and Department of Statistics
University of California
Berkeley, CA 94720, USA
Editor: John Lafferty
Abstract
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
text modeling, the topic probabilities provide an explicit representation of a document. We present
efficient approximate inference techniques based on variational methods and an EM algorithm for
Copyright © 2000 by the Genetics Society of America
Inference of Population Structure Using Multilocus Genotype Data
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly
Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Manuscript received September 23, 1999
Accepted for publication February 18, 2000
ABSTRACT
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
IN applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (e.g., Cavalli-Sforza et al. 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A standard approach involves sampling DNA from members of a number of potential source populations and …

… populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic” population structure—i.e., population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the …
Who
David Blei
video
classes
Mimno
Hoffmann
Topic Modeling mailing list
www.cs.columbia.edu/~blei/ (retrieved 9/2017)
I am a professor of Statistics and Computer Science at
Columbia University. I am also a member of the Columbia Data
Science Institute. I work in the fields of machine learning and
Bayesian statistics.
See my CV and publications.
My research interests include:
Topic models
Probabilistic modeling
Approximate Bayesian inference
Here are two recent talks:
Probabilistic Topic Models and User Behavior
Variational Inference: Foundations and Innovations
Most of our publications are attached to open-source software.
See our GitHub page.
We released Edward: A library for probabilistic modeling,
inference, and criticism.
Machine Learning at Columbia
Columbia has a thriving machine learning community, with
many faculty and researchers across departments. The
MLColumbia Google group is a good source of information
about talks and other events on campus.
Teaching
In Spring 2017, I am teaching Applied Causality.
My previous courses are here.
David M. Blei
Columbia University
david.blei@columbia.edu
David Blei - Google Scholar (retrieved 4/18/2017)
David Blei
Professor of Statistics and Computer Science, Columbia University
Verified email at columbia.edu
Cited by 45111

Latent Dirichlet allocation
DM Blei, AY Ng, MI Jordan - Journal of Machine Learning Research, 2003 - jmlr.org
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying … Cited by 18352

Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
YW Teh, MI Jordan, MJ Beal, DM Blei - NIPS, 2004 - papers.nips.cc
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred … Cited by 2896

Supervised topic models
JD Mcauliffe, DM Blei - Advances in Neural Information Processing Systems, 2008 - papers.nips.cc
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational … Cited by 1739

Matching words and pictures
…, P Duygulu, D Forsyth, N Freitas, DM Blei… - Journal of Machine Learning Research, 2003 - jmlr.org
Abstract: We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words … Cited by 1694

Probabilistic topic models
DM Blei - Communications of the ACM, 2012 - dl.acm.org
As our collective knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it …
Topic modeling bibliography (retrieved 4/18/2017)
Bibliometrics
Cross-language
Evaluation
Implementations
Inference
NLP
Networks
Non-parametric
Scalability
Social media
Temporal
Theory
User interface
Vision
Where to start
Topic Modeling Bibliography
Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing. Mixed Membership Stochastic Blockmodels. JMLR (9) 2008, pp. 1981-2014. [Networks]
Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. Topic Significance Ranking of LDA Generative Models. ECML (2009). [Evaluation]
David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu. Statistical Debugging using Latent Topic Models. ECML (2007).
David Andrzejewski, Xiaojin Zhu, Mark Craven. Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. ICML (2009).
David Andrzejewski, Xiaojin Zhu, Mark Craven, Ben Recht. A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation using First-Order Logic. IJCAI …
What
"Topic modeling algorithms are statistical
methods that analyze the words of the
original texts to discover the themes that run
through them, how those themes are
connected to each other, and how they
change over time."
Why
Algorithms for managing large document archives
There's lots of data, and it's growing exponentially. A lot of it is unstructured.
We can better understand, access, search, and use this data if we can organize it. And given the growing scale, we need to do that algorithmically, on computers.
Topic models can find themes and organize documents automatically, with results that can be superior to keyword search. They can also organize these results across time.

Topic Modeling in Action
Let's look at an example of topic modeling, from "A Correlated Topic Model of Science" (Annals of Applied Statistics, 2007).
The goal is to take an unstructured corpus of text documents and infer topics that group the documents together.
Similar models can be applied to genetics, images, social networks, even dance steps.

Example topic from the online Science browser at www.cs.cmu.edu/~lemur/science/2.html:

WORDS: gene, dna, mutations, mutation, mutant, yeast, cells, genes, mutants, type, wild, recombination, fig, protein, growth, telomerase, strains, strain, repair, phenotype, plasmid, allele, human, cell, two, cerevisiae

RELATED TOPICS (each listed by its top words):
gene dna mutations mutation mutant
plants gene plant genes expression
protein cell kinase activity cycle
cells leukemia cell abl patients
dna replication cell chromosome chromosomes
protein sequence amino cdna fig
gene disease human chromosome cancer
rna dna site structure polymerase
cells cell fig expression human

RELATED DOCUMENTS:
"Requirement of the Yeast RTH1 5' to 3' Exonuclease for the Stability
of Simple Repetitive DNA" (1995)
"Adaptive Mutation by Deletions in Small Mononucleotide Repeats"
(1994)
"Recombination in Adaptive Mutation" (1994)
"Evidence That F Plasmid Transfer Replication Underlies Apparent
Adaptive Mutation" (1995)
"Cdc13p: A Single‐Strand Telomeric DNA‐Binding Protein with a Dual
Role in Yeast Telomere Maintenance" (1996)
"Adaptive Mutation in Escherichia coli: A Role for Conjugation" (1995)
"Two Modes of Survival of Fission Yeast Without Telomerase" (1998)
"Association of Increased Spontaneous Mutation Rates with High Levels
of Transcription in Yeast" (1995)
"Removal of Nonhomologous DNA Ends in Double‐Strand Break
Recombination: The Role of the Yeast Ultraviolet Repair Gene RAD1"
(1992)
"Adaptive Mutation: Who's Really in the Garden?" (1995)
"Evidence that Gene Amplification Underlies Adaptive Mutability of the
Bacterial lac Operon" (1998)
"Involvement of the Silencer and UAS Binding Protein RAP1 in
Regulation of Telomere Length" (1990)
"Abnormal Chromosome Behavior in Neurospora Mutants Defective in
DNA Methylation" (1993)
"Telomeres, Telomerase, and Cancer" (1995)
"HSP104 Required for Induced Thermotolerance" (1990)
"Est1 and Cdc13 as Comediators of Telomerase Access" (1999)
"Conditional Mutator Phenotypes in hMSH2‐Deficient Tumor Cell Lines"
(1997)
"Appropriate Partners Make Good Matches" (1995)
© Institute of Mathematical Statistics, 2007
A CORRELATED TOPIC MODEL OF SCIENCE1
BY DAVID M. BLEI AND JOHN D. LAFFERTY
Princeton University and Carnegie Mellon University
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.

1. Introduction. Large collections of documents are readily available on-line and widely accessed by diverse communities. As a notable example, scholarly articles are increasingly published in electronic form, and historical archives are being scanned and made accessible. The not-for-profit organization JSTOR (www.jstor.org) is currently one of the leading providers of journals to the scholarly community. These archives are created by scanning old journals and running an optical character recognizer over the pages. JSTOR provides the original scans on-line, and uses their noisy version of the text to support keyword search. Since the data are largely unstructured and comprise millions of articles spanning centuries of scholarly work, automated analysis is essential. The development of new tools for browsing, searching and allowing the productive use of such archives is thus an important technological challenge, and provides new opportunities for statistical modeling.

Received March 2007; revised April 2007. Supported in part by NSF Grants IIS-0312814 and IIS-0427206, the DARPA CALO project and a grant from Google. Supplementary material and code are available at http://imstat.org/aoas/supplements
Key words and phrases: Hierarchical models, approximate posterior inference, variational methods, text analysis.
TOPIC MODELS
DAVID M. BLEI
PRINCETON UNIVERSITY
JOHN D. LAFFERTY
CARNEGIE MELLON UNIVERSITY
1. INTRODUCTION
Scientists need new tools to explore and browse large collections of scholarly literature. Thanks to organizations such as JSTOR, which scan and index the original bound archives of many journals, modern scientists can search digital libraries spanning hundreds of years. A scientist, suddenly faced with access to millions of articles in her field, is not satisfied with simple search. Effectively using such collections requires interacting with them in a more structured way: finding articles similar to those of interest, and exploring the collection through the underlying topics that run through it.

The central problem is that this structure—the index of ideas contained in the articles and which other articles are about the same kinds of ideas—is not readily available in most modern collections, and the size and growth rate of these collections preclude us from building it by hand. To develop the necessary tools for exploring and browsing modern digital libraries, we require automated methods of organizing, managing, and delivering their contents.

In this chapter, we describe topic models, probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei et al. 2003; Griffiths and Steyvers 2004; Buntine and Jakulin 2004; Hofmann 1999; …)
LDA
LDA was introduced in the Promethean paper Latent Dirichlet Allocation, by Blei, Ng, and Jordan, in the Journal of Machine Learning Research, 2003.
[Figure 1: Graphical model representation of LDA, with variables α, β, θ, z, w and plates M (documents) and N (words per document). The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.]

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Biggl( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Biggr) d\theta. \qquad (3)

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Biggl( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Biggr) d\theta_d.

The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θ_d are document-level variables, sampled once per document. Finally, the variables z_{dn} and w_{dn} are word-level variables and are sampled once for each word in each document.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics.

Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). Indeed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as α and β in simple implementations of LDA, but we also consider fuller Bayesian approaches as well.
“…most implementations of LDA assume the distribution is symmetric. For the symmetric distribution, a high alpha-value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain a mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words. If, on the other hand, the distribution is asymmetric, a high alpha-value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, high beta-values mean each topic is more likely to contain a specific word mix defined by the base measure. In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.

More generally, these are concentration parameters for the Dirichlet distributions used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general: http://people.cs.umass.edu/~wallach/talks/priors.pdf”
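The intuition in that quote is easy to check numerically. Below is a small illustration (not from the slides or the paper; just a NumPy sketch with made-up values) that draws document-topic proportions from symmetric Dirichlet priors with a high and a low concentration value:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics (arbitrary for the illustration)

# Symmetric Dirichlet: the same concentration value repeated for every topic.
for alpha in (10.0, 0.1):
    # Each row is one document's topic proportions theta_d ~ Dir(alpha, ..., alpha).
    theta = rng.dirichlet(alpha * np.ones(K), size=3)
    print(f"alpha = {alpha}")
    print(np.round(theta, 2))

# With alpha = 10 every document mixes most topics fairly evenly;
# with alpha = 0.1 each document concentrates its mass on one or two topics.
# The same reasoning applies to beta and the topic-word distributions.
```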
Table 2.1: Symbols used in the derivation of the LDA Gibbs sampling rule.
N: the number of words in the corpus
1 ≤ i, j ≤ N: index of words in the corpus
W = {w_i}: the corpus; w_i denotes a word
Z = {z_i}: latent topics assigned to the words in W
W_¬i = W \ w_i: the corpus excluding w_i
Z_¬i = Z \ z_i: latent topics excluding z_i
K: the number of topics, specified as a parameter
V: the number of unique words in the vocabulary
α: the parameters of the topic Dirichlet prior
β: the parameters of the word Dirichlet prior
Ω_{d,k}: count of words in document d assigned topic k; Ω_d denotes the d-th row of the matrix Ω
Λ_{k,v}: count of word v in the corpus assigned topic k; Λ_k denotes the k-th row of the matrix Λ
Ω^{¬i}_{d,k}: like Ω_{d,k} but excluding w_i and z_i
Λ^{¬i}_{k,v}: like Λ_{k,v} but excluding w_i and z_i
Θ = {θ_{d,k}}: θ_{d,k} = P(z = k | d), θ_d = P(z | d)
Φ = {φ_{k,v}}: φ_{k,v} = P(w = v | z = k), φ_k = P(w | z = k)

The Joint Distribution of LDA. We start from deriving the joint distribution

p(Z, W \mid \alpha, \beta) = p(W \mid Z, \beta)\, p(Z \mid \alpha), \qquad (2.12)

which is the basis of the derivation of the Gibbs updating rule and the parameter estimation rule. As p(W | Z, β) and p(Z | α) depend on Φ and Θ respectively, we derive them separately.

According to the definition of LDA, we have

p(W \mid Z, \beta) = \int p(W \mid Z, \Phi)\, p(\Phi \mid \beta)\, d\Phi, \qquad (2.13)

where p(Φ | β) has a Dirichlet distribution:

p(\Phi \mid \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1}, \qquad (2.14)

and p(W | Z, Φ) has a multinomial distribution:

p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Lambda_{k,v}}, \qquad (2.15)

where Λ is a K × V count matrix and Λ_{k,v} is the number of times that topic k is assigned to word v. With W and Z defined by (2.1) and (2.3) respectively, we can represent Λ_{k,v} mathematically by

\Lambda_{k,v} = \sum_{i=1}^{N} \mathbb{I}\{w_i = v \wedge z_i = k\}. \qquad (2.16)
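The excerpt above stops at the joint distribution. Integrating out Θ and Φ yields the standard collapsed Gibbs update, p(z_i = k | Z_¬i, W) ∝ (Ω^¬i_{d,k} + α) · (Λ^¬i_{k,v} + β) / (Σ_v Λ^¬i_{k,v} + Vβ) for symmetric priors. The sketch below is a plain single-machine NumPy illustration of that sampler, not the distributed implementation from Wang's note; the original symbol for the word-count matrix did not survive extraction, so Λ is used here as in the reconstructed table.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (symmetric scalar priors).

    docs: list of documents, each a list of word ids in [0, V).
    Update rule (notation of Table 2.1):
        p(z_i = k | Z_-i, W) is proportional to
            (Omega_-i[d,k] + alpha) * (Lambda_-i[k,v] + beta) / (sum_v Lambda_-i[k,v] + V*beta)
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    Omega = np.zeros((D, K))      # document-topic counts
    Lambda = np.zeros((K, V))     # topic-word counts
    Lambda_k = np.zeros(K)        # row sums of Lambda
    Z = []                        # topic assignment for every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = rng.integers(K, size=len(doc))
        Z.append(zs)
        for v, k in zip(doc, zs):
            Omega[d, k] += 1
            Lambda[k, v] += 1
            Lambda_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = Z[d][i]
                # Remove the current token from the counts (the "not i" quantities).
                Omega[d, k_old] -= 1
                Lambda[k_old, v] -= 1
                Lambda_k[k_old] -= 1
                # Sample a new topic from the collapsed conditional.
                p = (Omega[d] + alpha) * (Lambda[:, v] + beta) / (Lambda_k + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                Z[d][i] = k_new
                Omega[d, k_new] += 1
                Lambda[k_new, v] += 1
                Lambda_k[k_new] += 1

    # Point estimates of theta (doc-topic) and phi (topic-word) from the final counts.
    theta = (Omega + alpha) / (Omega + alpha).sum(axis=1, keepdims=True)
    phi = (Lambda + beta) / (Lambda + beta).sum(axis=1, keepdims=True)
    return theta, phi


# Toy usage: four tiny "documents" over a vocabulary of six word ids.
docs = [[0, 1, 0, 2], [1, 0, 0, 1], [3, 4, 5, 3], [4, 5, 5, 3]]
theta, phi = lda_gibbs(docs, K=2, V=6)
print(np.round(theta, 2))
```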
LDA Implementation
MCMC
VI - Variational Inference
Variational Inference: A Review for Statisticians (Blei,
Kucukelbir, McAuliffe)
Variational inference for Dirichlet process mixtures
(Blei, Jordan)
LDA is typically part of a pipeline: pre-LDA, strip stop words, etc.; post-LDA, the document-topic proportions can be treated as a dimensionality reduction (see the sketch below).
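A concrete version of that pipeline, sketched with scikit-learn (the three documents are placeholders; stop-word stripping happens in the vectorizer, and the fitted document-topic matrix serves as the reduced representation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the gene mutation was observed in yeast cells",
    "the telescope observed the galaxy and its stars",
    "dna repair genes in mutant yeast strains",
]  # placeholder documents

# Pre-LDA: strip English stop words and build a term-count matrix.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Fit LDA; doc_topic_prior and topic_word_prior play the roles of alpha and beta.
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.1,
                                topic_word_prior=0.01, random_state=0)

# Post-LDA: the document-topic proportions are a K-dimensional representation
# of each document, i.e. LDA used as dimensionality reduction.
doc_topics = lda.fit_transform(X)
print(doc_topics.shape)  # (n_documents, n_topics)
```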
Variational Inference: A Review for Statisticians
David M. Blei
Department of Computer Science and Statistics
Columbia University
Alp Kucukelbir
Department of Computer Science
Columbia University
Jon D. McAuliffe
Department of Statistics
University of California, Berkeley
November 3, 2016
Abstract
One of the core problems of modern statistics is to approximate difficult-to-compute
probability densities. This problem is especially important in Bayesian statistics, which
frames all inference about unknown quantities as a calculation involving the posterior
density. In this paper, we review variational inference (VI), a method from machine
learning that approximates probability densities through optimization. VI has been used
in many applications and tends to be faster than classical methods, such as Markov
chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities
and then to find the member of that family which is close to the target. Closeness
is measured by Kullback-Leibler divergence. We review the ideas behind mean-field
variational inference, discuss the special case of VI applied to exponential family models,
present a full example with a Bayesian mixture of Gaussians, and derive a variant that
uses stochastic optimization to scale up to massive data. We discuss modern research
in VI and highlight important open problems. VI is powerful, but it is not yet well
understood. Our hope in writing this paper is to catalyze statistical research on this class
of algorithms.
Keywords: Algorithms; Statistical Computing; Computationally Intensive Methods.
arXiv:1601.00670v4 [stat.CO] 2 Nov 2016
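For reference, the optimization described in the abstract is usually written as follows; minimizing the KL divergence to the posterior is equivalent to maximizing the evidence lower bound (ELBO). This restatement is standard material rather than a quote from the paper:

```latex
% Variational inference as optimization: choose the member of the family Q
% closest in KL divergence to the exact posterior p(z | x).
q^{*}(\mathbf{z}) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\bigr)

% Equivalently, maximize the evidence lower bound (ELBO):
\mathrm{ELBO}(q) \;=\; \mathbb{E}_{q}\bigl[\log p(\mathbf{x}, \mathbf{z})\bigr]
  - \mathbb{E}_{q}\bigl[\log q(\mathbf{z})\bigr]
  \;=\; \log p(\mathbf{x}) - \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\bigr)
```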
Bayesian Analysis (2006) 1, Number 1, pp. 121–144
Variational Inference for Dirichlet Process
Mixtures
David M. Blei∗
Michael I. Jordan†
Abstract. Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem.

Keywords: Dirichlet processes, hierarchical models, variational inference, image processing, Bayesian computation

1 Introduction

The methodology of Monte Carlo Markov chain (MCMC) sampling has energized Bayesian statistics for more than a decade, providing a systematic approach to the computation of likelihoods and posterior distributions, and permitting the deployment of Bayesian methods in a rapidly growing number of applied problems. However, while an unquestioned success story, MCMC is not an unqualified one—MCMC methods can be slow to converge and their convergence can be difficult to diagnose. While further research on sampling is needed, it is also important to explore alternatives, particularly in the context of large-scale problems.

One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001; Wainwright and Jordan …
Topic modeling for the newbie
LDA assumes a probabilistic generative model for documents. For our purposes, that means the following (a short simulation follows the source link below):
There is some fixed number K of topics.
There is a random variable that assigns each topic an associated probability
distribution over words. You should think of this distribution as the probability of
seeing word w given topic k.
There is another random variable that assigns each document a probability
distribution over topics. You should think of this distribution as the mixture of
topics in document d.
Each word in a document was generated by first randomly picking a topic (from
the document’s distribution of topics) and then randomly picking a word (from the
topic’s distribution of words).
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
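Here is a minimal simulation of that generative story (a sketch with made-up sizes and hyperparameters; in practice the topics are unknown and must be inferred from the observed words):

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, n_docs, doc_len = 3, 8, 4, 10        # assumed sizes for the sketch
alpha, beta = 0.5, 0.1                     # Dirichlet hyperparameters

# Each topic k is a distribution over the V vocabulary words: p(w | z = k).
topics = rng.dirichlet(beta * np.ones(V), size=K)

documents = []
for _ in range(n_docs):
    # Each document gets its own mixture of topics: p(z | d).
    theta = rng.dirichlet(alpha * np.ones(K))
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)         # pick a topic from the document's mixture
        w = rng.choice(V, p=topics[z])     # pick a word from that topic's distribution
        words.append(w)
    documents.append(words)

print(documents[0])  # a document is just a bag of word ids under this model
```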
Libraries
Spark
https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
R
Package ‘lda’ implements a fast collapsed Gibbs sampler
written in C
scikit-learn
Large Scale Topic Modeling: Improvements to LDA on Apache Spark (The Databricks Engineering Blog)
September 22, 2015, by Feynman Liang, Yuhao Yang and Joseph Bradley
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
This blog was written by Feynman Liang and Joseph Bradley from Databricks, and Yuhao Yang from Intel.
What are people discussing on Twitter? To catch up on distributed computing, what news articles should I read? These are questions that can be
answered by topic models, a technique for analyzing the topics present in collections of documents. This blog post discusses improvements in Apache
Spark 1.4 and 1.5 for topic modeling using the powerful Latent Dirichlet Allocation (LDA) algorithm.
Spark 1.4 and 1.5 introduced an online algorithm for running LDA incrementally, support for more queries on trained LDA models, and performance
metrics such as likelihood and perplexity. We give an example here of training a topic model over a dataset of 4.5 million Wikipedia articles.
Topic models and LDA
Topic models take a collection of documents and automatically infer the topics being discussed. For example, when we run Spark’s LDA on a dataset of
4.5 million Wikipedia articles, we can obtain topics like those in the table below.
Table 1: Example LDA topics learned from Wikipedia articles dataset
In addition, LDA tells us which topics each document is about; document X might be 30% about Topic 1 (“politics”) and 70% about Topic 5 (“airlines”).
Latent Dirichlet Allocation (LDA) has been one of the most successful topic models in practice. See our previous blog post on LDA
(https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html) to learn more.
A new online variational learning algorithm
Online variational inference is a technique for learning an LDA model by processing the data incrementally in small batches. By processing in small
batches, we are able to easily scale to very large datasets. MLlib implements an algorithm for performing online variational inference originally described
by Hoffman et al. (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf).
Performance comparison
The topics shown previously were learned using the newly developed online variational learning algorithm. Comparing timing results, we see a significant speedup from the new online algorithm over the older EM algorithm.
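Spark's MLlib API is not reproduced here, but the mini-batch idea behind online variational inference can be sketched with scikit-learn's online LDA, which performs one variational update per batch via partial_fit (the documents below are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder stream of documents; in practice these would arrive in batches
# from a much larger corpus (e.g. a Wikipedia dump).
batches = [
    ["politics election vote party", "party leader election campaign"],
    ["airline flight airport pilot", "flight delay airline passenger"],
]

# A fixed vocabulary is needed across batches; a real streaming pipeline might
# use a hashing vectorizer instead of fitting on the whole stream up front.
vectorizer = CountVectorizer()
vectorizer.fit([doc for batch in batches for doc in batch])

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                learning_offset=10.0, random_state=0)

for batch in batches:
    X = vectorizer.transform(batch)
    lda.partial_fit(X)   # one online variational update per mini-batch

print(lda.components_.shape)  # (n_topics, vocabulary size)
```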
Eight to Late
Sensemaking and Analytics for Organizations
A gentle introduction to topic modeling using R
Introduction
The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty
much what Google and other search engines do routinely…and they do it well.  However, as useful as
this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large
collection of documents but have no idea what they are about. One of the first things you might want to
do is to classify these documents into topics or themes. Among other things this would help you figure
out if there’s anything interest while also directing you to the relevant subset(s) of the corpus. For small
collections, one could do this by simply going through each document but this is clearly infeasible for
corpuses containing thousands of documents.
Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of
documents into themes
Aneesha Bakharia
Data Science, Learning Analytics, Electronics — Brisbane, Australia
Sep 1, 2016
Conclusions
LDA is part of an evolving family of topic modeling algorithms, which has advanced from tf-idf through NMF, LSI, and pLSI to LDA.
LDA is also being elaborated, for example into supervised models and into variants that weigh in the index, abstract, and bibliography.
Questions
The paper linked below argues for deep learning in NLP. In some intuitive sense, are we leaving information on the table? Can we use deep learning for topic modeling? How beneficial, and how practical, would it be to fold it into our NLP pipelines? "For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors. Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs."
http://u.cs.biu.ac.il/~yogo/nnlp.pdf
Would LDA benefit from adding 2-gram features to each document?
Google is not just keyword searching. Can we use probabilistic topic modeling as part of a web search algorithm?
In the introduction, Blei seems to suggest that we could index the web using topic models.
I wonder what the practical constraints to this are.
Where have you used topic modeling? Where could you imagine it used?
References
D. Blei (2012), Probabilistic Topic Models. http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
D. Blei, A. Ng, M. Jordan (2003), Latent Dirichlet Allocation. http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf
J. Pritchard, M. Stephens, P. Donnelly (2000), Inference of Population Structure Using Multilocus Genotype Data. http://www.genetics.org/content/genetics/155/2/945.full.pdf
D. Blei (2017), home page at Columbia University. http://www.cs.columbia.edu/~blei/
D. Blei (2009), Topic Models, videos from the Machine Learning Summer School, Cambridge. http://videolectures.net/mlss09uk_cambridge/
D. Blei (2017), Google Scholar profile. https://scholar.google.com/scholar?hl=en&q=dm+blei
D. Mimno (2017), Topic Modeling Bibliography. https://mimno.infosci.cornell.edu/topics.html
D. Blei, J. Lafferty (2007), A Correlated Topic Model of Science. http://www.cs.columbia.edu/~blei/papers/BleiLafferty2007.pdf
D. Blei, J. Lafferty (2009), Topic Models. http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
Y. Wang (2008), Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details. https://cxwangyi.files.wordpress.com/2012/01/llt.pdf
D. Blei, A. Kucukelbir, J. McAuliffe (2016), Variational Inference: A Review for Statisticians. https://arxiv.org/abs/1601.00670
D. Blei, M. Jordan (2004), Variational Inference for Dirichlet Process Mixtures. http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf
M. Beaugureau (2015), Topic Modeling for the Newbie. https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
F. Liang, Y. Yang, J. Bradley (2015), Large Scale Topic Modeling: Improvements to LDA on Apache Spark. https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
K. Awati (2015), A Gentle Introduction to Topic Modeling Using R. https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
A. Bakharia (2016), Topic Modeling with Scikit Learn. https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730

Probabilistic Topic Models

  • 1.
    Probabilistic Topic Models Surveyinga suite of algorithms that offer a solution to managing large document archives by David M. Blei http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf Presented by Steve Follmer Bay Area NLP Reading Group presents
  • 2.
    Origins Today we arereading a survey paper from 2012 by David Blei, the leading exponent of probabilistic topic modeling. Blei kicked things off in a 2003 paper co-authored with Michael Jordan and Andrew Ng, Latent Dirichlet Allocation. As Blei himself notes, LDA was first published in 2000 in a paper on population genetics. Further, probabilistic topic modeling does not require LDA, as we will see later in the presentation.
  • 3.
    DOI:10.1145/2133806.2133826 Surveying a suiteof algorithms that offer a solution to managing large document archives. BY DAVID M. BLEI Probabilistic Topic Models AS OUR COLLECTIVE knowledge continues to be digitized and stored—in the form of news, blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information. Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper—for- eign policy, national affairs, sports. We could zoom in on a theme of in- terest, such as foreign policy, to reveal various aspects of it—Chinese foreign policy, the conflict in the Middle East, the U.S.’s relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would beanewkindofwindowthroughwhich to explore and digest the collection. But we do not interact with elec- tronic archives in this way. While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilis- tic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algo- rithms are statistical methods that ana- lyze the words of the original texts to discover the themes that run through them, how those themes are connected
  • 4.
    Journal of MachineLearning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03 Latent Dirichlet Allocation David M. Blei BLEI@CS.BERKELEY.EDU Computer Science Division University of California Berkeley, CA 94720, USA Andrew Y. Ng ANG@CS.STANFORD.EDU Computer Science Department Stanford University Stanford, CA 94305, USA Michael I. Jordan JORDAN@CS.BERKELEY.EDU Computer Science Division and Department of Statistics University of California Berkeley, CA 94720, USA Editor: John Lafferty Abstract We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for
  • 5.
    Copyright  2000by the Genetics Society of America Inference of Population Structure Using Multilocus Genotype Data Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom Manuscript received September 23, 1999 Accepted for publication February 18, 2000 ABSTRACT We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more popula- tions if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individu- als. We showthat the method can produce highlyaccurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http:// www.stats.ox.ac.uk/ zpritch/ home.html. IN applications of population genetics, it is often use- populationsbased on these subjective criteria represents a natural assignment in genetic terms, and it would beful to classify individuals in a sample into popula- tions. In one scenario, the investigator begins with a useful to be able to confirm that subjective classifications are consistent with genetic information and hence ap-sample of individuals and wants to say something about the properties of populations. For example, in studies propriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic”of human evolution, the population is often considered to be the unit of interest, and a great deal of work has population structure—i.e., population structure that is difficult to detect using visible characters, but may befocused on learning about the evolutionary relation- ships of modern populations (e.g., Caval l i et al. 1994). significant in genetic terms. For example, when associa- tion mapping is used to find disease genes, the presenceIn a second scenario, the investigator begins with a set of predefined populations and wishes to classifyindivid- of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewensuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A and Spiel man 1995). The problem of cryptic population structure also arises in the context of DNA fingerprint-standard approach involves sampling DNA from mem- bers of a number of potential source populations and ing for forensics, where it is important to assess the
  • 6.
  • 7.
    9/2017 www.cs.columbia.edu/~blei/ I ama professor of Statistics and Computer Science at Columbia University. I am also a member of the Columbia Data Science Institute. I work in the fields of machine learning and Bayesian statistics. See my CV and publications . My research interests include: Topic models Probabilistic modeling Approximate Bayesian inference Here are two recent talks: Probabilistic Topic Models and User Behavior Variational Inference: Foundations and Innovations Most of our publications are attached to open-source software. See our GitHub page. We released Edward: A library for probabilistic modeling, inference, and criticism. Machine Learning at Columbia Columbia has a thriving machine learning community, with many faculty and researchers across departments. The MLColumbia Google group is a good source of information about talks and other events on campus. Teaching In Spring 2017, I am teaching Applied Causality. My previous courses are here. David M. Blei Columbia University david.blei@columbia.edu About Topic modeling Courses Publications
  • 8.
    4/18/2017 david blei- Google Scholar David Blei Professor of Statistics and Computer Science, Columbia University Verified email at columbia.edu Cited by 45111 DM Blei, AY Ng, MI Jordan ­ Journal of machine Learning research, 2003 ­ jmlr.org Abstract We describe latent Dirichlet allocation (LDA), a generative probabilistic model for   collections of discrete data such as text corpora. LDA is a three­level hierarchical Bayesian   model, in which each item of a collection is modeled as a finite mixture over an underlying   Cited by 18352  Related articles  All 123 versions  Cite  Save YW Teh, MI Jordan, MJ Beal, DM Blei ­ NIPS, 2004 ­ papers.nips.cc Abstract We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian   model for clustering problems involving multiple groups of data. Each group of data is   modeled with a mixture, with the number of components being open­ended and inferred   Cited by 2896  Related articles  All 94 versions  Cite  Save  More JD Mcauliffe, DM Blei ­ Advances in neural information processing …, 2008 ­ papers.nips.cc Abstract We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of   labelled documents. The model accommodates a variety of response types. We derive a   maximum­likelihood procedure for parameter estimation, which relies on variational   Cited by 1739  Related articles  All 44 versions  Cite  Save …, P Duygulu, D Forsyth, N Freitas, DM Blei… ­ Journal of machine …, 2003 ­ jmlr.org Abstract We present a new approach for modeling multi­modal data sets, focusing on the   specific case of segmented images with associated text. Learning the joint distribution of   image regions and words has many applications. We consider in detail predicting words   Cited by 1694  Related articles  All 45 versions  Cite  Save DM Blei ­ Communications of the ACM, 2012 ­ dl.acm.org as OUr COLLeCTive knowledge continues to be digitized and stored—in the form of news,   blogs, Web pages, scientific articles, books, images, sound, video, and social networks—it   User profiles for david blei Latent dirichlet allocation Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. Supervised topic models Matching words and pictures Probabilistic topic models
  • 9.
    4/18/2017 Topic modelingbibliography Bibliometrics Cross-language Evaluation Implementations Inference NLP Networks Non-parametric Scalability Social media Temporal Theory User interface Vision Where to start Topic Modeling Bibliography Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing. Mixed Membership Stochastic Blockmodels. JMLR (9) 2008 pp. 1981-2014. Networks [BibTeX] Loulwah AlSumait, Daniel Barbará, James Gentle, Carlotta Domeniconi. Topic Significance Ranking of LDA Generative Models. ECML (2009). Evaluation [BibTeX] David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu. Statistical Debugging using Latent Topic Models. ECML (2007). [BibTeX] David Andrzejewski, Xiaojin Zhu, Mark Craven. Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. ICML (2009). [BibTeX] David Andrzejewski, Xiaojin Zhu, Mark Craven, Ben Recht. A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation using First-Order Logic. IJCAI
  • 10.
    What "Topic modeling algorithmsare statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time."
  • 11.
    Why Algorithms for managinglarge document archives There's lots of data. And its growing exponentially. A lot of it is unstructured. We can better understand, access, search, and use this data if we can organize it. And given the growing scale, we need to do that algorithmically on computers. Topic Models can find themes and organize documents automatically, with results that can be superior to keyword searching. Also, can organize these results across time.

  • 12.
    Topic Modeling inAction Lets look at an example of topic modeling. From"A Correlated Topic Model of Science" (Annals of Applied Statistics, 2007) The goal is to take an unstructured corpora of text documents, and infer topics to group the documents together. Can similarly model genetics, images, social networks. dance steps.

  • 13.
    7 www.cs.cmu.edu/~lemur/science/2.html WORDS gene dna mutations mutation mutant yeast cells genes mutants type wild recombination fig protein growth telomerase strains strain repair phenotype plasmid allele human cell two cerevisiae RELATED TOPICS genedna mutations mutation mutant plants gene plant genes expression protein cell kinase activity cycle cells leukemia cell abl patients dna replication cell chromosome chromosomes protein sequence amino cdna fig gene disease human chromosome cancer rna dna site structure polymerase cells cell fig expression human RELATED DOCUMENTS "Requirement of the Yeast RTH1 5' to 3' Exonuclease for the Stability of Simple Repetitive DNA" (1995) "Adaptive Mutation by Deletions in Small Mononucleotide Repeats" (1994) "Recombination in Adaptive Mutation" (1994) "Evidence That F Plasmid Transfer Replication Underlies Apparent Adaptive Mutation" (1995) "Cdc13p: A Single‐Strand Telomeric DNA‐Binding Protein with a Dual Role in Yeast Telomere Maintenance" (1996) "Adaptive Mutation in Escherichia coli: A Role for Conjugation" (1995) "Two Modes of Survival of Fission Yeast Without Telomerase" (1998) "Association of Increased Spontaneous Mutation Rates with High Levels of Transcription in Yeast" (1995) "Removal of Nonhomologous DNA Ends in Double‐Strand Break Recombination: The Role of the Yeast Ultraviolet Repair Gene RAD1" (1992) "Adaptive Mutation: Who's Really in the Garden?" (1995) "Evidence that Gene Amplification Underlies Adaptive Mutability of the Bacterial lac Operon" (1998) "Involvement of the Silencer and UAS Binding Protein RAP1 in Regulation of Telomere Length" (1990) "Abnormal Chromosome Behavior in Neurospora Mutants Defective in DNA Methylation" (1993) "Telomeres, Telomerase, and Cancer" (1995) "HSP104 Required for Induced Thermotolerance" (1990) "Est1 and Cdc13 as Comediators of Telomerase Access" (1999) "Conditional Mutator Phenotypes in hMSH2‐Deficient Tumor Cell Lines" (1997) "Appropriate Partners Make Good Matches" (1995)
  • 14.
    © Institute ofMathematical Statistics, 2007 A CORRELATED TOPIC MODEL OF SCIENCE1 BY DAVID M. BLEI AND JOHN D. LAFFERTY Princeton University and Carnegie Mellon University Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A lim- itation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet dis- tribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We ap- ply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections. 1. Introduction. Large collections of documents are readily available on- line and widely accessed by diverse communities. As a notable example, schol- arly articles are increasingly published in electronic form, and historical archives are being scanned and made accessible. The not-for-profit organization JSTOR (www.jstor.org) is currently one of the leading providers of journals to the schol- arly community. These archives are created by scanning old journals and running an optical character recognizer over the pages. JSTOR provides the original scans on-line, and uses their noisy version of the text to support keyword search. Since the data are largely unstructured and comprise millions of articles spanning cen- turies of scholarly work, automated analysis is essential. The development of new tools for browsing, searching and allowing the productive use of such archives is thus an important technological challenge, and provides new opportunities for statistical modeling. Received March 2007; revised April 2007. 1Supported in part by NSF Grants IIS-0312814 and IIS-0427206, the DARPA CALO project and a grant from Google. Supplementary material and code are available at http://imstat.org/aoas/supplements Key words and phrases. Hierarchical models, approximate posterior inference, variational meth- ods, text analysis.
  • 15.
    TOPIC MODELS DAVID M.BLEI PRINCETON UNIVERSITY JOHN D. LAFFERTY CARNEGIE MELLON UNIVERSITY 1. INTRODUCTION Scientists need new tools to explore and browse large collections of schol- arly literature. Thanks to organizations such as JSTOR, which scan and index the original bound archives of many journals, modern scientists can search digital libraries spanning hundreds of years. A scientist, suddenly faced with access to millions of articles in her field, is not satisfied with simple search. Effectively using such collections requires interacting with them in a more structured way: finding articles similar to those of interest, and exploring the collection through the underlying topics that run through it. The central problem is that this structure—the index of ideas contained in the articles and which other articles are about the same kinds of ideas—is not readily available in most modern collections, and the size and growth rate of these collections preclude us from building it by hand. To develop the necessary tools for exploring and browsing modern digital libraries, we require automated methods of organizing, managing, and delivering their contents. In this chapter, we describe topic models, probabilistic models for uncov- ering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts Blei et al. (2003); Grif- fiths and Steyvers (2004); Buntine and Jakulin (2004); Hofmann (1999);
  • 16.
    LDA LDA introduced inthe Promethean paper Latent Dirichlet Allocation, by Blei, Ng, and Jordan, in Journal of Machine Learning 2003
  • 17.
    α z wθ β M N Figure1: Graphical model representation of LDA. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. where p(zn |θ) is simply θi for the unique i such that zi n = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document: p(w|α,β) = Z p(θ|α) N ∏ n=1 ∑ zn p(zn |θ)p(wn |zn,β) ! dθ. (3) Finally, taking the product of the marginal probabilities of single documents, we obtain the proba- bility of a corpus: p(D|α,β) = M ∏ d=1 Z p(θd |α) Nd ∏ n=1 ∑ zdn p(zdn |θd)p(wdn |zdn,β) ! dθd. The LDA model is represented as a probabilistic graphical model in Figure 1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus- level parameters, assumed to be sampled once in the process of generating a corpus. The variables θd are document-level variables, sampled once per document. Finally, the variables zdn and wdn are word-level variables and are sampled once for each word in each document. It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics. Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as con- ditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (Morris, 1983). In- deed, as we discuss in Section 5, we adopt the empirical Bayes approach to estimating parameters such as α and β in simple implementations of LDA, but we also consider fuller Bayesian approaches as well.
  • 18.
“…most implementations of LDA assume the distribution is symmetric. For the symmetric distribution, a high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts fewer such constraints on documents and means that it is more likely that a document may contain a mixture of just a few, or even only one, of the topics. Likewise, a high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words. If, on the other hand, the distribution is asymmetric, a high alpha value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, a high beta value means each topic is more likely to contain a specific word mix defined by the base measure. In practice, a high alpha value will lead to documents being more similar in terms of what topics they contain. A high beta value will similarly lead to topics being more similar in terms of what words they contain. More generally, these are concentration parameters for the Dirichlet distributions used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general: http://people.cs.umass.edu/~wallach/talks/priors.pdf”
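To build intuition for these concentration parameters, a quick sketch (illustrative values only, not tied to any particular corpus) is to sample document-topic distributions from a symmetric Dirichlet at several alpha values: small alpha yields sparse, peaked topic mixtures, while large alpha yields near-uniform ones.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5  # hypothetical number of topics

for alpha in (0.1, 1.0, 10.0):
    # Draw a few document-topic distributions theta ~ Dir(alpha, ..., alpha).
    thetas = rng.dirichlet(np.full(K, alpha), size=3)
    print(f"alpha = {alpha}")
    for theta in thetas:
        print("  ", np.round(theta, 2))
# Small alpha: most of the mass lands on one or two topics (sparse mixtures).
# Large alpha: mass is spread roughly evenly across all K topics.
```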
  • 19.
Table 2.1: Symbols used in the derivation of the LDA Gibbs sampling rule.
N: the number of words in the corpus
1 ≤ i, j ≤ N: index of words in the corpus
W = {w_i}: the corpus, where w_i denotes a word
Z = {z_i}: latent topics assigned to words in W
W_¬i = W \ {w_i}: the corpus excluding w_i
Z_¬i = Z \ {z_i}: latent topics excluding z_i
K: the number of topics, specified as a parameter
V: the number of unique words in the vocabulary
α: the parameters of the topic Dirichlet prior
β: the parameters of the word Dirichlet prior
Ω_{d,k}: count of words in document d assigned topic k; Ω_d denotes the d-th row of matrix Ω
Ψ_{k,v}: count of word v in the corpus assigned topic k; Ψ_k denotes the k-th row of matrix Ψ
Ω^{¬i}_{d,k}: like Ω_{d,k} but excluding w_i and z_i
Ψ^{¬i}_{k,v}: like Ψ_{k,v} but excluding w_i and z_i
Θ = {θ_{d,k}}: θ_{d,k} = P(z = k | d), θ_d = P(z | d)
Φ = {φ_{k,v}}: φ_{k,v} = P(w = v | z = k), φ_k = P(w | z = k)

The Joint Distribution of LDA. We start from deriving the joint distribution

p(Z, W \mid \alpha, \beta) = p(W \mid Z, \beta)\, p(Z \mid \alpha),   (2.12)

which is the basis of the derivation of the Gibbs updating rule and the parameter estimation rule. As p(W | Z, β) and p(Z | α) depend on Φ and Θ respectively, we derive them separately. According to the definition of LDA, we have

p(W \mid Z, \beta) = \int p(W \mid Z, \Phi)\, p(\Phi \mid \beta)\, d\Phi,   (2.13)

where p(Φ | β) has a Dirichlet distribution:

p(\Phi \mid \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1},   (2.14)

and p(W | Z, Φ) has a multinomial distribution:

p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Psi_{k,v}},   (2.15)

where Ψ is a K × V count matrix and Ψ_{k,v} is the number of times that topic k is assigned to word v. With W and Z defined by (2.1) and (2.3) respectively, we can represent Ψ_{k,v} mathematically by

\Psi_{k,v} = \sum_{i=1}^{N} I\{w_i = v \wedge z_i = k\}.   (2.16)
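The count matrices above are exactly the state a collapsed Gibbs sampler maintains. The following is a minimal, unoptimized sketch of the standard collapsed Gibbs update for LDA, assuming symmetric scalar priors alpha and beta and a corpus given as lists of word ids; it illustrates the update rule that this derivation leads to rather than reproducing Wang's derivation verbatim.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (minimal sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Maintains the count matrices described above:
      ndk[d, k] - words in document d assigned topic k   (Omega)
      nkv[k, v] - word v assigned topic k over the corpus (Psi)
      nk[k]     - total words assigned topic k
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))
    nkv = np.zeros((K, V))
    nk = np.zeros(K)
    z = []  # topic assignment for every word token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkv[k, w] += 1
            nk[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment (the "¬i" counts).
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                # Standard collapsed Gibbs conditional:
                # p(z_i = k | ...) ∝ (ndk + alpha) * (nkv + beta) / (nk + V*beta)
                p = (ndk[d] + alpha) * (nkv[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1

    # Point estimates of theta and phi from the final counts.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkv + beta) / (nkv + beta).sum(axis=1, keepdims=True)
    return theta, phi
```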
  • 20.
LDA Implementation. Posterior inference for LDA is done either with MCMC (e.g., Gibbs sampling) or with VI (variational inference). Two references: Variational Inference: A Review for Statisticians (Blei, Kucukelbir, McAuliffe) and Variational Inference for Dirichlet Process Mixtures (Blei, Jordan). In practice LDA is part of a pipeline: pre-LDA, strip stop words and do other text cleanup; post-LDA, the fitted document-topic proportions can be treated as a dimensionality reduction of the corpus. A minimal sketch of such a pipeline follows.
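As a concrete, hedged illustration of such a pipeline, the sketch below uses scikit-learn's CountVectorizer for pre-processing (stop-word removal) and LatentDirichletAllocation for the model, then treats the resulting document-topic matrix as a reduced representation. The miniature corpus and parameter values are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up miniature corpus for illustration.
docs = [
    "the senate debated the foreign policy bill",
    "the team won the championship game last night",
    "genome sequencing reveals population structure",
]

# Pre-LDA: tokenize and strip English stop words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA (scikit-learn uses variational inference under the hood).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)    # post-LDA: documents as topic mixtures

print(doc_topics)                    # shape (n_docs, n_topics)

# Inspect top words per topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```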
  • 21.
Variational Inference: A Review for Statisticians. David M. Blei (Department of Computer Science and Statistics, Columbia University), Alp Kucukelbir (Department of Computer Science, Columbia University), Jon D. McAuliffe (Department of Statistics, University of California, Berkeley). November 3, 2016. Abstract: One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this class of algorithms. Keywords: Algorithms; Statistical Computing; Computationally Intensive Methods. arXiv:1601.00670v4 [stat.CO], 2 Nov 2016
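As a toy illustration of the core idea (posit a family of densities, then pick the member closest in KL divergence to the target): below, the "family" is the set of factorized (mean-field) distributions over two binary variables, and the closest member to a made-up joint p is found by grid search over KL(q ‖ p). This is only a cartoon; real variational inference maximizes an evidence lower bound, because the target is usually known only up to a normalizing constant.

```python
import numpy as np
from itertools import product

# A made-up target joint distribution p(x, y) over two binary variables.
p = np.array([[0.30, 0.10],
              [0.05, 0.55]])

def kl(q, p):
    """KL(q || p) for two distributions on the same finite grid."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Mean-field family: q(x, y) = q1(x) * q2(y), parameterized by
# a = q1(x=1) and b = q2(y=1). Grid-search for the member closest to p.
best = None
for a, b in product(np.linspace(0.01, 0.99, 99), repeat=2):
    q = np.outer([1 - a, a], [1 - b, b])
    d = kl(q, p)
    if best is None or d < best[0]:
        best = (d, a, b)

print("closest factorized q: q1(x=1)=%.2f, q2(y=1)=%.2f, KL=%.4f"
      % (best[1], best[2], best[0]))
```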
  • 22.
Bayesian Analysis (2006) 1, Number 1, pp. 121–144. Variational Inference for Dirichlet Process Mixtures. David M. Blei, Michael I. Jordan. Abstract: Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics, and the development of Monte-Carlo Markov chain (MCMC) sampling methods for DP mixtures has enabled the application of nonparametric Bayesian methods to a variety of practical data analysis problems. However, MCMC sampling can be prohibitively slow, and it is important to explore alternatives. One class of alternatives is provided by variational methods, a class of deterministic algorithms that convert inference problems into optimization problems (Opper and Saad 2001; Wainwright and Jordan 2003). Thus far, variational methods have mainly been explored in the parametric setting, in particular within the formalism of the exponential family (Attias 2000; Ghahramani and Beal 2001; Blei et al. 2003). In this paper, we present a variational inference algorithm for DP mixtures. We present experiments that compare the algorithm to Gibbs sampling algorithms for DP mixtures of Gaussians and present an application to a large-scale image analysis problem. Keywords: Dirichlet processes, hierarchical models, variational inference, image processing, Bayesian computation. 1 Introduction. The methodology of Monte Carlo Markov chain (MCMC) sampling has energized Bayesian statistics for more than a decade, providing a systematic approach to the computation of likelihoods and posterior distributions, and permitting the deployment of Bayesian methods in a rapidly growing number of applied problems. However, while an unquestioned success story, MCMC is not an unqualified one—MCMC methods can be slow to converge and their convergence can be difficult to diagnose. While further research on sampling is needed, it is also important to explore alternatives, particularly in the context of large-scale problems. One such class of alternatives is provided by variational inference methods (Ghahramani and Beal 2001; Jordan et al. 1999; Opper and Saad 2001; Wainwright and Jordan 2003).
  • 23.
Topic modeling for the newbie. LDA assumes a probabilistic model for documents. For our purposes, that means the following:
- There is some fixed number K of topics.
- There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
- There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
- Each word in a document was generated by first randomly picking a topic (from the document's distribution of topics) and then randomly picking a word (from the topic's distribution of words).
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
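That generative story translates almost line for line into code. The sketch below uses made-up toy numbers (two hand-written topics over a six-word vocabulary) rather than a fitted model: draw the document's topic mixture, then for each word draw a topic and then a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(2)

K, V = 2, 6                      # made-up: 2 topics, 6-word vocabulary
vocab = ["ball", "game", "team", "gene", "dna", "cell"]
alpha = np.full(K, 0.5)          # prior over each document's topic mixture
topics = np.array([
    [0.4, 0.3, 0.25, 0.02, 0.02, 0.01],   # a "sports"-like topic
    [0.01, 0.02, 0.02, 0.35, 0.3, 0.3],   # a "genetics"-like topic
])

def generate_document(n_words):
    theta = rng.dirichlet(alpha)             # the document's mixture of topics
    words = []
    for _ in range(n_words):
        k = rng.choice(K, p=theta)           # pick a topic from the mixture
        w = rng.choice(V, p=topics[k])       # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 2), words)
```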
  • 24.
  • 25.
Large Scale Topic Modeling: Improvements to LDA on Apache Spark (The Databricks Blog, September 22, 2015), by Feynman Liang and Joseph Bradley from Databricks, and Yuhao Yang from Intel.
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
What are people discussing on Twitter? To catch up on distributed computing, what news articles should I read? These are questions that can be answered by topic models, a technique for analyzing the topics present in collections of documents. This blog post discusses improvements in Apache Spark 1.4 and 1.5 for topic modeling using the powerful Latent Dirichlet Allocation (LDA) algorithm. Spark 1.4 and 1.5 introduced an online algorithm for running LDA incrementally, support for more queries on trained LDA models, and performance metrics such as likelihood and perplexity. We give an example here of training a topic model over a dataset of 4.5 million Wikipedia articles.
Topic models and LDA. Topic models take a collection of documents and automatically infer the topics being discussed. For example, when we run Spark's LDA on a dataset of 4.5 million Wikipedia articles, we can obtain topics like those in the table below. [Table 1: Example LDA topics learned from the Wikipedia articles dataset.] In addition, LDA tells us which topics each document is about; document X might be 30% about Topic 1 ("politics") and 70% about Topic 5 ("airlines"). Latent Dirichlet Allocation (LDA) has been one of the most successful topic models in practice. See the earlier Databricks post on LDA (https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html) to learn more.
A new online variational learning algorithm. Online variational inference is a technique for learning an LDA model by processing the data incrementally in small batches. By processing in small batches, we are able to easily scale to very large datasets. MLlib implements an algorithm for performing online variational inference originally described by Hoffman et al. (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf).
Performance comparison. The table of topics shown previously was learned using the newly developed online variational learning algorithm. If we compare timing results, we can see a significant speedup in using the new online algorithm over the old EM algorithm.
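As a hedged sketch of what this looks like in code, the snippet below uses Spark's DataFrame-based pyspark.ml API with the online optimizer; the input path, vocabulary size, and other parameter values are placeholders, not the blog post's actual settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

# Placeholder input: a text file with one document per line.
docs = spark.read.text("/path/to/wikipedia_articles.txt").withColumnRenamed("value", "text")

# Pre-processing: tokenize and remove stop words, then build term-count vectors.
tokens = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+").transform(docs)
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10_000).fit(filtered)
vectors = cv_model.transform(filtered)

# Online variational inference, the incremental algorithm described in the post.
lda = LDA(k=10, maxIter=20, optimizer="online")
model = lda.fit(vectors)

# Topics are reported as indices into the CountVectorizer vocabulary.
model.describeTopics(maxTermsPerTopic=5).show(truncate=False)
```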
  • 26.
Eight to Late: Sensemaking and Analytics for Organizations. A gentle introduction to topic modeling using R. Introduction: The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well. However, as useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is to classify these documents into topics or themes. Among other things this would help you figure out if there's anything of interest while also directing you to the relevant subset(s) of the corpus. For small collections, one could do this by simply going through each document, but this is clearly infeasible for corpuses containing thousands of documents. Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes.
  • 27.
Topic Modeling with Scikit Learn. Aneesha Bakharia (Data Science, Learning Analytics, Electronics; Brisbane, Australia), Sep 1, 2016.
  • 28.
Conclusions. LDA is part of an evolving family of topic modeling algorithms, which has advanced from tf-idf through NMF, LSI, and pLSI to LDA. LDA itself is also being elaborated, for example into supervised models, and into variants that weight words differently depending on whether they appear in the index, abstract, or bibliography.
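For a feel of that lineage, the sketch below fits two of the earlier alternatives, LSI (via truncated SVD) and NMF, on a tf-idf matrix with scikit-learn; the tiny corpus and component counts are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD

docs = [
    "stocks fell as markets reacted to the rate decision",
    "the striker scored twice in the final match",
    "researchers sequenced the genome of the bacterium",
]  # placeholder corpus

# tf-idf weighting of the term-document matrix.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

for name, model in [("LSI", TruncatedSVD(n_components=2, random_state=0)),
                    ("NMF", NMF(n_components=2, random_state=0))]:
    doc_factors = model.fit_transform(X)        # documents in the reduced space
    print(name, "document factors shape:", doc_factors.shape)
    for k, comp in enumerate(model.components_):
        top = comp.argsort()[::-1][:4]
        print(f"  component {k}:", [vocab[i] for i in top])
```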
  • 29.
Questions. The paper linked below proposes deep learning for NLP; in some intuitive sense, are we leaving information on the table? Can we use deep learning for topic modeling? How beneficial, and how practical, would it be to fold it into our NLP pipelines? "For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors. Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs." http://u.cs.biu.ac.il/~yogo/nnlp.pdf Would LDA benefit from adding 2-gram features to each document? (A small experiment along these lines is sketched below.) Google is not just doing keyword searching: can we use probabilistic topic modeling as part of a web search algorithm? In the introduction, Blei seems to suggest that we could index the web using topic models; I wonder what the practical constraints to this are. Where have you used topic modeling? Where could you imagine it being used?
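On the 2-gram question, one quick experiment is to let the vectorizer emit unigrams and bigrams before fitting LDA, as in the hedged sketch below (made-up corpus; whether the extra features help is an empirical question, since LDA treats every feature as just another exchangeable token).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the middle east conflict dominated foreign policy debate",
    "foreign policy experts discussed the middle east",
    "the home team won the championship game",
]  # placeholder corpus

# ngram_range=(1, 2) adds bigram features such as "middle east" and "foreign policy".
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```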
  • 30.
D. Blei (2012), Probabilistic Topic Models
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
D. Blei, A. Ng, M. Jordan (2003), Latent Dirichlet Allocation
http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf
J. Pritchard, M. Stephens, P. Donnelly (2000), Inference of Population Structure Using Multilocus Genotype Data
http://www.genetics.org/content/genetics/155/2/945.full.pdf
D. Blei (2017), home page at Columbia University
http://www.cs.columbia.edu/~blei/
D. Blei (2009), Topic Models, videos from the Machine Learning Summer School, Cambridge
http://videolectures.net/mlss09uk_cambridge/
D. Blei (2017), Google Scholar
https://scholar.google.com/scholar?hl=en&q=dm+blei
D. Mimno (2017), Topic Modeling Bibliography
https://mimno.infosci.cornell.edu/topics.html
D. Blei, J. Lafferty (2007), A Correlated Topic Model of Science
http://www.cs.columbia.edu/~blei/papers/BleiLafferty2007.pdf
  • 31.
D. Blei, J. Lafferty (2009), Topic Models
http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
Y. Wang (2008), Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details
https://cxwangyi.files.wordpress.com/2012/01/llt.pdf
D. Blei, A. Kucukelbir, J. McAuliffe (2016), Variational Inference: A Review for Statisticians
https://arxiv.org/abs/1601.00670
D. Blei, M. Jordan (2004), Variational Inference for Dirichlet Process Mixtures
http://www.cs.columbia.edu/~blei/papers/BleiJordan2004.pdf
M. Beaugureau (2015), Topic Modeling for the Newbie
https://www.oreilly.com/ideas/topic-modeling-for-the-newbie
F. Liang, Y. Yang, J. Bradley (2015), Large Scale Topic Modeling: Improvements to LDA on Apache Spark
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-apache-spark.html
K. Awati (2015), A gentle introduction to topic modeling using R
https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
A. Bakharia (2016), Topic Modeling with Scikit Learn
https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730