1. Modeling Science
David M. Blei
Department of Computer Science
Princeton University
April 17, 2008
Joint work with John Lafferty (CMU)
D. Blei Modeling Science 1 / 53
2. Modeling Science
Science, August 13, 1886
water acid disease
milk water blood
food solution cholera
dry experiments bacteria
fed liquid found
cows chemical bacillus
houses action experiments
butter copper organisms
fat crystals bacilli
found carbon cases
made alcohol diseases
contained made germs
wells obtained animal
produced substances koch
poisonous nitrogen made
Science, June 24, 1994
evolution rna disease
evolutionary mrna host
species site bacteria
organisms splicing diseases
biology rnas new
phylogenetic nuclear bacterial
life sequence resistance
origin introns control
diversity messenger strains
groups cleavage infectious
molecular two malaria
animals splice parasites
two sequences parasite
new polymerase tuberculosis
living intron health
• On-line archives of document collections require better
organization. Manual organization is not practical.
• Our goal: To discover the hidden thematic structure with
hierarchical probabilistic models called topic models.
• Use this structure for browsing, search, and similarity.
3. Modeling Science
• Our data are the pages of Science from 1880-2002 (from JSTOR)
• No reliable punctuation, meta-data, or references.
• Note: this is just a subset of JSTOR’s archive.
4. Discover topics from a corpus
“Genetics” “Evolution” “Disease” “Computers”
human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
information diversity malaria methods
genetics group parasite networks
mapping new parasites software
project two united new
sequences common tuberculosis simulations
5. Model the evolution of topics over time
[Figure: two panels plotting word trends over the years 1880-2000. "Theoretical Physics": FORCE, LASER, RELATIVITY. "Neuroscience": OXYGEN, NERVE, NEURON.]
6. Model connections between topics
[Figure: graph of topics with edges connecting related topics. Each node lists a topic's most probable words, spanning areas such as neuroscience, molecular biology, genetics, physics, astronomy, earth science, climate, medicine, and science policy.]
9. Probabilistic modeling
  1. Treat data as observations that arise from a generative
     probabilistic process that includes hidden variables.
     • For documents, the hidden variables reflect the thematic
       structure of the collection.
  2. Infer the hidden structure using posterior inference.
     • What are the topics that describe this collection?
  3. Situate new data into the estimated model.
     • How does this query or new document fit into the estimated
       topic structure?
10. Intuition behind LDA
Simple intuition: Documents exhibit multiple topics.
11. Generative process
• Cast these intuitions into a generative probabilistic process
• Each document is a random mixture of corpus-wide topics
• Each word is drawn from one of those topics
12. Generative process
• In reality, we only observe the documents
• Our goal is to infer the underlying topic structure
• What are the topics?
• How are the documents divided according to those topics?
13. Graphical models (Aside)
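[Figure: a graphical model in which Y is the parent of X1, X2, ..., XN, shown equivalently as Y with a single child Xn inside a plate of size N]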
• Nodes are random variables
• Edges denote possible dependence
• Observed variables are shaded
• Plates denote replicated structure
14. Graphical models (Aside)
[Figure: the same graphical model, Y with children X1, ..., XN, and its plate form]
• Structure of the graph defines the pattern of conditional
dependence between the ensemble of random variables
• E.g., this graph corresponds to
$$ p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) $$
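As a small illustration of this factorization, the sketch below evaluates the joint probability for a toy model with a binary Y and binary X's. The tables `p_y` and `p_x_given_y` are made-up numbers chosen for the example; they do not come from the talk.

```python
import itertools
import numpy as np

# Hypothetical toy tables: Y is binary and each X_n is binary.
p_y = np.array([0.6, 0.4])             # p(y)
p_x_given_y = np.array([[0.9, 0.1],    # p(x | y = 0)
                        [0.2, 0.8]])   # p(x | y = 1)

def joint(y, xs):
    """p(y, x_1, ..., x_N) = p(y) * prod_n p(x_n | y)."""
    return p_y[y] * np.prod([p_x_given_y[y, x] for x in xs])

# Summing the joint over every configuration gives one (here N = 3).
N = 3
total = sum(joint(y, xs)
            for y in (0, 1)
            for xs in itertools.product((0, 1), repeat=N))
print(joint(1, (1, 1, 0)), total)
```

The plate notation is what lets the same two tables be reused for every X_n, however large N is.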
15. Latent Dirichlet allocation
[Figure: LDA graphical model, α → θd → Zd,n → Wd,n ← βk ← η, with plates N, D, and K]
• α: Dirichlet parameter
• θd: per-document topic proportions
• Zd,n: per-word topic assignment
• Wd,n: observed word
• βk: topics
• η: topic hyperparameter
Each piece of the structure is a random variable.
16. Latent Dirichlet allocation
  1. Draw each topic $\beta_i \sim \mathrm{Dir}(\eta)$, for $i \in \{1, \ldots, K\}$.
  2. For each document:
     a. Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
     b. For each word:
        i. Draw $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$.
        ii. Draw $W_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$.
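The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `generate_lda_corpus`, the fixed document length `N`, and all default sizes and hyperparameter values are hypothetical choices, not taken from the talk.

```python
import numpy as np

def generate_lda_corpus(D=5, N=20, K=3, V=10, alpha=0.5, eta=0.5, seed=0):
    """Sample a toy corpus of D documents, N words each, from LDA."""
    rng = np.random.default_rng(seed)
    # 1. Draw each topic beta_k ~ Dir(eta): a distribution over V terms.
    beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)
    theta = np.empty((D, K))
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        # 2a. Draw topic proportions theta_d ~ Dir(alpha).
        theta[d] = rng.dirichlet(np.full(K, alpha))
        for n in range(N):
            # 2b-i. Draw the topic assignment z_{d,n} ~ Mult(theta_d).
            z[d, n] = rng.choice(K, p=theta[d])
            # 2b-ii. Draw the word w_{d,n} ~ Mult(beta_{z_{d,n}}).
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return beta, theta, z, w

beta, theta, z, w = generate_lda_corpus()
print(w[0])  # term ids of the first synthetic document
```

Only `w` would be observed in practice; `beta`, `theta`, and `z` are the hidden structure that inference tries to recover.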
17. Latent Dirichlet allocation
• From a collection of documents, infer
  • Per-word topic assignment $z_{d,n}$
  • Per-document topic proportions $\theta_d$
  • Per-corpus topic distributions $\beta_k$
• Use posterior expectations to perform the task at hand, e.g.,
information retrieval, document similarity, etc.
18. Latent Dirichlet allocation
• Computing the posterior is intractable:
$$ \frac{p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})}{\int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})} $$
• Several approximation techniques have been developed.
19. Latent Dirichlet allocation
• Mean field variational methods (Blei et al., 2001, 2003)
• Expectation propagation (Minka and Lafferty, 2002)
• Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
• Collapsed variational inference (Teh et al., 2006)
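To make one of the listed techniques concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA with symmetric priors. It is not the variational method used for the results in this talk, and the function name, the hyperparameter defaults, and the three tiny documents are all assumptions made for illustration.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampler for LDA. docs is a list of lists of term ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))  # words in document d assigned to topic k
    nkv = np.zeros((K, V))          # times term v is assigned to topic k
    nk = np.zeros(K)                # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]

    def update_counts(d, k, v, delta):
        ndk[d, k] += delta
        nkv[k, v] += delta
        nk[k] += delta

    # Initialize the count tables from the random starting assignments.
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            update_counts(d, z[d][n], v, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                # Remove the current assignment, then resample it from
                # p(z_dn = k | all other assignments, all words).
                update_counts(d, z[d][n], v, -1)
                p = (ndk[d] + alpha) * (nkv[:, v] + eta) / (nk + V * eta)
                z[d][n] = rng.choice(K, p=p / p.sum())
                update_counts(d, z[d][n], v, +1)

    # Point estimates from the final state of the sampler.
    beta = (nkv + eta) / (nkv + eta).sum(axis=1, keepdims=True)
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return beta, theta

docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4, 2]]
beta, theta = collapsed_gibbs_lda(docs, K=2, V=6)
```

"Collapsed" refers to integrating out θ and β analytically, so the sampler only moves the topic assignments z.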
20. Example inference
• Data: The OCR’ed collection of Science from 1990–2000
• 17K documents
• 11M words
• 20K unique terms (stop words and rare words removed)
• Model: 100-topic LDA model using variational inference.
21. Example inference
[Figure: inferred probability of each topic. x-axis: Topics (1 to 100); y-axis: Probability (0.0 to 0.4).]
22. Example topics
“Genetics” “Evolution” “Disease” “Computers”
human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
information diversity malaria methods
genetics group parasite networks
mapping new parasites software
project two united new
sequences common tuberculosis simulations
23. LDA summary
• LDA is a powerful model for
• Visualizing the hidden thematic structure in large corpora
• Generalizing new data to fit into that structure
• LDA is a mixed membership model (Erosheva, 2004) that builds
on the work of Deerwester et al. (1990) and Hofmann (1999).
• For document collections and other grouped data, this might be
more appropriate than a simple finite mixture
24. LDA summary
• Modular: It can be embedded in more complicated models.
  • E.g., syntax and semantics; authorship; word sense
• General: The data generating distribution can be changed.
  • E.g., images; social networks; population genetics data
• Variational inference is fast; it lets us analyze large data sets.
  • See Blei et al., 2003 for details and a quantitative comparison.
• Code to play with LDA is freely available on my web-site,
http://www.cs.princeton.edu/~blei.
25. LDA summary
• But, LDA makes certain assumptions about the data.
• When are they appropriate?
27. LDA and exchangeability
• LDA assumes that documents are exchangeable.
• I.e., their joint probability is invariant to permutation.
• This is too restrictive.
28. Documents are not exchangeable
"Instantaneous Photography" (1890)
"Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977)
• Documents about the same topic are not exchangeable.
• Topics evolve over time.
29. Dynamic topic model
• Divide corpus into sequential slices (e.g., by year).
• Assume each slice’s documents are exchangeable.
  • Drawn from an LDA model.
• Allow topic distributions to evolve from slice to slice.
30. Dynamic topic models
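[Figure: graphical model of the dynamic topic model. Each time slice has its own LDA model (α, θd, Zd,n, Wd,n, with plates N and D), and the topics βk,1 → βk,2 → ... → βk,T are chained across slices inside a plate of size K.]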
31. Modeling evolving topics
[Figure: chain of topic parameters βk,1 → βk,2 → ... → βk,T]
• Use a logistic normal distribution to model evolving topics
(Aitchison, 1980)
• A state-space model on the natural parameter of the topic
multinomial (West and Harrison, 1997)
$$ \beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I) $$
$$ p(w \mid \beta_{t,k}) = \exp\Big\{ \beta_{t,k} - \log\Big(1 + \sum_{v=1}^{V-1} \exp\{\beta_{t,k,v}\}\Big) \Big\} $$
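The following sketch simulates one topic drifting under this state-space model and maps its natural parameters to word probabilities. It assumes the usual convention that the V-th term's parameter is fixed at zero; the function names, the starting point, and the values of `T`, `V`, and `sigma` are hypothetical.

```python
import numpy as np

def word_probs(beta):
    """Map V-1 natural parameters to a distribution over V terms.

    The last term's natural parameter is fixed at zero (an assumed
    convention that matches the 1 + sum form of the normalizer).
    """
    full = np.append(beta, 0.0)
    return np.exp(full - np.log(1.0 + np.exp(beta).sum()))

def simulate_topic(T=5, V=6, sigma=0.5, seed=0):
    """Random-walk a topic's natural parameters over T time slices."""
    rng = np.random.default_rng(seed)
    beta = np.zeros((T, V - 1))
    beta[0] = rng.normal(size=V - 1)   # arbitrary starting point
    for t in range(1, T):
        # beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I)
        beta[t] = rng.normal(loc=beta[t - 1], scale=sigma)
    return np.array([word_probs(b) for b in beta])

probs = simulate_topic()   # row t is the topic's word distribution at slice t
```

A smaller `sigma` makes consecutive rows of `probs` more similar, which is how the model encodes smooth evolution of a topic.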
32. Posterior inference
• Our goal is to compute the posterior distribution,
$$ p(\beta_{1:T,1:K}, \theta_{1:T,1:D}, z_{1:T,1:D} \mid w_{1:T,1:D}). $$
• Exact inference is impossible
  • Per-document mixed-membership model
  • Non-conjugacy between $p(w \mid \beta_{t,k})$ and $p(\beta_{t,k})$
• MCMC is not practical for the amount of data.
• Solution: Variational inference
33. Science data
[Example article]
TECHVIEW: DNA SEQUENCING
Sequencing the Genome, Fast
James C. Mullikin and Amanda A. McMurray
Genome sequencing projects reveal the genetic makeup of an organism by reading off the sequence of the DNA bases, which encodes all of the information necessary for the life of the organism. The base sequence contains four nucleotides - adenine, thymidine, guanosine, and cytosine - which are linked together into long double-helical chains. Over the last two decades, automated DNA sequencers have made the process of obtaining the base-by-base sequence of DNA...
• Analyze JSTOR’s entire collection from Science (1880-2002)
• Restrict to 30K terms that occur more than ten times
• The data are 76M words in 130K documents
34. Analyzing a document
[Figure: the original article shown alongside its inferred topic proportions]
35. Analyzing a document
Original article Most likely words from top topics
sequence devices data
genome device information
genes materials network
sequences current web
human high computer
gene gate language
dna light networks
sequencing silicon time
chromosome material software
regions technology system
analysis electrical words
data fiber algorithm
genomic power number
number based internet
36. Analyzing a topic
1880 1890 1900 1910 1920 1930 1940
electric electric apparatus air apparatus tube air
machine power steam water tube apparatus tube
power company power engineering air glass apparatus
engine steam engine apparatus pressure air glass
steam electrical engineering room water mercury laboratory
two machine water laboratory glass laboratory rubber
machines two construction engineer gas pressure pressure
iron system engineer made made made small
battery motor room gas laboratory gas mercury
wire engine feet tube mercury small gas
1950 1960 1970 1980 1990 2000
tube tube air high materials devices
apparatus system heat power high device
glass temperature power design power materials
air air system heat current current
chamber heat temperature system applications gate
instrument chamber chamber systems technology high
small power high devices devices light
laboratory high flow instruments design silicon
pressure instrument tube control device material
rubber control design large heat technology
37. Visualizing trends within a topic
[Figure: two panels plotting word trends over the years 1880-2000. "Theoretical Physics": FORCE, LASER, RELATIVITY. "Neuroscience": OXYGEN, NERVE, NEURON.]
38. Time-corrected document similarity
• Consider the expected Hellinger distance between the topic
proportions of two documents,
$$ d_{ij} = \mathrm{E}\Big[ \sum_{k=1}^{K} \big( \sqrt{\theta_{i,k}} - \sqrt{\theta_{j,k}} \big)^2 \,\Big|\, w_i, w_j \Big] $$
• Uses the latent structure to define similarity
• Time has been factored out because the topics associated with the
components are different from year to year.
• Similarity based only on topic proportions
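A minimal sketch of the quantity inside the expectation is shown below. It plugs in point estimates of the topic proportions rather than taking the posterior expectation used on the slide; averaging the function over posterior samples of θ would be one way to approximate that expectation. The function name and the two example vectors are assumptions.

```python
import numpy as np

def hellinger_sq(theta_i, theta_j):
    """Sum over topics of (sqrt(theta_{i,k}) - sqrt(theta_{j,k}))^2."""
    theta_i = np.asarray(theta_i, dtype=float)
    theta_j = np.asarray(theta_j, dtype=float)
    return np.sum((np.sqrt(theta_i) - np.sqrt(theta_j)) ** 2)

# Hypothetical topic proportions for two documents over K = 3 topics.
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
print(hellinger_sq(a, a), hellinger_sq(a, b))
```

Because the inputs are topic proportions rather than word counts, two documents from different decades can be close even if they share few words.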
40. Time-corrected document similarity
Representation of the Visual Field on the Medial Wall of
Occipital-Parietal Cortex in the Owl Monkey (1976)
42. Quantitative comparison
• Compute the probability of each year’s documents conditional on
all the previous years’ documents,
$$ p(w_t \mid w_1, \ldots, w_{t-1}) $$
• Compare exchangeable and dynamic topic models
45. The hidden assumptions of the Dirichlet distribution
• The Dirichlet is an exponential family distribution on the simplex,
positive vectors that sum to one.
• However, the near independence of components makes it a poor
choice for modeling topic proportions.
• An article about fossil fuels is more likely to also be about
geology than about genetics.
46. The logistic normal distribution
• The logistic normal is a distribution on the simplex that can
model dependence between components.
• The natural parameters of the multinomial are drawn from a
multivariate Gaussian distribution.
$$ X \sim \mathcal{N}_{K-1}(\mu, \Sigma) $$
$$ \theta_i = \exp\Big\{ x_i - \log\Big(1 + \sum_{j=1}^{K-1} \exp\{x_j\}\Big) \Big\} $$
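The two equations above translate directly into a sampler. In this sketch the function name, the choice K = 3, and the values of `mu` and `Sigma` are illustrative assumptions; the K-th component is obtained from the implicit natural parameter of zero.

```python
import numpy as np

def sample_logistic_normal(mu, Sigma, size, seed=0):
    """Draw points on the K-simplex from a logistic normal.

    mu has length K-1 and Sigma is (K-1) x (K-1).
    """
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu, Sigma, size=size)        # (size, K-1)
    log_norm = np.log1p(np.exp(x).sum(axis=1, keepdims=True))
    head = np.exp(x - log_norm)    # theta_1, ..., theta_{K-1}
    last = np.exp(-log_norm)       # theta_K, from natural parameter 0
    return np.hstack([head, last])

# Hypothetical parameters: the first two components share a strong
# positive covariance in the Gaussian.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
theta = sample_logistic_normal(mu, Sigma, size=2000)
```

Calling `np.corrcoef(theta[:, 0], theta[:, 1])` on the samples is one way to inspect how the off-diagonal entry of `Sigma` carries over to the proportions, which is the kind of dependence a Dirichlet cannot express.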
47. Correlated topic model (CTM)
[Figure: CTM graphical model, μ, Σ → ηd → Zd,n → Wd,n ← βk, with plates N, D, and K]
• Draw topic proportions from a logistic normal, where topic
occurrences can exhibit correlation.
• Use for:
  • Providing a “map” of topics and how they are related
  • Better prediction via correlated topics
48. [Figure: map of topics and how they are related. Each node lists a topic's most probable words, spanning areas such as neuroscience, molecular biology, genetics, physics, astronomy, earth science, climate, medicine, and science policy.]
49. Summary
• Topic models provide useful descriptive statistics for analyzing
and understanding the latent structure of large text collections.
• Probabilistic graphical models are a useful way to express
assumptions about the hidden structure of complicated data.
• Variational methods allow us to perform posterior inference to
automatically infer that structure from large data sets.
• Current research
• Choosing the number of topics
• Continuous time dynamic topic models
• Topic models for prediction
• Inferring the impact of a document
50. “We should seek out unfamiliar summaries of observational material,
and establish their useful properties... And still more novelty can
come from finding, and evading, still deeper lying constraints.”
(John Tukey, The Future of Data Analysis, 1962)
51. Supervised topic models (with Jon McAuliffe)
• Most topic models are unsupervised. They are fit by maximizing
the likelihood of a collection of documents.
• Consider documents paired with response variables.
For example:
• Movie reviews paired with a number of stars
• Web pages paired with a number of “diggs”
• We develop supervised topic models, models of documents and
responses that are fit to find topics predictive of the response.
52. Supervised LDA
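[Figure: sLDA graphical model, α → θd → Zd,n → Wd,n ← βk, with the response Yd depending on the Zd,n and on η, σ²; plates N, D, and K]
  1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
  2. For each word:
     a. Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
     b. Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$.
  3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where
     $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.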
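The sLDA steps for a single document can be sketched as follows. The function name, the two hand-written topics in `beta`, and the coefficient values are made up for illustration; each z_n is treated as an indicator vector, so the average of the z_n is the vector of empirical topic frequencies.

```python
import numpy as np

def generate_slda_document(beta, alpha, eta, sigma, N, seed=0):
    """Sample one document and its response from the sLDA process."""
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                # 1. topic proportions
    z = rng.choice(K, size=N, p=theta)          # 2a. topic assignments
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # 2b. words
    z_bar = np.bincount(z, minlength=K) / N     # empirical topic frequencies
    y = rng.normal(loc=eta @ z_bar, scale=sigma)           # 3. response
    return w, z_bar, y

# Hypothetical setup: two topics over four terms, where topic 0 pulls
# the response down and topic 1 pushes it up.
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])
w, z_bar, y = generate_slda_document(
    beta, alpha=np.array([1.0, 1.0]),
    eta=np.array([-2.0, 2.0]), sigma=0.1, N=50)
```

Note that the response depends on the realized assignments `z`, not on `theta` directly, which is the ordering the generative process prescribes.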
53. Comments
• sLDA is used as follows.
  • Fit coefficients and topics from a collection of
    document-response pairs.
  • Use the fitted model to predict the responses of previously
    unseen documents,
$$ \mathrm{E}[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \sigma^2] = \eta^\top \mathrm{E}[\bar{Z} \mid w_{1:N}, \alpha, \beta_{1:K}]. $$
• The process enforces that the document is generated first,
followed by the response. The response is generated from the
particular topics that were realized in generating the document.
54. Example: Movie reviews
[Figure: the ten topics, each shown by its most probable words, placed along an axis by their estimated coefficients (axis range −30 to 20)]
• We fit a 10-topic sLDA model to movie review data (Pang and
Lee, 2005).
  • The documents are the words of the reviews.
  • The responses are the number of stars associated with
    each review (modeled as continuous).
• Each component of coefficient vector η is associated with a topic.
55. Simulations
[Figure: predictive R² and per-word held-out log likelihood versus number of topics, for sLDA and LDA. Panels: movie corpus (5 to 50 topics) and Digg corpus (2 to 30 topics).]
56. Diversion: Variational inference
• Let $x_{1:N}$ be observations and $z_{1:M}$ be latent variables
• Our goal is to compute the posterior distribution
$$ p(z_{1:M} \mid x_{1:N}) = \frac{p(z_{1:M}, x_{1:N})}{\int p(z_{1:M}, x_{1:N}) \, dz_{1:M}} $$
• For many interesting distributions, the marginal likelihood of the
observations is difficult to compute efficiently
57. Variational inference
• Use Jensen’s inequality to bound the log probability of the
observations:
$$ \log p(x_{1:N}) \geq \mathrm{E}_{q_\nu}[\log p(z_{1:M}, x_{1:N})] - \mathrm{E}_{q_\nu}[\log q_\nu(z_{1:M})]. $$
• We have introduced a distribution over the latent variables with free
variational parameters $\nu$.
• We optimize those parameters to tighten this bound.
• This is the same as finding the member of the family $q_\nu$ that is
closest in KL divergence to $p(z_{1:M} \mid x_{1:N})$.
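The bound and its connection to KL divergence can be checked numerically on a toy problem. The sketch below assumes a single discrete latent variable with three states and a made-up joint table `p_joint` for one fixed observation; none of these numbers come from the talk.

```python
import numpy as np

# Hypothetical joint p(z, x) for one fixed observation x and a latent z
# with three states.
p_joint = np.array([0.10, 0.25, 0.05])
log_evidence = np.log(p_joint.sum())     # log p(x)
posterior = p_joint / p_joint.sum()      # p(z | x)

def elbo(q):
    """E_q[log p(z, x)] - E_q[log q(z)] for a discrete q."""
    q = np.asarray(q, dtype=float)
    return np.sum(q * (np.log(p_joint) - np.log(q)))

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.4, 0.4, 0.2])            # an arbitrary variational guess
gap = log_evidence - elbo(q)             # slack in the bound
print(gap, kl(q, posterior))
```

The slack in the bound equals the KL divergence from q to the posterior, so tightening the bound and minimizing that KL are the same optimization; the bound is met with equality when q is the exact posterior.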
58. Mean-field variational inference
• Complexity of optimization is determined by factorization of $q_\nu$
• In mean field variational inference $q_\nu$ is fully factored
$$ q_\nu(z_{1:M}) = \prod_{m=1}^{M} q_{\nu_m}(z_m). $$
  • Under $q_\nu$, the latent variables are independent.
  • Each is governed by its own variational parameter $\nu_m$.
• In the true posterior they can exhibit dependence
(often, this is what makes exact inference difficult).
59. MFVI and conditional exponential families
• Suppose the distribution of each latent variable conditional on
the observations and other latent variables is in the exponential
family:
$$ p(z_m \mid z_{-m}, x) = h_m(z_m) \exp\{ g_m(z_{-m}, x)^\top z_m - a_m(g_m(z_{-m}, x)) \} $$
• Assume $q_\nu$ is fully factorized and each factor is in the same
exponential family:
$$ q_{\nu_m}(z_m) = h_m(z_m) \exp\{ \nu_m^\top z_m - a_m(\nu_m) \} $$
60. MFVI and conditional exponential families
• Variational inference is the following coordinate ascent algorithm
$$ \nu_m = \mathrm{E}_{q_\nu}[g_m(Z_{-m}, x)] $$
• Notice the relationship to Gibbs sampling
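For discrete latent variables the update has a simple form: the natural parameter of each conditional is the log joint up to a constant, so each factor is set proportional to the exponential of the expected log joint under the other factor. The sketch below runs this for two discrete latents; the function name and the joint table `p` are assumptions made for illustration.

```python
import numpy as np

def cavi_two_variables(log_joint, iters=50):
    """Mean-field coordinate ascent for two discrete latent variables.

    log_joint[a, b] = log p(z1 = a, z2 = b, x) for a fixed observation x.
    Returns the two factors and the bound after each sweep.
    """
    A, B = log_joint.shape
    q1 = np.full(A, 1.0 / A)
    q2 = np.full(B, 1.0 / B)
    elbos = []
    for _ in range(iters):
        # nu_1 = E_{q2}[log p(z1, Z2, x)]; q1 is proportional to exp(nu_1).
        nu1 = log_joint @ q2
        q1 = np.exp(nu1 - nu1.max())
        q1 /= q1.sum()
        # nu_2 = E_{q1}[log p(Z1, z2, x)]; q2 is proportional to exp(nu_2).
        nu2 = q1 @ log_joint
        q2 = np.exp(nu2 - nu2.max())
        q2 /= q2.sum()
        qq = np.outer(q1, q2)
        elbos.append(np.sum(qq * (log_joint - np.log(qq))))
    return q1, q2, elbos

# Hypothetical joint table in which z1 and z2 tend to agree.
p = np.array([[0.30, 0.05],
              [0.05, 0.20]])
q1, q2, elbos = cavi_two_variables(np.log(p))
```

Each update uses an expectation of the same conditional that a Gibbs sampler would draw from, which is the relationship the slide points to.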
61. Variational family for the DTM
[Figure: chain βk,1 → βk,2 → ... → βk,T, with a variational observation attached to each βk,t]
• Distribution of $\theta$ and $z$ is fully-factorized (Blei et al., 2003)
• Distribution of $\{\beta_{1,k}, \ldots, \beta_{T,k}\}$ is a variational Kalman filter
  • Gaussian state-space model with free observations $\hat{\beta}_{k,t}$.
  • Fit observations such that the corresponding posterior over the
    chain is close to the true posterior.
62. Variational family for the DTM
• Given a document collection, use coordinate ascent on all the
variational parameters until the KL divergence converges.
• Yields a distribution close to the true posterior of interest
• Take expectations with respect to the simpler variational distribution