Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives
Pradipto Das, Rohini Srihari and Yun Fu
SUNY Buffalo
CIKM 2011, Glasgow, Scotland
Ubiquitous Bi-Perspective Document Structure
Words indicative of important Wiki concepts
Actual human-generated Wiki category tags – words that summarize/categorize the document
Wikipedia
Ubiquitous Bi-Perspective Document Structure
Words indicative of questions
Words indicative of answers
Actual tags for the forum post – even frequencies are given!
StackOverflow
Ubiquitous Bi-Perspective Document Structure
Words indicative of document title
Words indicative of image description
Actual tags given by users
Yahoo! Flickr
Understanding the Two Perspectives
What if the documents are plain text files?
News Article
Understanding the Two Perspectives
Imagine browsing over reports in a topic cluster
It is believed US investigators have asked
for, but have been so far refused access to,
evidence accumulated by German
prosecutors probing allegations that former
GM director, Mr. Lopez, stole industrial
secrets from the US group and took them
with him when he joined VW last year.
This investigation was launched by US
President Bill Clinton and is in principle a far
more simple or at least more single-minded
pursuit than that of Ms. Holland.
Dorothea Holland, until four months ago
was the only prosecuting lawyer on the
German case.
News Article
Understanding the Two Perspectives
What words can we remember after a first browse?
[The same News Article text, re-displayed with the memorable words highlighted]
The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
Understanding the Two Perspectives
What helped us generate the Document Level perspective?
The “word level” perspective:
– Named Entities: PERSON, LOCATION, ORGANIZATION, MISC
– Important verbs and their dependents (WHAT HAPPENED?)
The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
[The same News Article text, annotated with these word-level tags]
What if we turn the document off?
Summarization power of the perspectives
[The News Article text faded out; only the annotations remain]
The “document level” perspective: German, US, investigations, GM, Dorothea Holland, Lopez, prosecute
Sentence Boundaries
Hypothesis
• Documents are tagged from at least two different perspectives – either implicit or explicit – and one perspective affects the other
– Simplest example of implicit WL tagging – binned positions indicating sections (see the sketch below)
– Simplest example of implicit DL tagging – a tag cloud (tagcrowd.com)
Begin (0): It is believed US investigators have asked for, but have been so far refused access to, evidence accumulated by German prosecutors probing allegations that former GM director, Mr. Lopez, stole industrial secrets from the US group and took them with him when he joined VW last year.
Middle (1): This investigation was launched by US President Bill Clinton and is in principle a far more simple or at least more single-minded pursuit than that of Ms. Holland.
End (2): Dorothea Holland, until four months ago was the only prosecuting lawyer on the German case.
The “word level” (WL) tags are usually some category descriptions
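As a concrete illustration, here is a minimal Python sketch (mine, not from the paper) of deriving implicit WL tags by binning token positions into thirds:

```python
# A minimal sketch (not the paper's code): implicit WL tags from
# binned token positions -- Begin (0), Middle (1), End (2).
def position_bins(tokens, n_bins=3):
    """Tag each token with the positional bin it falls into."""
    n = len(tokens)
    return [(tok, min(i * n_bins // n, n_bins - 1)) for i, tok in enumerate(tokens)]

doc = "It is believed US investigators have asked for evidence".split()
print(position_bins(doc))  # first third -> 0, middle third -> 1, last third -> 2
```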
How can bi-level perspective be exploited?
Can we generate category labels for Wikipedia documents by looking at image captions?
Can we use images to label latent topics?
Can we build a topic model that incorporates both perspectives simultaneously?
– Choice of document level tags and its impact on performance
Can supervised and unsupervised generative models work together?
Example – A Wikipedia Article on “fog”
[Screenshot of the article with section positions binned as 0, 1, 2]
Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics (labels by human editors)
The Wikipedia Article on “fog”
Take the first category label – “weather hazards to aircraft”
– “aircraft” doesn’t occur in the document body!
– “hazard” only appears in a section label that reads “Visibility hazards”
– “weather” appears only 6 out of 15 times in the main body
However, if we look at the images, the concept of fog seems related to concepts like fog over the Golden Gate Bridge, fog in streets, poor visibility and air quality
Categories: fog, San Francisco, visible, high, temperature, streets, Bay, lake, California, bridge, air (labels by model from title and image captions)
Categories: Weather hazards to aircraft | Accidents involving fog | Snow or ice weather phenomena | Fog | Psychrometrics (labels by human editors)
The Family of Tag-Topic Models
• TagLDA: An occurrence of a word depends on how much of it is explained by a topic k and a WL tag t
Intuitively (train/sample diagram with large and small balls):
– LDA’s learnt “purple” topic can generate all 4 large balls with high probability
– TagLDA learns the “purple” topic better based on a constraint – it will generate a mix of large and small balls with high probability
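A minimal sketch of this faceted word distribution, assuming (as a simplification of TagLDA’s parameterization, with my own names and shapes) that a word’s probability is a normalized product of a topic factor and a WL-tag factor:

```python
import numpy as np

# Illustrative sketch (my notation): under TagLDA-style faceting, the same
# topic looks different under different WL tags because the word
# distribution is a normalized product of topic and tag factors.
def faceted_word_dist(beta_k, tau_t):
    """p(w | topic k, WL tag t) proportional to beta_k[w] * tau_t[w]."""
    joint = beta_k * tau_t
    return joint / joint.sum()

rng = np.random.default_rng(0)
V = 5                                  # toy vocabulary
beta_k = rng.dirichlet(np.ones(V))     # one topic's word weights
tau_t = rng.dirichlet(np.ones(V))      # one WL tag's word weights
print(faceted_word_dist(beta_k, tau_t))
```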
Faceted Bi-Perspective Document Organization
Topics conditioned on different section identifiers (WL tag categories), with topic marginals
Correspondence of DL tag words with content
Topics over image caption words
Topic Labeling
The Family of Tag-Topic Models
MMLDA | METag2LDA (combines TagLDA and MMLDA) | TagLDA | CorrMMLDA | CorrMETag2LDA (combines TagLDA and CorrMMLDA)
MM = Multinomial + Multinomial; ME = Multinomial + Exponential
The Family of Tag-Topic Models
• METag2LDA: A topic generating all DL tags in a document doesn’t necessarily mean that the same topic generates all words in the document
• CorrMETag2LDA: A topic generating *all* DL tags in a document does mean that the same topic generates all words in the document – a considerable strength
[Plate diagrams for METag2LDA and CorrMETag2LDA, with:]
– Topic concentration parameter
– Document specific topic proportions
– Indicator variables
– Document content words
– Document Level (DL) tags
– Word Level (WL) tags
– Topic parameters
– Tag parameters
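A hedged sketch of the correspondence assumption in Corr-LDA-style models (illustrative only; the paper’s exact generative story and notation may differ): each DL tag reuses the topic of a randomly chosen content-word position, so DL tags can only be emitted by topics that actually generated words in the document.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged sketch of a Corr-LDA-style correspondence (not the authors' model
# verbatim): DL tags are tied to topics already used by content words.
def generate_doc(theta, beta, pi, n_words=8, n_tags=3):
    """theta: doc topic proportions; beta: K x V_word; pi: K x V_tag."""
    z = rng.choice(len(theta), size=n_words, p=theta)       # topic per word
    words = [rng.choice(beta.shape[1], p=beta[k]) for k in z]
    y = rng.choice(n_words, size=n_tags)                    # correspondence
    tags = [rng.choice(pi.shape[1], p=pi[z[j]]) for j in y]
    return words, tags

K, V_w, V_t = 3, 10, 6
theta = rng.dirichlet(np.ones(K))
beta = rng.dirichlet(np.ones(V_w), size=K)
pi = rng.dirichlet(np.ones(V_t), size=K)
print(generate_doc(theta, beta, pi))
```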
Experiments
Wikipedia articles with images and captions, manually collected along the concepts {food, animal, countries, sport, war, transportation, nature, weapon, universe, ethnic groups}
Tags used:
– DL tags – image caption words and the article titles
– WL tags – positions of sections binned into 5 bins
Objective: generate category labels for test documents
Evaluation:
– Perplexity: to compare performance among the various tag-topic models (a sketch of the computation follows below)
– WordNet-based similarity between actual category labels and model output
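For reference, held-out perplexity is the standard exponential of the negative average per-word log-likelihood (lower is better); a minimal sketch, not the authors’ evaluation code:

```python
import numpy as np

# Standard definition: perplexity = exp(-total log-likelihood / total tokens).
def perplexity(doc_log_liks, doc_token_counts):
    return np.exp(-np.sum(doc_log_liks) / np.sum(doc_token_counts))
```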
Evaluations – Held-out Perplexity
[Bar chart: held-out perplexity (in millions) at K = 20, 50, 100, 200 for MMLDA, TagLDA, corrLDA, METag2LDA and corrMETag2LDA]
Selected Wikipedia articles
WL tag categories – section positions in the document
DL tags – image caption words and article titles
TagLDA perplexity is comparable to MM(METag2)LDA: the image caption words plus article titles and the content words are independently discriminative enough
CorrMM(METag2)LDA performs best, since almost all image caption words and the article title of a Wikipedia document are about a specific topic, and the correspondence assumption is accepted by the model with much higher confidence
Evaluations – Application End-Goals
[Bar chart: WordNet-based distance at K = 20, 50, 100, 200 for METag2LDA and corrMETag2LDA, each with AverageDistance and BestDistance]
Inverse hop distance in the WordNet ontology; the top 5 words from the caption vocabulary are chosen (see the sketch below)
Max Weighted Average = 5, Max Best = 1
METag2LDA almost always wins, by narrow margins
METag2LDA reweights the vocabulary of caption words and article titles that are about a topic, and hence may miss specializations relevant to the document within the top 5 words
In the WordNet ontology, specializations lead to more hop distance
Ontology-based scoring helps explain connections of caption words to ground truths, e.g. skateboard – skate – glide – snowboard
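NLTK’s WordNet interface exposes this kind of inverse-hop score via `path_similarity`, which is 1 / (1 + shortest hop path); a minimal sketch of scoring a predicted caption word against a ground-truth label (function name is mine):

```python
from nltk.corpus import wordnet as wn  # needs the NLTK WordNet data installed

# Sketch of the inverse-hop idea: more hops between concepts in the
# WordNet hierarchy yield a lower path_similarity score.
def best_inverse_hop(predicted, truth):
    scores = [s1.path_similarity(s2) or 0.0   # None for cross-POS pairs
              for s1 in wn.synsets(predicted)
              for s2 in wn.synsets(truth)]
    return max(scores, default=0.0)

print(best_inverse_hop("skateboard", "snowboard"))
```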
Evaluations – Held-out Perplexity
[Two line charts: held-out perplexity (in millions) at K = 40, 60, 80, 100 for MMLDA, METag2LDA, corrLDA and corrMETag2LDA, and (right panel) the same models plus TagLDA]
DUC05 newswire dataset (recent experiments with TagLDA included)
WL tag categories – named entities
DL tags – abstract coherence markers like (“subj”, “obj”), e.g. “Mary/Subj taught the class. Everybody liked Mary/Obj.” [Coreference resolution ignored]
Abstract markers like (“subj”, “obj”) acting as the DL perspective are not document-discriminative markers; rather, they indicate a semantic perspective of coherence that is intricately linked to words
Topics are influenced both by non-sparse document level coherence indicators like (“subj” “obj”, “subj” “--”, etc.) AND by document level co-occurrence
Ignoring the DL perspective completely leads to a better fit by TagLDA, due to variations in word distributions only
Evaluations – Application End-Goals
[Bar chart: PERSON named entity coverage (DUC05 data) at K = 40, 60, 80, 100 for METag2LDA and CorrMETag2LDA]
Person Named Entity coverage (DUC05 data)
Two PERSON NEs in the same docset i.e., manual topic set are related (G in total)
A_B, A, B are treated as separate PERSON NEs
For each docset in DUC05 data
Create a set of best topics for a docset and pull out top PER NE pairs from the PER NE
facets
Find how many matched over all documents in a docset (M in total)
Win over baseline = M/G (averaged over all docsets)
CorrMETag2LDA wins here because of the nature of DL perspective (Role transitions like
“SubjObj” coherence markers)
More topics are pulled out that group more PER NEs across documents (Recall )
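A hypothetical sketch of this M/G coverage score (all names are illustrative; the paper’s implementation is not shown):

```python
from itertools import combinations

# gold_pairs: the G related PERSON-NE pairs of a docset, as frozensets;
# facets: one ranked PERSON-NE list per selected topic.
def docset_coverage(gold_pairs, facets, top_n=10):
    found = set()
    for facet in facets:
        for a, b in combinations(facet[:top_n], 2):
            found.add(frozenset((a, b)))
    matched = gold_pairs & found          # the M recovered pairs
    return len(matched) / len(gold_pairs) if gold_pairs else 0.0
```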
Model Usefulness and Applications
• Applications
– Document classification using reduced dimensions
– Find faceted topics automatically through word level tags
– Learn correspondences between perspectives
– Label topics through document level multimedia
– Create recommendations based on perspectives
– Video analysis: word prediction given video features
– Tying “multilingual comparable corpora” through topics
– Multi-document summarization using coherence
– E-Textbook aided discussion forum mining:
• Explore topics through the lens of students and teachers
• Label topics from posts through concepts in the e-textbook
Summary
• Flexible family of topic models that integrate a partitioned space of DL tags and words with WL tag categories
– Supervised models can collaborate with unsupervised generative models, i.e. the supervised models can be improved independently
• Captioned multimedia objects like images, video and audio can provide intuitive latent space labeling – a picture is worth a thousand words
• Obtain “facets” in topics
• As always, held-out perplexity should not be the sole judge of end-task performance
Thanks!
Special thanks to Jordan Boyd-Graber for useful
discussions on TagLDA parameter regularizations
Editor's Notes
Hyperlinked text in the body represents word level tags. Categories represent document level tags.
Word level tags: question/answer. Doc level tags: actual tags for the forum post.
Word level tags: title, image description. Doc level tags: tags given by users.
Document about investigations. We don’t have annotations, but let’s see how they can be built up!
Words to the right are relevant to the topic of the document set – mostly by frequency.
Since documents are mostly about some events, certain words strike us – NEs mentioned frequently and across sentences; dependencies between subjects and objects of the important verbs from the document set.
The word and doc level tagged words alone are sufficient to summarize the document as bags of words.
I don’t think we need this slide. I should explain these points while showing the previous slide!
Cons: collocations need to be addressed; chains don’t involve causality, e.g. (fogs & accidents, [hop length = 12]).
Within the family of (corr)MM(E)(Tag2)LDAs modeling joint observations, corrMETag2LDA performs best.