Diversified Social Media Retrieval for News
Stories.
Bryan Hang ZHANG
Feb. 25th 2016
Master Thesis Colloquium
Department of Computational Linguistics
Supervisors:
Dr. Vinay SETTY
Prof. Dr. Günter NEUMANN
Outline
• Motivation
• Related Work
• Solution
• Experiment Evaluation
• Conclusion
• Acknowledgement
Motivation
• Social media data is generated by users constantly.
• Twitter
• Blogs
• Forums (Quora, WebMD, …)
• Comments (Reddit, Instagram, YouTube, …)
Motivation
query: news story
[Figure: retrieved threads at Rank 1, Rank 2, Rank 3, …, Rank K, each with a tree of linked comments]
When a news story summary (from Wikinews) is used to retrieve relevant information from the Reddit comments data, the following are retrieved:
• threads (relevant to the news summary)
• their linked comments (written by users)
Motivation
Tree-Structured Comments
Motivation
query (news story): December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
Retrieved threads include:
• Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties
• Most Americans Support Renewed U.S.-Cuba Relations
• Obama announces historic overhaul of relations; Cuba releases American
• Raul Castro: US Must Return Guantanamo for Normal Relations
2,691 comments are linked to the top 10 threads (Okapi BM-25 ranking).
Motivation
News story → pseudo search result (thread + linked comments) → diversified search result (a concise, diverse result list)
Data: Reddit
Subreddits (categories): Politics / World News
• The goal is to reduce the redundancy in the pseudo search result retrieved from Reddit comments for news stories and to create a concise, diversified search result.
Related Work
• Research on reflecting the ambiguity of a query in the retrieved results and reducing redundancy:
Implicit diversification methods reduce redundancy based on the dissimilarity of document content:
• Maximal Marginal Relevance (MMR) [4]
• BIR [6]
Explicit diversification methods explicitly model the aspects (topics, categories) of a query and consider which query aspects individual documents relate to:
• IA-Diversity [1] (user intention)
• xQuAD [2] (query reformulation)
• PM [3, 5] (proportional representation covering the query aspects)
Related Work
• Research focusing on summarising social media data due to its large volume:
• Sumblr: continuous summarization of evolving tweet streams. L. Shou, Z. Wang, K. Chen, and G. Chen. In SIGIR, 2013.
• Hierarchical multi-label classification of social text streams. Z. Ren, M.-H. Peetz, S. Liang, W. van Dolen, and M. de Rijke. In SIGIR, 2014.
• Summarizing web forum threads based on a latent topic propagation process. Z. Ren, J. Ma, S. Wang, and Y. Liu. In CIKM, 2011.
• Topic sentiment mixture: modeling facets and opinions in weblogs. Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. In WWW, 2007.
• Entity-centric topic-oriented opinion summarization in twitter. X. Meng, F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang. In KDD, 2012.
Related Work
No retrieval-diversification work has been done on unedited, coherent, short, tree-structured comments.
Solution Overview
Pipeline: comments retrieval → text pre-processing → topic modelling → diversification
Solution Overview
Pipeline: Elasticsearch system → text pre-processing → topic modelling → diversification
• Retrieval: top k threads and their linked comments, text-based scoring using Okapi BM-25
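A minimal sketch of this retrieval step, assuming a local Elasticsearch instance, the official Python client, and a hypothetical index reddit_threads with a title text field (index and field names are illustrative, not the thesis setup); BM25 is Elasticsearch's default similarity:

# Sketch: top-k thread retrieval for a news-story query (BM25 scoring).
# Assumes elasticsearch-py 8.x and a hypothetical index "reddit_threads".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top_k_threads(news_summary, k=10):
    resp = es.search(
        index="reddit_threads",                    # illustrative index name
        query={"match": {"title": news_summary}},  # scored with the default BM25 similarity
        size=k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

threads = top_k_threads("U.S. President Barack Obama announces the resumption "
                        "of normal relations between the U.S. and Cuba.")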
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k threads (and their linked comments)
• Text processing: text normalisation, named entity tagging, text representation (for better topical clustering), sentiment tagging
Text Pre-processing
• Remove URLs (using Twokenizer) and non-alphanumeric symbols; sentence tokenisation (NLTK sentence tokenizer)
• Sentiment analysis: VADER (rule-based sentiment tagger)
• Part-of-speech / named entity tagging: Senna tagger (neural-network-architecture-based tagger)
1. Duplicate named entities, because entity-based topic types are more frequent in social media.
2. Select words according to their Penn Treebank part-of-speech tags (following Centering Theory).
3. Lemmatise the selected words (NLTK lemmatiser).
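A rough sketch of these steps using off-the-shelf NLTK components (the thesis uses Twokenizer, the Senna tagger, and named-entity duplication, which are not reproduced here; the POS-tag selection and resource names below are illustrative assumptions):

# Sketch: simplified text pre-processing for a single comment.
# Requires: nltk.download("punkt"), ("averaged_perceptron_tagger"),
# ("wordnet"), ("vader_lexicon").
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

KEEP = {"NN", "NNS", "NNP", "NNPS", "JJ", "RB",
        "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}     # assumed Penn Treebank selection
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def preprocess(comment):
    text = re.sub(r"https?://\S+", " ", comment)     # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)      # remove non-alphanumeric symbols
    tokens = []
    for sent in nltk.sent_tokenize(text):            # sentence tokenisation
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            if tag in KEEP:                           # select words by POS tag
                tokens.append(lemmatizer.lemmatize(word.lower()))
    sentiment = vader.polarity_scores(comment)["compound"]  # rule-based sentiment score
    return tokens, sentiment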
Text Pre-processing
“The original article i read a couple months ago was in Der Spiegel and said nothing of a new or
alternative party, although its possible i forgot.”
['DER_SPIEGEL', 'DER_SPIEGEL', 'original', 'article', 'read', 'couple', 'month',
'ago', 'der', 'spiegel', 'said', 'nothing', 'new', 'alternative', 'party', 'possible',
'forgot', 'here', 'related', 'article']
"The *Titanic* has hit an iceberg - and takes on more passengers …P.S. Yeah, keep those downvotes
coming: they won't change reality, e.g. the unemployment figures in the Eurozone."
['TITANIC', 'TITANIC', 'EUROZONE', 'EUROZONE', 'titanic', 'ha', 'hit', 'iceberg',
'take', 'on', 'more', 'passenger', 'keep', 'downvotes', 'coming', 'won', 'change',
'reality', 'figure', 'eurozone']
Named Entity Named Entity word word word
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k graphs (thread + comments)
• Text processing: text normalisation, named entity tagging, text representation, sentiment tagging
• Topic modelling: topic extraction, topic tagging
Clustering
• There are many clustering and topic modelling techniques: k-means, hierarchical clustering, frequent itemset clustering, LDA, pLSA.
• Challenges in modelling topics for Reddit comments:
• Comments are short: a problem for k-means, hierarchical clustering, LDA.
• The number of topics is not known in advance: a problem for LDA, pLSA.
• Topical clusters are hard to interpret: a problem for LDA.
• Ungrammatical sentences and sentence fragments: relations from collapsed typed dependencies cannot be extracted accurately.
Topic Modelling
Clustering: Dirichlet Multinomial Mixture Model (DMM)

Notation:
V: number of words in the vocabulary
D: number of documents in the corpus
\bar{L}: average length of documents
\vec{d}: documents in the corpus
\vec{z}: cluster labels of each document
I: number of iterations
m_z: number of documents in cluster z
n_z: number of words in cluster z
n_z^w: number of occurrences of word w in cluster z
N_d: number of words in document d
N_d^w: number of occurrences of word w in document d

A document d is generated by (1) selecting a mixture component (cluster) k and (2) letting the selected cluster k generate d, so P(d) is the sum of the total probability over all mixture components:

P(d) = \sum_{k=1}^{K} P(d \mid z = k) \, P(z = k)

where K is the number of mixture components (clusters). DMM [41] makes the Naive Bayes assumption:
• the words in a document are generated independently when the document's cluster label k is known;
• the probability of a word is independent of its position within the document.

Hence the probability of document d being generated by cluster k is

P(d \mid z = k) = \prod_{w \in d} P(w \mid z = k)

It [41] assumes that each mixture component (cluster) is a multinomial distribution over words, with a Dirichlet prior:

P(w \mid z = k) = P(w \mid z = k, \Phi) = \phi_{k,w}, \quad \sum_{w=1}^{V} \phi_{k,w} = 1, \quad P(\Phi \mid \vec{\beta}) = \mathrm{Dir}(\vec{\phi}_k \mid \vec{\beta})

The weight of each mixture component (cluster) is sampled from a multinomial distribution, with a Dirichlet prior for that multinomial distribution:

P(z = k) = P(z = k \mid \Theta) = \theta_k, \quad \sum_{k=1}^{K} \theta_k = 1, \quad P(\Theta \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} \mid \vec{\alpha})

[Figure 1: graphical model of DMM, with plates over the D documents and K clusters, variables z and d, and hyper-parameters \alpha and \beta]

Collapsed Gibbs sampling for GSDMM is introduced in [59]: documents are randomly assigned to K clusters initially, and the cluster labels \vec{z}, the document counts m_z, the word counts n_z and the per-word counts n_z^w are recorded. The documents are then traversed for I iterations; in each iteration, each document d is reassigned to a cluster according to the conditional distribution

P(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}
Topic Modelling
Gibbs Sampling for the Dirichlet Multinomial Mixture Model (DMM)
"A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering", Jianhua Yin, Tsinghua University, Beijing, China.
• They introduce the collapsed Gibbs sampling algorithm (GSDMM) for DMM.
• In each iteration, a cluster is sampled for each document d according to

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{p(\vec{d}, \vec{z} \mid \vec{\alpha}, \vec{\beta})}{p(\vec{d}_{\neg d}, \vec{z}_{\neg d} \mid \vec{\alpha}, \vec{\beta})} \propto \frac{\Delta(\vec{m} + \vec{\alpha})}{\Delta(\vec{m}_{\neg d} + \vec{\alpha})} \cdot \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{n}_{z,\neg d} + \vec{\beta})}
\propto \frac{\Gamma(m_z + \alpha)}{\Gamma(m_{z,\neg d} + \alpha)} \cdot \frac{\Gamma(D - 1 + K\alpha)}{\Gamma(D + K\alpha)} \cdot \frac{\prod_{w \in d} \Gamma(n_z^w + \beta)}{\prod_{w \in d} \Gamma(n_{z,\neg d}^w + \beta)} \cdot \frac{\Gamma(n_{z,\neg d} + V\beta)}{\Gamma(n_z + V\beta)}

where m_z = m_{z,\neg d} + 1, n_z = n_{z,\neg d} + N_d, and the subscript \neg d means cluster z without document d. Because the Gamma function satisfies \Gamma(x + m) / \Gamma(x) = \prod_{i=1}^{m} (x + i - 1), this can be rewritten as

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \Gamma(n_z^w + \beta) / \Gamma(n_{z,\neg d}^w + \beta)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}

• If each word can appear at most once in a document, the numerator of the second part reduces to \prod_{w \in d} (n_{z,\neg d}^w + \beta), since n_z^w = n_{z,\neg d}^w + 1.
• If a word may appear multiple times in a document, it becomes \prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1), since n_z^w = n_{z,\neg d}^w + N_d^w, which gives the sampling formula shown on the previous slide.
Topic Modelling
Collapsed Gibbs Sampling Algorithm [9]
• Initialisation: assign every document to one of K clusters at random and record the counts m_z, n_z and n_z^w.
• In each iteration, sample a cluster for every document according to

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}

until the clusters are stable.

Meaning of alpha and beta:
• The first part of the formula relates \alpha to cluster popularity: clusters that already contain more documents are more likely to be chosen. \alpha is set to 0.1 for our task.
• The second part relates \beta to the similarity of interests: a document is more likely to join a cluster whose documents share more of its words. \beta is set to 0.2 for our task.
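A compact sketch of this sampler (a minimal GSDMM in the spirit of the algorithm above, not the thesis code; docs is a list of token lists, K is an upper bound on the number of clusters, and alpha = 0.1, beta = 0.2 as stated above):

# Sketch: collapsed Gibbs sampling for DMM (GSDMM) over short documents.
import math, random
from collections import Counter, defaultdict

def gsdmm(docs, K=40, alpha=0.1, beta=0.2, iterations=15):
    V = len({w for d in docs for w in d})              # vocabulary size
    m_z = [0] * K                                      # documents per cluster
    n_z = [0] * K                                      # words per cluster
    n_zw = [defaultdict(int) for _ in range(K)]        # word counts per cluster
    z_d = []

    def move(d, z, sign):                              # add (+1) or remove (-1) document d
        m_z[z] += sign
        n_z[z] += sign * len(d)
        for w in d:
            n_zw[z][w] += sign

    for d in docs:                                     # random initialisation
        z = random.randrange(K)
        z_d.append(z)
        move(d, z, +1)

    for _ in range(iterations):
        for i, d in enumerate(docs):
            move(d, z_d[i], -1)                        # exclude document i (the "not d" counts)
            log_p = []
            for z in range(K):
                lp = math.log(m_z[z] + alpha)          # cluster popularity (alpha part)
                for w, c in Counter(d).items():        # shared words (beta part)
                    for j in range(c):
                        lp += math.log(n_zw[z][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n_z[z] + V * beta + j)
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z_d[i] = random.choices(range(K), weights=weights)[0]
            move(d, z_d[i], +1)
    return z_d                                         # cluster label of each document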
Topic Modelling
Dirichlet Multinomial Mixture Model (DMM) with the collapsed Gibbs sampling algorithm:
• The number of clusters is inferred automatically.
• It balances the completeness and the homogeneity of the clusters.
• It converges fast.
• It copes with the sparse, high-dimensional nature of short texts.
• The representative words of each cluster (similar to pLSA and LDA) are the most frequent words in the cluster.
We get excellent results: the majority of the topic clusters can be easily interpreted from their representative words.
For the thread "Germany won the World Cup in Brazil", some comments are in Portuguese and German. Our topic modelling approach clusters these comments by language, so we get one cluster of comments in German and one in Portuguese.
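A small follow-up sketch for reading off the representative words, which here are simply the most frequent words per cluster (the function and variable names are illustrative):

# Sketch: top-n most frequent words of each cluster found by the sampler above.
from collections import Counter, defaultdict

def representative_words(docs, labels, top_n=10):
    counts = defaultdict(Counter)
    for doc, z in zip(docs, labels):
        counts[z].update(doc)
    return {z: [w for w, _ in c.most_common(top_n)] for z, c in counts.items()}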
Topic Modelling
Example:
Query: December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
2,691 comments from the top 10 threads (Rank 1 … Rank K).
27 topical clusters are extracted (Topic 1, Topic 2, Topic 3, …, Topic k).
Topic Modelling
Example: the top 10 retrieved threads
• Obama announces historic overhaul of relations; Cuba releases American
• Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties
• Most Americans Support Renewed U.S.-Cuba Relations
• Raul Castro: US Must Return Guantanamo for Normal Relations
• Russian foreign minister praises new U.S.-Cuba relations
• U.S. Approves Ferry Service Between Florida and Cuba
• US, Cuba restore full diplomatic relations after 54 years
• President Barack Obama announced Wednesday that the U.S. and Cuba will reopen their embassies in Havana and Washington, heralding a "new chapter" in relations after a half-century of hostility.
• Raul Castro: U.S. must return Guantanamo for normal relations
• U.S. Takes Cuba off Terror List, Paving the Way for Normal Ties
Although the top 10 threads are all about Cuba and the U.S., more relevant topics are discovered in their linked comments.
Topic Modelling
Clustering
27 topical clusters were extracted from the 2,691 comments for the query:

Topic Index | Number of Comments | Topic Words (top 10 most frequent words)
8  | 18  | war libya utf partagas haiti 69i god somalia isil pakistan
12 | 7   | cuban statistic government cuba un mean have independent number ha
11 | 21  | mexico cuba gulf gitmo america gtmo navy panamacanal small control
10 | 22  | ftfy nixon cheney nato un germany lincoln still facebook republican
13 | 57  | tropico isi terrorist just know order have people cia drone
38 | 218 | cuba america cia germany turkey have soviet japan castro war
15 | 101 | russia ukraine america cuba russian crimea have american eu state
14 | 240 | cuban cigar cuba have tobacco people nicaragua so just dominican
17 | 10  | southafrica angola cuba south africa mozambique death get un leonardcohen
18 | 155 | cuba america canada country american mexico list china ha saudi
30 | 6   | nonmobile pollo jadehelm15 feedback mobile please counter bot non link
37 | 530 | have just re people think american thing up that so
32 | 416 | cuba cuban guantanamo american government have us castro lease country

More relevant topics in their linked comments are discovered.
Topic Modelling
Clustering (continued)

Topic Index | Number of Comments | Topic Words (top 10 most frequent words, from left to right)
25 | 94  | castro usa cuba florida soviet cuban nuke don venezuela wanted
26 | 1   | nigelthornberry
27 | 79  | gitmo guantanamo iraq wto iran naval base obama american china
46 | 729 | cuba cuban american have people obama relation country castro
45 | 8   | abbott texas voter alec id voteridlaw name paulweyrich heritageinstitute co
42 | 43  | lincoln washington roosevelt congress had mandela term unitedstate newburgh
41 | 3   | michigan ohio toledo nwo upper had state won peninsula bet
1  | 165 | obama congress republican democrat have clinton bernie that bush iowa
0  | 5   | republican cost want higher dc job aca highly tax people
3  | 23  | texas woman mexico healthcare republican mmr ha have rate mean
5  | 21  | unsc cuba us un padron iraq cigar uncharter ha charter
7  | 32  | cuba us cuban spanish latinamerica law treaty american spanishempire america
6  | 30  | turkey armenian turk have armenia israel havana just people up
9  | 4   | erdogan ataturk turk hitler mhp chp kurd turkey kurdish election
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k threads
• Text processing: text normalisation, named entity tagging, text representation, sentiment tagging
• Topic modelling: topic extraction, topic tagging
• Diversification: Sainte-Laguë method, comments tree decomposition
Diversification - Sainte-Laguë Method
• Label the comments of each thread with their topic; each thread then has n topic clusters of comments (topic 1, topic 2, topic 3, …).
• Label every comment with a sentiment label (Positive / Neutral / Negative), or with an emotional label if emotion modelling works.
• Within every thread, the comments are thus grouped into topic-sentiment clusters.
Diversification - Sainte-Laguë Method
The Sainte-Laguë (SL) method [38] is a highest-quotient method for allocating seats in party-list proportional representation, used in many voting systems. We use it to retrieve representative comments proportionally from the clusters. After all the comments have been tallied, successive quotients are computed for each cluster:

quotient = V / (2S + 1)

where
• V is the total number of comments in the cluster;
• S is the number of 'seats' that cluster has been allocated so far, initially 0 for all clusters.

Whichever cluster has the highest quotient gets the next 'seat', and its quotient is recalculated given its new 'seat' total. The process is repeated until all 'seats' have been allocated. The number of 'seats' is a hyper-parameter and can be set according to users' interests.
Diversification - Sainte-Laguë Method
Example:
Five comments are to be retrieved (the number of 'seats' is 5). The denominators in the header row are 2S + 1 for S = 0, 1, 2, …; the quotients marked with '*' represent the allocated 'seats'.

Cluster          | /1  | /3     | /5 | Seats (*)
topic A positive | 50* | 16.67* | 10 | 2
topic A neutral  | 40* | 13.33* | 8  | 2
topic A negative | 30* | 10     | 6  | 1
Table 2. Sainte-Laguë method example

So for this example, 2 comments from the cluster "topic A positive", 2 comments from "topic A neutral", and 1 comment from "topic A negative" are retrieved. We then retrieve n comments proportionally from the diverse clusters to form a result that is concise and diverse (Figure 10).
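A minimal sketch of the allocation loop, reproducing the example above (the function name is illustrative; cluster_sizes maps a cluster to V, its number of comments):

# Sketch: Sainte-Laguë seat allocation over topic-sentiment clusters.
def sainte_lague(cluster_sizes, seats):
    allocated = {c: 0 for c in cluster_sizes}          # S, initially 0 for all clusters
    for _ in range(seats):
        # quotient = V / (2S + 1); the highest quotient wins the next seat
        winner = max(cluster_sizes,
                     key=lambda c: cluster_sizes[c] / (2 * allocated[c] + 1))
        allocated[winner] += 1
    return allocated

# The example above: 5 seats over the three "topic A" clusters
print(sainte_lague({"topic A positive": 50,
                    "topic A neutral": 40,
                    "topic A negative": 30}, 5))
# -> {'topic A positive': 2, 'topic A neutral': 2, 'topic A negative': 1}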
Diversification - Sainte-Laguë Method
We apply the SL method to the clusters at every rank; comments with a higher user score are selected first.

retrieval number = min(γ · N_{cl,rj}, |C_{i,rj}|)

• N_{rj} is the number of retrieved comments at rank r_j (r_j ∈ R);
• γ is a positive constant controlling the retrieval scale;
• |C_{i,rj}| is the number of comments C_i at rank r_j.
• The representative comments are then retrieved proportionally from the topic-sentiment clusters at every rank.
Sainte-Laguë Method
[Figure: pseudo search result (threads and comments at rank 1, rank 2, rank 3, …) vs. the diversified search result]
Comments Tree Decomposition
Comments tree (for a thread at a given rank):
• Levels (shown in different colours) represent the coherence:
• a comment at a lower level is a reply to the one at the higher level;
• comments at the same level are independent.
Comments Tree Decomposition
Set the decomposition level at 1.
Comments Tree Decomposition
[Figure: example comment tree with comments at levels 0 to 4]
Enumerate the paths from level 0 to level m.
Comments Tree Decomposition
[Figure: the enumerated paths of the example tree give sub-tree 1, sub-tree 2, sub-tree 3 and sub-tree 4]
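A minimal sketch of one plausible reading of the decomposition, where every sub-tree is the path from the root comment (level 0) down to a leaf, cut off at the decomposition level (the (comment, children) node representation and the function name are assumptions of this sketch):

# Sketch: decompose one comment tree into root-to-leaf paths up to `level`.
def decompose(node, level=5, depth=0, prefix=None):
    comment, children = node                   # node = (comment, list of child nodes)
    path = (prefix or []) + [comment]
    if not children or depth == level:         # leaf reached or decomposition level hit
        return [path]
    subtrees = []
    for child in children:
        subtrees.extend(decompose(child, level, depth + 1, path))
    return subtrees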
Example:
Query: December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
• When the decomposition level is set at 5, the number of sub-trees for each retrieved thread is as follows:

Thread                                      | 1 | 2  | 3  | 4 | 5  | 6   | 7   | 8 | 9   | 10
Number of trees (per thread)                | 1 | 8  | 8  | 3 | 15 | 49  | 90  | 2 | 127 | 11
Number of decomposed sub-trees (per thread) | 1 | 19 | 17 | 3 | 35 | 165 | 342 | 2 | 625 | 11
Comments Tree Decomposition
[Figure: sub-tree 1 vs. sub-tree 2]
Select one sub-tree according to the sub-tree score.
Comments Tree Decomposition
Sub-tree scoring criteria:
• Comment score: each comment has a score given by users; the sub-tree score is the sum of the user scores of the comments in the sub-tree.
• Linguistic features: score the sub-tree by the diversity of the linguistic features of its comments; the linguistic features we propose are NP words (words that can potentially form noun phrases), named entities, and bigrams.
• Number of topics: the diversity of the topic tags of the comments in the sub-tree.
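A short sketch of the sub-tree selection using these three criteria, assuming each comment is a dict with a user "score" and a "topic" tag and that a features(comment) helper (NP words, named entities or bigrams) exists; all of these names are illustrative:

# Sketch: score decomposed sub-trees and keep the best one per original tree.
def user_score(subtree):
    return sum(c["score"] for c in subtree)                  # sum of user scores

def feature_diversity(subtree, features):
    return len({f for c in subtree for f in features(c)})    # distinct linguistic features

def topic_diversity(subtree):
    return len({c["topic"] for c in subtree})                # distinct topic tags

def best_subtree(subtrees, scorer=user_score):
    return max(subtrees, key=scorer)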
Experiment Setup
Data: 26,669,242 Reddit comments and 845,004 threads from the years 2008-2015
Subreddits: worldnews / politics
Queries: 50 news summaries from Wikinews, 2011-2014
Ranking: Elasticsearch with Okapi BM-25 scoring; we choose the top 10 threads and their linked comments; on average 4,330.7 comments are retrieved per query.
Experiment Evaluation
• We use Cumulative Gain (CG) to measure diversity; CG also penalises redundancy.
• Charles L. A. Clarke claims [7] that CG at rank k can be used directly as a diversity evaluation measure.
• Al-Maskari [8] provides evidence that CG correlates better with user satisfaction than Normalised Discounted Cumulative Gain (nDCG).
Experiment Evaluation

CG[k] = \sum_{j=1}^{k} G[j]

G[k] = \sum_{i=1}^{m} J(d_k, i) (1 - \alpha)^{r_{i,k-1}}

where
• m is the number of nuggets;
• J(d, i) = 1 if comment d contains nugget n_i, otherwise J(d, i) = 0;
• r_{i,k-1} = \sum_{j=1}^{k-1} J(d_j, i) is the number of comments d_j ranked up to position k - 1 that contain nugget n_i;
• \alpha is a constant with 0 < \alpha \le 1 reflecting the possibility of assessor error; we set \alpha to 0.5 in our experiment;
• k is set to 10 because we choose the top 10 threads and their linked comments.

A comment receives a positive sentiment tag when its score is above 0.1, a negative tag when it is between -0.1 and -1, or a neutral tag when it falls between 0.2 and -0.2.

Experiment for the Sainte-Laguë (SL) method: we set γ = 2.5 to compute the number of retrieved comments for each of the top 10 threads. We use the topic-sentiment tag (the combination of the topic and sentiment tags) as the nugget, and compute CG for the pseudo search result and for the diversified search result obtained with the SL method for each query; the average CG over the 50 queries is reported in Table 3 below.
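A small sketch of the CG computation used here, assuming each ranked comment is represented by the set of topic-sentiment nuggets it contains (this representation and the function name are ours, not the thesis code):

# Sketch: CG[k] with redundancy penalty (1 - alpha)^r for repeated nuggets.
def cumulative_gain(ranked_nuggets, alpha=0.5, k=10):
    seen = {}                                   # r_i: how often nugget i appeared so far
    cg = 0.0
    for d in ranked_nuggets[:k]:
        for nugget in d:                        # J(d, i) = 1 for every nugget d contains
            cg += (1 - alpha) ** seen.get(nugget, 0)
            seen[nugget] = seen.get(nugget, 0) + 1
    return cg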
Experiment Evaluation

Table 3. Sainte-Laguë method experiment result
retrieval result           | CG            | retrieval percentage
diversified result with SL | 71.80 ± 44.31 | 16.60%
pseudo search result       | 50.37 ± 27.29 | 100%

Table 4. Experiment with the CTD method
retrieval result     | CG            | retrieval percentage
CTD comment score    | 27.11 ± 11.19 | 70.51%
CTD NP words         | 27.31 ± 10.81 | 70.67%
CTD named entities   | 27.54 ± 11.39 | 59.17%
CTD bigrams          | 26.60 ± 10.63 | 70.77%
CTD number of topics | 28.45 ± 11.72 | 73.26%
pseudo search result | 26.38 ± 9.8   | 100%
Sainte-Laguë Method
• The SL method shows that the diversified search results achieve a tremendous improvement in diversity while using, on average, only 16.60% of the comments from the pseudo search result.
• The increase in diversity is foreseeable, because comments are retrieved directly and proportionally from the topic-sentiment clusters. The SL method proves effective, at the expense of the coherence of the discussion.
Experiment Evaluation
Comment Tree Decomposition
• CTD also demonstrates its effectiveness in reducing redundancy and improving the diversity of the pseudo search results while using fewer comments.
• Small-scale comment trees also maintain the coherence and conversational style of the discussion.
Conclusion
• We proposed novel methods to distil diverse, interpretable topics from the pseudo search result, using a topic model together with effective text processing.
• We studied the characteristics of Reddit comments.
• We introduced two diversification methods, the Sainte-Laguë (SL) method and the Comment Tree Decomposition (CTD) method, to reduce redundancy and diversify the returned results.
• According to the experimental results, both methods prove to be effective diversification techniques. The SL method treats comments as individual entities, while CTD preserves the conversational style of the discussions.
References
[1] R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong. Diversifying Search Results. WSDM 2009.
[2] R. L. T. Santos, C. Macdonald, I. Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010.
[3] Van Dang and W. Bruce Croft. Diversity by Proportionality: An Election-based Approach to Search Result Diversification. SIGIR 2012.
[4] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
[5] Van Dang and W. Bruce Croft. Term Level Search Result Diversification. SIGIR 2013.
[6] C. Zhai, W. W. Cohen, J. Lafferty. Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.
Thank you !
Dankeschön!
谢谢你们!
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
LDA on social bookmarking systems
LDA on social bookmarking systemsLDA on social bookmarking systems
LDA on social bookmarking systems
 
Collaborative DL
Collaborative DLCollaborative DL
Collaborative DL
 

Recently uploaded

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 

Diversified Social Media Retrieval for News Stories

  • 12. Related Work • Research focusing on summarizing social media data due to the large volume : • Continuous summarization of evolving tweet streams. L. Shou, Z. Wang, K. Chen, and G. Chen. SumblrIn SIGIR, 2013. • Hierarchical multi-label classification of social text streams. Z. Ren, M.- H. Peetz, S. Liang, W. van Dolen, and In SIGIR, 2014. • Summarizing web forum threads based on a latent topic propagation process. Z. Ren, J. Ma, S. Wang, and Y. Liu. In CIKM, 2011 • Topic sentiment mixture: modeling facets and opinions in weblogs. Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. In WWW, 2007. • Entity-centric topic-oriented opinion summarization in twitter. X. Meng, F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang. In KDD, 2012.
  • 13. Related Work. No retrieval diversification work has been done on unedited, coherent, short, tree-structured comments.
  • 15. Solution Overview elastic search system text pre-processing topic modelling diversification • Top k threads Text-based scoring using Okapi BM-25 (linked comments)
  • 16. Solution Overview elastic search system text processing topic modelling diversification • Top k threads • Text Normalisation • Named Entity Tagging • Text Representation (for better topical clustering) • Sentiment Tagging (linked comments)
  • 17. Text Pre-processing. • Remove URLs (using Twokenizer) and non-alphanumeric symbols; sentence tokenisation (NLTK sentence tokenizer). • Sentiment analysis with VADER (a rule-based sentiment tagger). • Part-of-speech / named entity tagging with the Senna tagger (a neural-network-architecture-based tagger). 1. Duplicate named entities, because entity-based topic types are more common in social media. 2. Select words according to their Penn Treebank part-of-speech tags (following Centering Theory). 3. Lemmatise the selected words (NLTK lemmatiser).
  • 19. Text Pre-processing. Example 1: “The original article i read a couple months ago was in Der Spiegel and said nothing of a new or alternative party, although its possible i forgot.” → ['DER_SPIEGEL', 'DER_SPIEGEL', 'original', 'article', 'read', 'couple', 'month', 'ago', 'der', 'spiegel', 'said', 'nothing', 'new', 'alternative', 'party', 'possible', 'forgot', 'here', 'related', 'article']. Example 2: "The *Titanic* has hit an iceberg - and takes on more passengers …P.S. Yeah, keep those downvotes coming: they won't change reality, e.g. the unemployment figures in the Eurozone." → ['TITANIC', 'TITANIC', 'EUROZONE', 'EUROZONE', 'titanic', 'ha', 'hit', 'iceberg', 'take', 'on', 'more', 'passenger', 'keep', 'downvotes', 'coming', 'won', 'change', 'reality', 'figure', 'eurozone']. (Duplicated named entities are shown in upper case; the remaining tokens are selected and lemmatised words.)
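To make the pipeline concrete, here is a minimal sketch of the pre-processing steps using plain NLTK in place of the Twokenizer/Senna setup from the thesis; the POS filter, sentiment thresholds and entity-duplication factor shown here are assumptions, not the configuration used in the experiments.

# Requires the NLTK data packages: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words, wordnet, vader_lexicon.
import re
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

KEEP_TAGS = ("NN", "VB", "JJ", "RB")        # assumed Penn Treebank prefixes to keep
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def preprocess(comment):
    text = re.sub(r"https?://\S+", " ", comment)       # remove URLs
    text = re.sub(r"[^0-9A-Za-z'\s]", " ", text)        # remove non-alphanumeric symbols
    tokens = word_tokenize(text)

    # sentiment tag from the compound VADER score (thresholds are assumptions)
    score = vader.polarity_scores(comment)["compound"]
    sentiment = "positive" if score > 0.2 else "negative" if score < -0.2 else "neutral"

    # duplicate named entities so entity-based topics carry more weight
    entities = []
    for node in ne_chunk(pos_tag(tokens)):
        if hasattr(node, "label"):                       # subtrees are named entities
            entities.append("_".join(w for w, _ in node.leaves()).upper())
    bag = entities * 2

    # keep content words by POS prefix and lemmatise them
    for word, tag in pos_tag(tokens):
        if tag.startswith(KEEP_TAGS):
            bag.append(lemmatizer.lemmatize(word.lower()))
    return bag, sentiment

print(preprocess("The original article I read was in Der Spiegel."))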
  • 20. Solution Overview elastic search system text processing topic modelling diversification • Top k graphs • Topic Tagging • Topic Extraction • Text Normalisation • Named Entity Tagging • Text Representation • Sentiment Tagging (thread+comments)
  • 21. Clustering / Topic Modelling. • There are many clustering and topic modelling techniques: k-means, hierarchical clustering, frequent set clustering, LDA, pLSA. • Challenges in modelling topics for Reddit comments (and the techniques they affect): comments are short (k-means, hierarchical clustering, LDA); the number of topics is unpredictable (LDA, pLSA); topical clusters are hard to interpret (LDA); sentences are ungrammatical and fragmentary, so relations from collapsed typed dependencies cannot be accurately extracted.
  • 22. Clustering / Topic Modelling. Dirichlet Multinomial Mixture Model (DMM). The probability of a document is the sum of the total probability over all mixture components: P(d) = \sum_{k=1}^{K} P(d \mid z = k)\, P(z = k), where K is the number of mixture components (clusters). The model [41] assumes that the words in a document are generated independently once the document's cluster label k is known, and that the probability of a word is independent of its position within the document. Each mixture component (cluster) is a multinomial distribution over words with a Dirichlet prior: P(w \mid z = k, \Phi) = \phi_{k,w} with \sum_{w=1}^{V} \phi_{k,w} = 1 and P(\Phi \mid \vec{\beta}) = \mathrm{Dir}(\vec{\phi}_k \mid \vec{\beta}). The cluster weights are likewise multinomial with a Dirichlet prior: P(z = k \mid \Theta) = \theta_k with \sum_{k=1}^{K} \theta_k = 1 and P(\Theta \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} \mid \vec{\alpha}). Generative process: 1. select a mixture component (cluster) k; 2. the selected mixture component (cluster) k generates document d.
  • 23. Clustering / Topic Modelling. DMM notation (from the graphical model): V = number of words in the vocabulary; D = number of documents in the corpus; \bar{L} = average length of documents; \vec{d} = documents in the corpus; \vec{z} = cluster labels of each document; I = number of iterations; m_z = number of documents in cluster z; n_z = number of words in cluster z; n_z^w = number of occurrences of word w in cluster z; N_d = number of words in document d; N_d^w = number of occurrences of word w in document d.
  • 24. Topic Modelling. Gibbs Sampling for the Dirichlet Multinomial Mixture Model (GSDMM): "A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering", Jianhua Yin, Tsinghua University, Beijing, China. The paper introduces a collapsed Gibbs sampling algorithm for DMM [59]: documents are randomly assigned to K clusters initially, and the statistics \vec{z}, m_z, n_z and n_z^w are recorded. In each iteration, a cluster is sampled for every document d according to
    p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}
  • 25. Topic Modelling. The subscript \neg d denotes the statistics of cluster z computed without document d: the first factor favours clusters that already hold many documents, the second favours clusters whose documents share many words with d.
  • 26. Topic Modelling. When each word is assumed to appear at most once per document, the word-fit factor simplifies to \prod_{w \in d} (n_{z,\neg d}^w + \beta); the form above is the general multi-occurrence version (equation 4 in the paper).
  • 27. Topic Modelling. Collapsed Gibbs Sampling Algorithm [9]: in each iteration, a cluster is sampled for every document according to the conditional above, until the clusters are stable. \alpha relates to cluster popularity (set at 0.1 for our task); \beta relates to the attraction of clusters with similar word interests (set at 0.2 for our task).
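The sampler is compact enough to sketch directly. The following is a minimal, illustrative implementation of the collapsed Gibbs sampler for DMM using the conditional above; the hyper-parameter defaults, the toy corpus and all names are assumptions, not the thesis code, and no numerical safeguards (e.g. log-space computation for long documents) are included.

import random
from collections import Counter, defaultdict

def gsdmm(docs, K=40, alpha=0.1, beta=0.2, iters=30, seed=0):
    """docs: list of token lists. Returns one cluster label per document."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    D = len(docs)

    z = [0] * D                                     # cluster label of each document
    m_z = [0] * K                                   # documents in cluster z
    n_z = [0] * K                                   # words in cluster z
    n_zw = [defaultdict(int) for _ in range(K)]     # occurrences of word w in cluster z

    def move(d, k, sign):
        m_z[k] += sign
        n_z[k] += sign * len(docs[d])
        for w in docs[d]:
            n_zw[k][w] += sign

    for d in range(D):                              # random initialisation
        z[d] = rng.randrange(K)
        move(d, z[d], +1)

    for _ in range(iters):
        for d in range(D):
            move(d, z[d], -1)                       # statistics without document d
            counts = Counter(docs[d])
            weights = []
            for k in range(K):
                p = (m_z[k] + alpha) / (D - 1 + K * alpha)    # cluster popularity
                for w, c in counts.items():                   # word fit, multi-occurrence form
                    for j in range(1, c + 1):
                        p *= n_zw[k][w] + beta + j - 1
                for i in range(1, len(docs[d]) + 1):
                    p /= n_z[k] + V * beta + i - 1
                weights.append(p)
            r = rng.random() * sum(weights)
            for k, wgt in enumerate(weights):                 # sample the new cluster
                r -= wgt
                if r <= 0:
                    break
            z[d] = k
            move(d, k, +1)
    return z

toy = [["cuba", "embargo", "obama"], ["cuba", "castro", "cigar"],
       ["cigar", "tobacco", "cuba"], ["obama", "congress", "republican"]]
print(gsdmm(toy, K=4, iters=20))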
  • 28. Topic Modelling. The Dirichlet Multinomial Mixture model (DMM) with the collapsed Gibbs sampling algorithm: • the number of clusters is inferred automatically; • it balances the completeness and homogeneity of the clusters; • it converges fast; • it copes with the sparsity and high dimensionality of short texts; • the representative words of each cluster (similar to pLSA and LDA) are the most frequent words in that cluster. We obtain excellent results: the majority of the topic clusters can be easily interpreted from their representative words. For the thread "Germany won the World Cup in Brazil", some comments are in Portuguese and German, and our topic modelling approach clusters the comments by language, giving one cluster of German comments and one of Portuguese comments.
  • 29. Topic Modelling Example. Query: December 17 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba. From the 2691 comments linked to the top 10 threads, 27 topical clusters are extracted (diagram: threads at ranks 1 to K mapped to Topic 1 ... Topic k).
  • 30. Topic Modeling • Obama announces historic overhaul of relations; Cuba releases American • Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties • Most Americans Support Renewed U.S.-Cuba Relations • Raul Castro: US Must Return Guantanamo for Normal Relations • Russian foreign minister praises new U.S.-Cuba relations • U.S. Approves Ferry Service Between Florida and Cuba • US, Cuba restore full diplomatic relations after 54 years • President Barack Obama announced Wednesday that the U.S. and Cuba will reopen their embassies in Havana and Washington, heralding a "new chapter" in relations after a half-century of hostility. • Raul Castro: U.S. must return Guantanamo for normal relations • U.S. Takes Cuba off Terror List, Paving the Way for Normal Ties Although the Top 10 threads are all about Cuba and the US Topic Modelling
  • 31. Clustering (Topic Modelling). 27 topical clusters extracted from 2691 comments for the query; more relevant topics in the linked comments are discovered. Columns: Topic Index | Number of Comments | Topic Words (top 10 most frequent words):
    8  | 18  | war libya utf partagas haiti 69i god somalia isil pakistan
    12 | 7   | cuban statistic government cuba un mean have independent number ha
    11 | 21  | mexico cuba gulf gitmo america gtmo navy panamacanal small control
    10 | 22  | ftfy nixon cheney nato un germany lincoln still facebook republican
    13 | 57  | tropico isi terrorist just know order have people cia drone
    38 | 218 | cuba america cia germany turkey have soviet japan castro war
    15 | 101 | russia ukraine america cuba russian crimea have american eu state
    14 | 240 | cuban cigar cuba have tobacco people nicaragua so just dominican
    17 | 10  | southafrica angola cuba south africa mozambique death get un leonardcohen
    18 | 155 | cuba america canada country american mexico list china ha saudi
    30 | 6   | nonmobile pollo jadehelm15 feedback mobile please counter bot non link
    37 | 530 | have just re people think american thing up that so
    32 | 416 | cuba cuban guantanamo american government have us castro lease country
  • 32. Clustering (Topic Modelling), continued. Columns: Topic Index | Number of Comments | Topic Words (top 10 most frequent words, left to right):
    25 | 94  | castro usa cuba florida soviet cuban nuke don venezuela wanted
    26 | 1   | nigelthornberry
    27 | 79  | gitmo guantanamo iraq wto iran naval base obama american china
    46 | 729 | cuba cuban american have people obama relation country castro
    45 | 8   | abbott texas voter alec id voteridlaw name paulweyrich heritageinstitute co
    42 | 43  | lincoln washington roosevelt congress had mandela term unitedstate newburgh
    41 | 3   | michigan ohio toledo nwo upper had state won peninsula bet
    1  | 165 | obama congress republican democrat have clinton bernie that bush iowa
    0  | 5   | republican cost want higher dc job aca highly tax people
    3  | 23  | texas woman mexico healthcare republican mmr ha have rate mean
    5  | 21  | unsc cuba us un padron iraq cigar uncharter ha charter
    7  | 32  | cuba us cuban spanish latinamerica law treaty american spanishempire america
    6  | 30  | turkey armenian turk have armenia israel havana just people up
    9  | 4   | erdogan ataturk turk hitler mhp chp kurd turkey kurdish election
  • 33. Solution Overview elastic search system text processing topic modelling diversification • Top k threads • Topic Tagging • Topic Extraction • Text Normalisation • Named Entity Tagging • Text Representation • Sentiment Tagging • Sainte Laguë Method • Comments Tree Decomposition
  • 34. Diversification - Sainte Laguë Method. Label the comments from each thread with their topic cluster; each thread then has n topic clusters of comments (diagram: topic 1, topic 2, topic 3).
  • 35. Diversification - Sainte Laguë Method. Label every comment with a sentiment label, or an emotional label if emotion modelling works: positive, neutral or negative.
  • 36. Diversification - Sainte Laguë Method. (Diagram: the comments of each ranked thread split into topic-sentiment clusters.)
  • 38. Diversification - Sainte Laguë Method. We propose the Sainte-Laguë (SL) method to diversify the search result by retrieving the representative comments proportionally from L_{r_j}. The SL method [38] is a highest-quotient method for allocating seats in party-list proportional representation, used in many voting systems. After all the comments have been tallied, successive quotients are computed for each cluster: quotient = V / (2S + 1) (equation 11), where V is the total number of comments in the cluster and S is the number of 'seats' that cluster has been allocated so far, initially 0 for all clusters. Whichever cluster has the highest quotient gets the next 'seat' allocated, and its quotient is recalculated given its new 'seat' total. The process is repeated until all 'seats' have been allocated. The number of 'seats' is a hyper-parameter and can be set according to users' interests.
  • 39. Diversification - Sainte Laguë Method. Example: five comments are expected to be retrieved (the number of 'seats' is 5). The denominators in the first row are computed as 2S + 1 for S = 0, 1, 2, ...; the quotients marked with '*' represent allocated 'seats'.
    cluster          | /1  | /3     | /5 | seats
    topic A positive | 50* | 16.67* | 10 | 2
    topic A neutral  | 40* | 13.33* | 8  | 2
    topic A negative | 30* | 10     | 6  | 1
    So 2 comments are retrieved from cluster 'topic A positive', 2 from 'topic A neutral' and 1 from 'topic A negative'.
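As an illustration, here is a minimal sketch of the highest-quotient allocation that reproduces the example above; the function name and the dict-based interface are assumptions for the sketch, not the thesis implementation.

def sainte_lague(cluster_sizes, seats):
    """cluster_sizes: {cluster: number of comments}. Returns allocated 'seats' per cluster."""
    allocated = {c: 0 for c in cluster_sizes}
    for _ in range(seats):
        # quotient = V / (2S + 1); the cluster with the largest quotient wins the next seat
        winner = max(cluster_sizes,
                     key=lambda c: cluster_sizes[c] / (2 * allocated[c] + 1))
        allocated[winner] += 1
    return allocated

print(sainte_lague({"topic A positive": 50, "topic A neutral": 40, "topic A negative": 30}, 5))
# {'topic A positive': 2, 'topic A neutral': 2, 'topic A negative': 1}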
  • 40. Diversification - Sainte Laguë Method. We apply the SL method to the clusters of all ranks; comments with a higher user score are selected first. retrieval number = min(γ · N_{cl,r_j}, |C_{i,r_j}|), where N_{r_j} is the number of retrieved comments at rank r_j (r_j ∈ R), γ is a positive constant controlling the retrieval scale, and |C_{i,r_j}| is the number of comments in cluster C_i at rank r_j. The representative comments are then retrieved from each topic-sentiment cluster of all ranks proportionally.
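Read literally, the retrieval-scale rule above can be sketched as follows; the function and field names are illustrative assumptions.

def retrieval_number(gamma, n_rj, cluster_size):
    # min(gamma * N_rj, |C_i,rj|): capped by the retrieval scale and by the cluster size
    return min(int(gamma * n_rj), cluster_size)

def select_from_cluster(comments, gamma, n_rj):
    # comments with a higher user score are selected first
    k = retrieval_number(gamma, n_rj, len(comments))
    return sorted(comments, key=lambda c: c["score"], reverse=True)[:k]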
  • 42. (Diagram: the pseudo search result at ranks 1, 2 and 3 is reduced to a diversified search result.)
  • 43. Comments Tree Decomposition. A thread's comments form a tree. • Levels (shown with different colours) represent the coherence: • a comment at a lower level is a reply to the one at the higher level; • comments at the same level are independent.
  • 44. Comments Tree Decomposition. Set the decomposition level at 1.
  • 45. Comments Tree Decomposition. Enumerate the paths from level 0 down to level m (diagram: comments at levels 0 to 4).
  • 46.-49. Comments Tree Decomposition. (Diagrams: the enumerated paths form sub-tree 1, sub-tree 2, sub-tree 3 and sub-tree 4.)
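One plausible reading of the decomposition sketched in slides 44-49 is to enumerate, below the chosen decomposition level, every path down to a leaf as one sub-tree; the Comment class and the exact rule here are assumptions for illustration, not the thesis implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    text: str
    score: int = 0
    topic: int = -1
    children: List["Comment"] = field(default_factory=list)

def decompose(root: Comment, level: int = 1):
    """Return the sub-trees, read off as root-to-leaf paths below the decomposition level."""
    # collect the comments sitting at the decomposition level
    frontier, depth = [root], 0
    while depth < level and frontier:
        frontier = [c for node in frontier for c in node.children]
        depth += 1

    def paths(node, prefix):
        prefix = prefix + [node]
        if not node.children:                     # a leaf closes one sub-tree (path)
            yield prefix
        for child in node.children:
            yield from paths(child, prefix)

    return [p for start in frontier for p in paths(start, [])]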
  • 50. Comments Tree Decomposition. Example. Query: December 17 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba. When the decomposition level is set at 5, the number of sub-trees for each retrieved thread is as follows:
    thread                                      | 1 | 2  | 3  | 4 | 5  | 6   | 7   | 8 | 9   | 10
    number of trees (per thread)                | 1 | 8  | 8  | 3 | 15 | 49  | 90  | 2 | 127 | 11
    number of decomposed sub-trees (per thread) | 1 | 19 | 17 | 3 | 35 | 165 | 342 | 2 | 625 | 11
  • 51. Comments Tree Decomposition. Select one sub-tree according to the sub-tree score (diagram: Sub-Tree 1 vs. Sub-Tree 2).
  • 52. Comments Tree Decomposition. • Comment Score: each comment has a score given by users; the sub-tree score is the sum of the user scores of the comments in the sub-tree. • Linguistic Features: score the sub-tree by the diversity of the linguistic features of the comments in the sub-tree; the linguistic features we propose are NP words (words that can potentially form noun phrases), named entities and bigrams. • Number of Topics: the diversity of the topic tags of the comments in the sub-tree.
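Minimal sketches of the three sub-tree scores above, assuming a sub-tree is a list of comment objects carrying text, score and topic attributes (as in the decomposition sketch); the bigram extraction is deliberately simplified and only illustrative.

def comment_score(subtree):
    return sum(c.score for c in subtree)          # sum of the user scores in the sub-tree

def topic_diversity(subtree):
    return len({c.topic for c in subtree})        # number of distinct topic tags

def bigram_diversity(subtree):
    bigrams = set()
    for c in subtree:
        toks = c.text.lower().split()
        bigrams.update(zip(toks, toks[1:]))       # diversity of one linguistic feature
    return len(bigrams)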
  • 53. Experiment Setup. Data: 26,669,242 Reddit comments and 845,004 threads from the years 2008-2015. Sub-reddits: worldnews / politics. Queries: 50 news summaries from Wikinews, 2011-2014. Ranking: Elasticsearch with Okapi BM-25 scoring; we choose the top 10 threads and their linked comments, which yields 4330.7 retrieved comments per query on average.
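For orientation, the retrieval step might look roughly like the following with the Python Elasticsearch client; the index name, field name and client call style are assumptions (and version-dependent), not the thesis configuration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = ("December 17 2014 - U.S. President Barack Obama announces the "
         "resumption of normal relations between the U.S. and Cuba.")
resp = es.search(index="reddit_threads",              # assumed index name
                 query={"match": {"title": query}},   # BM-25 is Elasticsearch's default similarity
                 size=10)                             # top 10 threads
top_thread_ids = [hit["_id"] for hit in resp["hits"]["hits"]]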
  • 54. Experiment Evaluation. • We use Cumulative Gain (CG) to measure diversity; CG can also penalise redundancy. • Charles L. A. Clarke claims [7] that CG at rank k can be used directly as a diversity evaluation measure. • Al-Maskari [8] provides evidence that CG correlates better with user satisfaction than Normalised Discounted Cumulative Gain (nDCG).
  • 55. Experiment Evaluation. CG[k] = \sum_{j=1}^{k} G[j], with G[k] = \sum_{i=1}^{m} J(d_k, i)\,(1 - \alpha)^{r_{i,k-1}} and r_{i,k-1} = \sum_{j=1}^{k-1} J(d_j, i), where J(d_j, i) = 1 if comment d_j contains nugget n_i and 0 otherwise. • r_{i,k-1} is the number of comments d_j ranked up to position k-1 that contain nugget n_i. • \alpha is a constant between 0 and 1 that reflects the assessor error; we set \alpha to 0.5 for our experiment.
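A direct transcription of the gain computation above; comments are represented by the set of nuggets (topic-sentiment tags) they contain, which is assumed to be given.

def cumulative_gain(ranked_comments, alpha=0.5, k=10):
    """ranked_comments: list of nugget sets (topic-sentiment tags), in rank order."""
    seen = {}                                  # nugget n_i -> r_{i,k-1}, occurrences so far
    cg = 0.0
    for nuggets in ranked_comments[:k]:
        gain = 0.0                             # G[j] for this position
        for n in nuggets:
            gain += (1 - alpha) ** seen.get(n, 0)   # J(d_j, i) * (1 - alpha)^{r_{i,j-1}}
            seen[n] = seen.get(n, 0) + 1
        cg += gain                             # CG[k] accumulates the gains
    return cg

print(cumulative_gain([{"cuba+neutral"}, {"cuba+neutral"}, {"cigar+positive"}]))  # 2.5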
  • 56. Experiment Evaluation: Sainte-Laguë Method
Table 3. Sainte-Laguë method experiment
  retrieval result             CG              retrieved percent
  diversified result with SL   71.80 ± 44.31   16.60%
  pseudo search result         50.37 ± 27.29   100%
• The SL method shows that the diversified search results achieve a substantial diversity improvement while using, on average, only 16.60% of the comments from the pseudo search result.
• The increase in diversity is expected because comments are retrieved directly from the topic-sentiment clusters with proportionality. The SL method proves effective, at the expense of coherence in the discussion.
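For illustration, a generic Sainte-Laguë apportionment over topic-sentiment clusters might look like the sketch below. This is a textbook divisor-sequence implementation and may differ in detail from the thesis's SL method (for example, how the 2.5 parameter determines the per-thread quota of comments).

```python
# Generic Sainte-Laguë apportionment sketch: distribute a fixed number of
# comment slots over clusters proportionally to their sizes.
import heapq

def sainte_lague_allocate(cluster_sizes, seats):
    """Allocate `seats` slots using the Sainte-Laguë divisors 1, 3, 5, ..."""
    allocation = [0] * len(cluster_sizes)
    # max-heap of quotients (store negatives because heapq is a min-heap)
    heap = [(-size / 1.0, i) for i, size in enumerate(cluster_sizes)]
    heapq.heapify(heap)
    for _ in range(seats):
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        divisor = 2 * allocation[i] + 1   # next odd divisor for this cluster
        heapq.heappush(heap, (-cluster_sizes[i] / divisor, i))
    return allocation

# e.g. 4 topic-sentiment clusters with 120, 60, 15, 5 comments and 10 slots:
print(sainte_lague_allocate([120, 60, 15, 5], 10))   # -> [6, 3, 1, 0]
```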
  • 57. Experiment Evaluation: Comment Tree Decomposition (CTD) Method
• CTD also demonstrates its effectiveness at reducing redundancy and improving the diversity of the pseudo search results while using fewer comments.
• The small-scale comment trees also maintain the coherence and conversational style of the discussion.
Table 4. Experiment with the CTD method
  retrieval result         CG              retrieved percent
  CTD comment score        27.11 ± 11.19   70.51%
  CTD NP words             27.31 ± 10.81   70.67%
  CTD named entities       27.54 ± 11.39   59.17%
  CTD bigrams              26.60 ± 10.63   70.77%
  CTD number of topics     28.45 ± 11.72   73.26%
  pseudo search result     26.38 ± 9.8     100%
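A highly simplified sketch of decomposing a thread's comment tree into candidate sub-trees and ranking them by a pluggable score (here the sum of Reddit comment scores, echoing the "CTD comment score" variant). The thesis's actual decomposition and scoring criteria (NP words, named entities, bigrams, number of topics) are richer, so treat this only as an assumption-laden illustration.

```python
# Toy comment-tree decomposition: every comment induces a sub-tree rooted at
# itself; sub-trees are scored by a pluggable criterion and the best are kept.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Comment:
    cid: str
    score: int                      # e.g. Reddit up-vote score
    children: List["Comment"] = field(default_factory=list)

def subtrees(root: Comment) -> List[Comment]:
    out = [root]
    for child in root.children:
        out.extend(subtrees(child))
    return out

def rank_subtrees(root: Comment, score_fn: Callable[[Comment], float], top_n: int = 3):
    """Score each sub-tree (sum of score_fn over its nodes) and keep the top_n."""
    def tree_score(node: Comment) -> float:
        return score_fn(node) + sum(tree_score(c) for c in node.children)
    return sorted(subtrees(root), key=tree_score, reverse=True)[:top_n]

# Toy thread: one root comment with two reply branches.
tree = Comment("c1", 10, [Comment("c2", 4, [Comment("c4", 1)]), Comment("c3", 7)])
print([t.cid for t in rank_subtrees(tree, lambda c: c.score)])   # ['c1', 'c3', 'c2']
```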
  • 58. Conclusion • We proposed novel methods to distill diverse, interpretable topics from the pseudo search results using topic models combined with effective text processing. • We studied the characteristics of Reddit comments. • We introduced two diversification methods, the Sainte-Laguë (SL) Method and the Comment Tree Decomposition (CTD) Method, to reduce redundancy and diversify the returned results. • According to the experimental results, both methods prove to be effective diversification techniques: the SL method treats comments as independent entities, while CTD preserves the conversational style of the discussions.
  • 59. References
[1] R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong. Diversifying Search Results. WSDM 2009.
[2] R. L. T. Santos, C. Macdonald, I. Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010.
[3] V. Dang and W. B. Croft. Diversity by Proportionality: An Election-based Approach to Search Result Diversification. SIGIR 2012.
[4] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
[5] V. Dang and W. B. Croft. Term Level Search Result Diversification. SIGIR 2013.
[6] C. Zhai, W. W. Cohen, J. Lafferty. Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.