Diversified Social Media Retrieval for News
Stories.
Bryan Hang ZHANG
Feb. 25th 2016
Master Thesis Colloquium
Department of Computational Linguistics
Supervisors:
Dr. Vinay SETTY
Prof. Dr. Günter NEUMANN
Outline
• Motivation
• Related Work
• Solution
• Experiment Evaluation
• Conclusion
• Acknowledgement
Motivation
• Social media data is generated by users constantly.
• Twitter
• Blogs
• Forums (Quora, WebMD, …)
• Comments (Reddit, Instagram, YouTube, …)
Motivation
query: news story
[Figure: retrieved threads at Rank 1, Rank 2, Rank 3, …, Rank K, each with a tree of linked comments]
When a news story summary (from Wikinews) is used to retrieve relevant information from the Reddit comments data, the following are retrieved:
• threads (relevant to the news summary)
• their linked comments (written by users)
Motivation
Tree-Structured Comments
Motivation
query (news story): December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
Retrieved threads include:
• Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties
• Most Americans Support Renewed U.S.-Cuba Relations
• Obama announces historic overhaul of relations; Cuba releases American
• Raul Castro: US Must Return Guantanamo for Normal Relations
2,691 comments are linked to the top 10 threads (Okapi BM-25 ranking).
Motivation
News story → pseudo search result (thread + linked comments) → diversified search result (a concise, diverse result list)
Data: Reddit
Subreddits (categories): Politics / World News
• The goal is to reduce the redundancy in the pseudo search result retrieved from Reddit comments for news stories and to create a concise, diversified search result.
Related Work
• Research on reflecting the ambiguity of a query in the retrieved results and reducing redundancy:
Implicit diversification methods reduce redundancy based on the dissimilarity of document content:
• Maximal Marginal Relevance (MMR) [4]
• BIR [6]
Explicit diversification methods explicitly model the aspects (topics, categories) of a query and consider which query aspects individual documents relate to:
• IA-Diversity [1] (user intention)
• xQuAD [2] (query reformulation)
• PM [3, 5] (proportional representation covering the query aspects)
Related Work
• Research focusing on summarising social media data due to its large volume:
• Sumblr: continuous summarization of evolving tweet streams. L. Shou, Z. Wang, K. Chen, and G. Chen. In SIGIR, 2013.
• Hierarchical multi-label classification of social text streams. Z. Ren, M.-H. Peetz, S. Liang, W. van Dolen, and M. de Rijke. In SIGIR, 2014.
• Summarizing web forum threads based on a latent topic propagation process. Z. Ren, J. Ma, S. Wang, and Y. Liu. In CIKM, 2011.
• Topic sentiment mixture: modeling facets and opinions in weblogs. Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. In WWW, 2007.
• Entity-centric topic-oriented opinion summarization in twitter. X. Meng, F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang. In KDD, 2012.
Related Work
No retrieval-diversification work has been done on unedited, coherent, short, tree-structured comments.
Solution Overview
Pipeline: comments retrieval → text pre-processing → topic modelling → diversification
Solution Overview
Pipeline: Elasticsearch system → text pre-processing → topic modelling → diversification
• Retrieval: top k threads and their linked comments, text-based scoring using Okapi BM-25
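A minimal sketch of this retrieval step, assuming a local Elasticsearch instance, the official Python client, and a hypothetical index reddit_threads with a title text field (index and field names are illustrative, not the thesis setup); BM25 is Elasticsearch's default similarity:

# Sketch: top-k thread retrieval for a news-story query (BM25 scoring).
# Assumes elasticsearch-py 8.x and a hypothetical index "reddit_threads".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top_k_threads(news_summary, k=10):
    resp = es.search(
        index="reddit_threads",                    # illustrative index name
        query={"match": {"title": news_summary}},  # scored with the default BM25 similarity
        size=k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

threads = top_k_threads("U.S. President Barack Obama announces the resumption "
                        "of normal relations between the U.S. and Cuba.")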
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k threads (and their linked comments)
• Text processing: text normalisation, named entity tagging, text representation (for better topical clustering), sentiment tagging
Text Pre-processing
• Remove URLs (using Twokenizer) and non-alphanumeric symbols; sentence tokenisation (NLTK sentence tokenizer)
• Sentiment analysis: VADER (rule-based sentiment tagger)
• Part-of-speech / named entity tagging: Senna tagger (neural-network-architecture-based tagger)
1. Duplicate named entities, because entity-based topic types are more frequent in social media.
2. Select words according to their Penn Treebank part-of-speech tags (following Centering Theory).
3. Lemmatise the selected words (NLTK lemmatiser).
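A rough sketch of these steps using off-the-shelf NLTK components (the thesis uses Twokenizer, the Senna tagger, and named-entity duplication, which are not reproduced here; the POS-tag selection and resource names below are illustrative assumptions):

# Sketch: simplified text pre-processing for a single comment.
# Requires: nltk.download("punkt"), ("averaged_perceptron_tagger"),
# ("wordnet"), ("vader_lexicon").
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

KEEP = {"NN", "NNS", "NNP", "NNPS", "JJ", "RB",
        "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}     # assumed Penn Treebank selection
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def preprocess(comment):
    text = re.sub(r"https?://\S+", " ", comment)     # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)      # remove non-alphanumeric symbols
    tokens = []
    for sent in nltk.sent_tokenize(text):            # sentence tokenisation
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            if tag in KEEP:                           # select words by POS tag
                tokens.append(lemmatizer.lemmatize(word.lower()))
    sentiment = vader.polarity_scores(comment)["compound"]  # rule-based sentiment score
    return tokens, sentiment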
Text Pre-processing
“The original article i read a couple months ago was in Der Spiegel and said nothing of a new or
alternative party, although its possible i forgot.”
['DER_SPIEGEL', 'DER_SPIEGEL', 'original', 'article', 'read', 'couple', 'month',
'ago', 'der', 'spiegel', 'said', 'nothing', 'new', 'alternative', 'party', 'possible',
'forgot', 'here', 'related', 'article']
"The *Titanic* has hit an iceberg - and takes on more passengers …P.S. Yeah, keep those downvotes
coming: they won't change reality, e.g. the unemployment figures in the Eurozone."
['TITANIC', 'TITANIC', 'EUROZONE', 'EUROZONE', 'titanic', 'ha', 'hit', 'iceberg',
'take', 'on', 'more', 'passenger', 'keep', 'downvotes', 'coming', 'won', 'change',
'reality', 'figure', 'eurozone']
Named Entity Named Entity word word word
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k graphs (thread + comments)
• Text processing: text normalisation, named entity tagging, text representation, sentiment tagging
• Topic modelling: topic extraction, topic tagging
Clustering
• There are many clustering and topic modelling techniques: k-means, hierarchical clustering, frequent itemset clustering, LDA, pLSA.
• Challenges in modelling topics for Reddit comments:
• Comments are short: a problem for k-means, hierarchical clustering, LDA.
• The number of topics is not known in advance: a problem for LDA, pLSA.
• Topical clusters are hard to interpret: a problem for LDA.
• Ungrammatical sentences and sentence fragments: relations from collapsed typed dependencies cannot be extracted accurately.
Topic Modelling
Clustering: Dirichlet Multinomial Mixture Model (DMM)

Notation:
V: number of words in the vocabulary
D: number of documents in the corpus
\bar{L}: average length of documents
\vec{d}: documents in the corpus
\vec{z}: cluster labels of each document
I: number of iterations
m_z: number of documents in cluster z
n_z: number of words in cluster z
n_z^w: number of occurrences of word w in cluster z
N_d: number of words in document d
N_d^w: number of occurrences of word w in document d

A document d is generated by (1) selecting a mixture component (cluster) k and (2) letting the selected cluster k generate d, so P(d) is the sum of the total probability over all mixture components:

P(d) = \sum_{k=1}^{K} P(d \mid z = k) \, P(z = k)

where K is the number of mixture components (clusters). DMM [41] makes the Naive Bayes assumption:
• the words in a document are generated independently when the document's cluster label k is known;
• the probability of a word is independent of its position within the document.

Hence the probability of document d being generated by cluster k is

P(d \mid z = k) = \prod_{w \in d} P(w \mid z = k)

It [41] assumes that each mixture component (cluster) is a multinomial distribution over words, with a Dirichlet prior:

P(w \mid z = k) = P(w \mid z = k, \Phi) = \phi_{k,w}, \quad \sum_{w=1}^{V} \phi_{k,w} = 1, \quad P(\Phi \mid \vec{\beta}) = \mathrm{Dir}(\vec{\phi}_k \mid \vec{\beta})

The weight of each mixture component (cluster) is sampled from a multinomial distribution, with a Dirichlet prior for that multinomial distribution:

P(z = k) = P(z = k \mid \Theta) = \theta_k, \quad \sum_{k=1}^{K} \theta_k = 1, \quad P(\Theta \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} \mid \vec{\alpha})

[Figure 1: graphical model of DMM, with plates over the D documents and K clusters, variables z and d, and hyper-parameters \alpha and \beta]

Collapsed Gibbs sampling for GSDMM is introduced in [59]: documents are randomly assigned to K clusters initially, and the cluster labels \vec{z}, the document counts m_z, the word counts n_z and the per-word counts n_z^w are recorded. The documents are then traversed for I iterations; in each iteration, each document d is reassigned to a cluster according to the conditional distribution

P(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}
Topic Modelling
Gibbs Sampling for the Dirichlet Multinomial Mixture Model (DMM)
"A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering", Jianhua Yin, Tsinghua University, Beijing, China.
• They introduce the collapsed Gibbs sampling algorithm (GSDMM) for DMM.
• In each iteration, a cluster is sampled for each document d according to

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{p(\vec{d}, \vec{z} \mid \vec{\alpha}, \vec{\beta})}{p(\vec{d}_{\neg d}, \vec{z}_{\neg d} \mid \vec{\alpha}, \vec{\beta})} \propto \frac{\Delta(\vec{m} + \vec{\alpha})}{\Delta(\vec{m}_{\neg d} + \vec{\alpha})} \cdot \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{n}_{z,\neg d} + \vec{\beta})}
\propto \frac{\Gamma(m_z + \alpha)}{\Gamma(m_{z,\neg d} + \alpha)} \cdot \frac{\Gamma(D - 1 + K\alpha)}{\Gamma(D + K\alpha)} \cdot \frac{\prod_{w \in d} \Gamma(n_z^w + \beta)}{\prod_{w \in d} \Gamma(n_{z,\neg d}^w + \beta)} \cdot \frac{\Gamma(n_{z,\neg d} + V\beta)}{\Gamma(n_z + V\beta)}

where m_z = m_{z,\neg d} + 1, n_z = n_{z,\neg d} + N_d, and the subscript \neg d means cluster z without document d. Because the Gamma function satisfies \Gamma(x + m) / \Gamma(x) = \prod_{i=1}^{m} (x + i - 1), this can be rewritten as

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \Gamma(n_z^w + \beta) / \Gamma(n_{z,\neg d}^w + \beta)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}

• If each word can appear at most once in a document, the numerator of the second part reduces to \prod_{w \in d} (n_{z,\neg d}^w + \beta), since n_z^w = n_{z,\neg d}^w + 1.
• If a word may appear multiple times in a document, it becomes \prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1), since n_z^w = n_{z,\neg d}^w + N_d^w, which gives the sampling formula shown on the previous slide.
Topic Modelling
Collapsed Gibbs Sampling Algorithm [9]
• Initialisation: assign every document to one of K clusters at random and record the counts m_z, n_z and n_z^w.
• In each iteration, sample a cluster for every document according to

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}

until the clusters are stable.

Meaning of alpha and beta:
• The first part of the formula relates \alpha to cluster popularity: clusters that already contain more documents are more likely to be chosen. \alpha is set to 0.1 for our task.
• The second part relates \beta to the similarity of interests: a document is more likely to join a cluster whose documents share more of its words. \beta is set to 0.2 for our task.
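A compact sketch of this sampler (a minimal GSDMM in the spirit of the algorithm above, not the thesis code; docs is a list of token lists, K is an upper bound on the number of clusters, and alpha = 0.1, beta = 0.2 as stated above):

# Sketch: collapsed Gibbs sampling for DMM (GSDMM) over short documents.
import math, random
from collections import Counter, defaultdict

def gsdmm(docs, K=40, alpha=0.1, beta=0.2, iterations=15):
    V = len({w for d in docs for w in d})              # vocabulary size
    m_z = [0] * K                                      # documents per cluster
    n_z = [0] * K                                      # words per cluster
    n_zw = [defaultdict(int) for _ in range(K)]        # word counts per cluster
    z_d = []

    def move(d, z, sign):                              # add (+1) or remove (-1) document d
        m_z[z] += sign
        n_z[z] += sign * len(d)
        for w in d:
            n_zw[z][w] += sign

    for d in docs:                                     # random initialisation
        z = random.randrange(K)
        z_d.append(z)
        move(d, z, +1)

    for _ in range(iterations):
        for i, d in enumerate(docs):
            move(d, z_d[i], -1)                        # exclude document i (the "not d" counts)
            log_p = []
            for z in range(K):
                lp = math.log(m_z[z] + alpha)          # cluster popularity (alpha part)
                for w, c in Counter(d).items():        # shared words (beta part)
                    for j in range(c):
                        lp += math.log(n_zw[z][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n_z[z] + V * beta + j)
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z_d[i] = random.choices(range(K), weights=weights)[0]
            move(d, z_d[i], +1)
    return z_d                                         # cluster label of each document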
Topic Modelling
Dirichlet Multinomial Mixture Model (DMM) with the collapsed Gibbs sampling algorithm:
• The number of clusters is inferred automatically.
• It balances the completeness and the homogeneity of the clusters.
• It converges fast.
• It copes with the sparse, high-dimensional nature of short texts.
• The representative words of each cluster (similar to pLSA and LDA) are the most frequent words in the cluster.
We get excellent results: the majority of the topic clusters can be easily interpreted from their representative words.
For the thread "Germany won the World Cup in Brazil", some comments are in Portuguese and German. Our topic modelling approach clusters these comments by language, so we get one cluster of comments in German and one in Portuguese.
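A small follow-up sketch for reading off the representative words, which here are simply the most frequent words per cluster (the function and variable names are illustrative):

# Sketch: top-n most frequent words of each cluster found by the sampler above.
from collections import Counter, defaultdict

def representative_words(docs, labels, top_n=10):
    counts = defaultdict(Counter)
    for doc, z in zip(docs, labels):
        counts[z].update(doc)
    return {z: [w for w, _ in c.most_common(top_n)] for z, c in counts.items()}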
Topic Modelling
Example:
Query: December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
2,691 comments from the top 10 threads (Rank 1 … Rank K).
27 topical clusters are extracted (Topic 1, Topic 2, Topic 3, …, Topic k).
Topic Modelling
Example: the top 10 retrieved threads
• Obama announces historic overhaul of relations; Cuba releases American
• Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties
• Most Americans Support Renewed U.S.-Cuba Relations
• Raul Castro: US Must Return Guantanamo for Normal Relations
• Russian foreign minister praises new U.S.-Cuba relations
• U.S. Approves Ferry Service Between Florida and Cuba
• US, Cuba restore full diplomatic relations after 54 years
• President Barack Obama announced Wednesday that the U.S. and Cuba will reopen their embassies in Havana and Washington, heralding a "new chapter" in relations after a half-century of hostility.
• Raul Castro: U.S. must return Guantanamo for normal relations
• U.S. Takes Cuba off Terror List, Paving the Way for Normal Ties
Although the top 10 threads are all about Cuba and the U.S., more relevant topics are discovered in their linked comments.
Topic Modelling
Clustering
27 topical clusters were extracted from the 2,691 comments for the query:

Topic Index | Number of Comments | Topic Words (top 10 most frequent words)
8  | 18  | war libya utf partagas haiti 69i god somalia isil pakistan
12 | 7   | cuban statistic government cuba un mean have independent number ha
11 | 21  | mexico cuba gulf gitmo america gtmo navy panamacanal small control
10 | 22  | ftfy nixon cheney nato un germany lincoln still facebook republican
13 | 57  | tropico isi terrorist just know order have people cia drone
38 | 218 | cuba america cia germany turkey have soviet japan castro war
15 | 101 | russia ukraine america cuba russian crimea have american eu state
14 | 240 | cuban cigar cuba have tobacco people nicaragua so just dominican
17 | 10  | southafrica angola cuba south africa mozambique death get un leonardcohen
18 | 155 | cuba america canada country american mexico list china ha saudi
30 | 6   | nonmobile pollo jadehelm15 feedback mobile please counter bot non link
37 | 530 | have just re people think american thing up that so
32 | 416 | cuba cuban guantanamo american government have us castro lease country

More relevant topics in their linked comments are discovered.
Topic Modelling
Clustering (continued)

Topic Index | Number of Comments | Topic Words (top 10 most frequent words, from left to right)
25 | 94  | castro usa cuba florida soviet cuban nuke don venezuela wanted
26 | 1   | nigelthornberry
27 | 79  | gitmo guantanamo iraq wto iran naval base obama american china
46 | 729 | cuba cuban american have people obama relation country castro
45 | 8   | abbott texas voter alec id voteridlaw name paulweyrich heritageinstitute co
42 | 43  | lincoln washington roosevelt congress had mandela term unitedstate newburgh
41 | 3   | michigan ohio toledo nwo upper had state won peninsula bet
1  | 165 | obama congress republican democrat have clinton bernie that bush iowa
0  | 5   | republican cost want higher dc job aca highly tax people
3  | 23  | texas woman mexico healthcare republican mmr ha have rate mean
5  | 21  | unsc cuba us un padron iraq cigar uncharter ha charter
7  | 32  | cuba us cuban spanish latinamerica law treaty american spanishempire america
6  | 30  | turkey armenian turk have armenia israel havana just people up
9  | 4   | erdogan ataturk turk hitler mhp chp kurd turkey kurdish election
Solution Overview
Pipeline: Elasticsearch system → text processing → topic modelling → diversification
• Retrieval: top k threads
• Text processing: text normalisation, named entity tagging, text representation, sentiment tagging
• Topic modelling: topic extraction, topic tagging
• Diversification: Sainte-Laguë method, comments tree decomposition
Diversification - Sainte-Laguë Method
• Label the comments of each thread with their topic; each thread then has n topic clusters of comments (topic 1, topic 2, topic 3, …).
• Label every comment with a sentiment label (Positive / Neutral / Negative), or with an emotional label if emotion modelling works.
• Within every thread, the comments are thus grouped into topic-sentiment clusters.
Diversification - Sainte-Laguë Method
The Sainte-Laguë (SL) method [38] is a highest-quotient method for allocating seats in party-list proportional representation, used in many voting systems. We use it to retrieve representative comments proportionally from the clusters. After all the comments have been tallied, successive quotients are computed for each cluster:

quotient = V / (2S + 1)

where
• V is the total number of comments in the cluster;
• S is the number of 'seats' that cluster has been allocated so far, initially 0 for all clusters.

Whichever cluster has the highest quotient gets the next 'seat', and its quotient is recalculated given its new 'seat' total. The process is repeated until all 'seats' have been allocated. The number of 'seats' is a hyper-parameter and can be set according to users' interests.
Diversification - Sainte-Laguë Method
Example:
Five comments are to be retrieved (the number of 'seats' is 5). The denominators in the header row are 2S + 1 for S = 0, 1, 2, …; the quotients marked with '*' represent the allocated 'seats'.

Cluster          | /1  | /3     | /5 | Seats (*)
topic A positive | 50* | 16.67* | 10 | 2
topic A neutral  | 40* | 13.33* | 8  | 2
topic A negative | 30* | 10     | 6  | 1
Table 2. Sainte-Laguë method example

So for this example, 2 comments from the cluster "topic A positive", 2 comments from "topic A neutral", and 1 comment from "topic A negative" are retrieved. We then retrieve n comments proportionally from the diverse clusters to form a result that is concise and diverse (Figure 10).
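A minimal sketch of the allocation loop, reproducing the example above (the function name is illustrative; cluster_sizes maps a cluster to V, its number of comments):

# Sketch: Sainte-Laguë seat allocation over topic-sentiment clusters.
def sainte_lague(cluster_sizes, seats):
    allocated = {c: 0 for c in cluster_sizes}          # S, initially 0 for all clusters
    for _ in range(seats):
        # quotient = V / (2S + 1); the highest quotient wins the next seat
        winner = max(cluster_sizes,
                     key=lambda c: cluster_sizes[c] / (2 * allocated[c] + 1))
        allocated[winner] += 1
    return allocated

# The example above: 5 seats over the three "topic A" clusters
print(sainte_lague({"topic A positive": 50,
                    "topic A neutral": 40,
                    "topic A negative": 30}, 5))
# -> {'topic A positive': 2, 'topic A neutral': 2, 'topic A negative': 1}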
Diversification - Sainte-Laguë Method
We apply the SL method to the clusters at every rank; comments with a higher user score are selected first.

retrieval number = min(γ · N_{cl,rj}, |C_{i,rj}|)

• N_{rj} is the number of retrieved comments at rank r_j (r_j ∈ R);
• γ is a positive constant controlling the retrieval scale;
• |C_{i,rj}| is the number of comments C_i at rank r_j.
• The representative comments are then retrieved proportionally from the topic-sentiment clusters at every rank.
Sainte-Laguë Method
[Figure: pseudo search result (threads and comments at rank 1, rank 2, rank 3, …) vs. the diversified search result]
Comments Tree Decomposition
Comments tree (for a thread at a given rank):
• Levels (shown in different colours) represent the coherence:
• a comment at a lower level is a reply to the one at the higher level;
• comments at the same level are independent.
Comments Tree Decomposition
Set the decomposition level at 1.
Comments Tree Decomposition
[Figure: example comment tree with comments at levels 0 to 4]
Enumerate the paths from level 0 to level m.
Comments Tree Decomposition
[Figure: the enumerated paths of the example tree give sub-tree 1, sub-tree 2, sub-tree 3 and sub-tree 4]
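A minimal sketch of one plausible reading of the decomposition, where every sub-tree is the path from the root comment (level 0) down to a leaf, cut off at the decomposition level (the (comment, children) node representation and the function name are assumptions of this sketch):

# Sketch: decompose one comment tree into root-to-leaf paths up to `level`.
def decompose(node, level=5, depth=0, prefix=None):
    comment, children = node                   # node = (comment, list of child nodes)
    path = (prefix or []) + [comment]
    if not children or depth == level:         # leaf reached or decomposition level hit
        return [path]
    subtrees = []
    for child in children:
        subtrees.extend(decompose(child, level, depth + 1, path))
    return subtrees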
Example:
Query: December 17, 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba.
• When the decomposition level is set at 5, the number of sub-trees for each retrieved thread is as follows:

Thread                                      | 1 | 2  | 3  | 4 | 5  | 6   | 7   | 8 | 9   | 10
Number of trees (per thread)                | 1 | 8  | 8  | 3 | 15 | 49  | 90  | 2 | 127 | 11
Number of decomposed sub-trees (per thread) | 1 | 19 | 17 | 3 | 35 | 165 | 342 | 2 | 625 | 11
Comments Tree Decomposition
[Figure: sub-tree 1 vs. sub-tree 2]
Select one sub-tree according to the sub-tree score.
Comments Tree Decomposition
Sub-tree scoring criteria:
• Comment score: each comment has a score given by users; the sub-tree score is the sum of the user scores of the comments in the sub-tree.
• Linguistic features: score the sub-tree by the diversity of the linguistic features of its comments; the linguistic features we propose are NP words (words that can potentially form noun phrases), named entities, and bigrams.
• Number of topics: the diversity of the topic tags of the comments in the sub-tree.
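A short sketch of the sub-tree selection using these three criteria, assuming each comment is a dict with a user "score" and a "topic" tag and that a features(comment) helper (NP words, named entities or bigrams) exists; all of these names are illustrative:

# Sketch: score decomposed sub-trees and keep the best one per original tree.
def user_score(subtree):
    return sum(c["score"] for c in subtree)                  # sum of user scores

def feature_diversity(subtree, features):
    return len({f for c in subtree for f in features(c)})    # distinct linguistic features

def topic_diversity(subtree):
    return len({c["topic"] for c in subtree})                # distinct topic tags

def best_subtree(subtrees, scorer=user_score):
    return max(subtrees, key=scorer)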
Experiment Setup
Data: 26,669,242 Reddit comments and 845,004 threads from the years 2008-2015
Subreddits: worldnews / politics
Queries: 50 news summaries from Wikinews, 2011-2014
Ranking: Elasticsearch with Okapi BM-25 scoring; we choose the top 10 threads and their linked comments; on average 4,330.7 comments are retrieved per query.
Experiment Evaluation
• We use Cumulative Gain (CG) to measure diversity; CG also penalises redundancy.
• Charles L. A. Clarke claims [7] that CG at rank k can be used directly as a diversity evaluation measure.
• Al-Maskari [8] provides evidence that CG correlates better with user satisfaction than Normalised Discounted Cumulative Gain (nDCG).
Experiment Evaluation

CG[k] = \sum_{j=1}^{k} G[j]

G[k] = \sum_{i=1}^{m} J(d_k, i) (1 - \alpha)^{r_{i,k-1}}

where
• m is the number of nuggets;
• J(d, i) = 1 if comment d contains nugget n_i, otherwise J(d, i) = 0;
• r_{i,k-1} = \sum_{j=1}^{k-1} J(d_j, i) is the number of comments d_j ranked up to position k - 1 that contain nugget n_i;
• \alpha is a constant with 0 < \alpha \le 1 reflecting the possibility of assessor error; we set \alpha to 0.5 in our experiment;
• k is set to 10 because we choose the top 10 threads and their linked comments.

A comment receives a positive sentiment tag when its score is above 0.1, a negative tag when it is between -0.1 and -1, or a neutral tag when it falls between 0.2 and -0.2.

Experiment for the Sainte-Laguë (SL) method: we set γ = 2.5 to compute the number of retrieved comments for each of the top 10 threads. We use the topic-sentiment tag (the combination of the topic and sentiment tags) as the nugget, and compute CG for the pseudo search result and for the diversified search result obtained with the SL method for each query; the average CG over the 50 queries is reported in Table 3 below.
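A small sketch of the CG computation used here, assuming each ranked comment is represented by the set of topic-sentiment nuggets it contains (this representation and the function name are ours, not the thesis code):

# Sketch: CG[k] with redundancy penalty (1 - alpha)^r for repeated nuggets.
def cumulative_gain(ranked_nuggets, alpha=0.5, k=10):
    seen = {}                                   # r_i: how often nugget i appeared so far
    cg = 0.0
    for d in ranked_nuggets[:k]:
        for nugget in d:                        # J(d, i) = 1 for every nugget d contains
            cg += (1 - alpha) ** seen.get(nugget, 0)
            seen[nugget] = seen.get(nugget, 0) + 1
    return cg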
Experiment Evaluation

Table 3. Sainte-Laguë method experiment result
retrieval result           | CG            | retrieval percentage
diversified result with SL | 71.80 ± 44.31 | 16.60%
pseudo search result       | 50.37 ± 27.29 | 100%

Table 4. Experiment with the CTD method
retrieval result     | CG            | retrieval percentage
CTD comment score    | 27.11 ± 11.19 | 70.51%
CTD NP words         | 27.31 ± 10.81 | 70.67%
CTD named entities   | 27.54 ± 11.39 | 59.17%
CTD bigrams          | 26.60 ± 10.63 | 70.77%
CTD number of topics | 28.45 ± 11.72 | 73.26%
pseudo search result | 26.38 ± 9.8   | 100%
Sainte-Laguë Method
• The SL method shows that the diversified search results achieve a tremendous improvement in diversity while using, on average, only 16.60% of the comments from the pseudo search result.
• The increase in diversity is foreseeable, because comments are retrieved directly and proportionally from the topic-sentiment clusters. The SL method proves effective, at the expense of the coherence of the discussion.
Experiment Evaluation
Comment Tree Decomposition
• CTD also demonstrates its effectiveness in reducing redundancy and improving the diversity of the pseudo search results while using fewer comments.
• Small-scale comment trees also maintain the coherence and conversational style of the discussion.
Conclusion
• We proposed novel methods to distil diverse, interpretable topics from the pseudo search result, using a topic model together with effective text processing.
• We studied the characteristics of Reddit comments.
• We introduced two diversification methods, the Sainte-Laguë (SL) method and the Comment Tree Decomposition (CTD) method, to reduce redundancy and diversify the returned results.
• According to the experimental results, both methods prove to be effective diversification techniques. The SL method treats comments as individual entities, while CTD preserves the conversational style of the discussions.
References
[1] R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong. Diversifying Search Results. WSDM 2009.
[2] R. L. T. Santos, C. Macdonald, I. Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010.
[3] Van Dang and W. Bruce Croft. Diversity by Proportionality: An Election-based Approach to Search Result Diversification. SIGIR 2012.
[4] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
[5] Van Dang and W. Bruce Croft. Term Level Search Result Diversification. SIGIR 2013.
[6] C. Zhai, W. W. Cohen, J. Lafferty. Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.
Thank you !
Dankeschön!
谢谢你们!
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
LDA on social bookmarking systems
LDA on social bookmarking systemsLDA on social bookmarking systems
LDA on social bookmarking systems
 
Collaborative DL
Collaborative DLCollaborative DL
Collaborative DL
 

Recently uploaded

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 

Diversified Social Media Retrieval for News Stories

  • 12. Related Work • Research focusing on summarizing social media data due to the large volume : • Continuous summarization of evolving tweet streams. L. Shou, Z. Wang, K. Chen, and G. Chen. SumblrIn SIGIR, 2013. • Hierarchical multi-label classification of social text streams. Z. Ren, M.- H. Peetz, S. Liang, W. van Dolen, and In SIGIR, 2014. • Summarizing web forum threads based on a latent topic propagation process. Z. Ren, J. Ma, S. Wang, and Y. Liu. In CIKM, 2011 • Topic sentiment mixture: modeling facets and opinions in weblogs. Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. In WWW, 2007. • Entity-centric topic-oriented opinion summarization in twitter. X. Meng, F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang. In KDD, 2012.
  • 13. Related Work. No retrieval diversification work has been done on unedited, coherent, short, tree-structured comments.
  • 15. Solution Overview elastic search system text pre-processing topic modelling diversification • Top k threads Text-based scoring using Okapi BM-25 (linked comments)
  • 16. Solution Overview elastic search system text processing topic modelling diversification • Top k threads • Text Normalisation • Named Entity Tagging • Text Representation (for better topical clustering) • Sentiment Tagging (linked comments)
  • 17. Text Pre-processing. • Remove URLs (using Twokenizer) and non-alphanumeric symbols; sentence tokenisation (NLTK sentence tokenizer). • Sentiment analysis with VADER (a rule-based sentiment tagger). • Part-of-speech / named entity tagging with the Senna tagger (a neural-network-architecture-based tagger). 1. Duplicate named entities, because entity-based topic types are more common in social media. 2. Select words according to their Penn Treebank part-of-speech tags (following Centering Theory). 3. Lemmatise the selected words (NLTK lemmatiser).
  • 19. Text Pre-processing. Example 1: “The original article i read a couple months ago was in Der Spiegel and said nothing of a new or alternative party, although its possible i forgot.” → ['DER_SPIEGEL', 'DER_SPIEGEL', 'original', 'article', 'read', 'couple', 'month', 'ago', 'der', 'spiegel', 'said', 'nothing', 'new', 'alternative', 'party', 'possible', 'forgot', 'here', 'related', 'article']. Example 2: "The *Titanic* has hit an iceberg - and takes on more passengers …P.S. Yeah, keep those downvotes coming: they won't change reality, e.g. the unemployment figures in the Eurozone." → ['TITANIC', 'TITANIC', 'EUROZONE', 'EUROZONE', 'titanic', 'ha', 'hit', 'iceberg', 'take', 'on', 'more', 'passenger', 'keep', 'downvotes', 'coming', 'won', 'change', 'reality', 'figure', 'eurozone']. (Duplicated named entities are shown in upper case; the remaining tokens are selected and lemmatised words.)
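To make the pipeline concrete, here is a minimal sketch of the pre-processing steps using plain NLTK in place of the Twokenizer/Senna setup from the thesis; the POS filter, sentiment thresholds and entity-duplication factor shown here are assumptions, not the configuration used in the experiments.

# Requires the NLTK data packages: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words, wordnet, vader_lexicon.
import re
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

KEEP_TAGS = ("NN", "VB", "JJ", "RB")        # assumed Penn Treebank prefixes to keep
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def preprocess(comment):
    text = re.sub(r"https?://\S+", " ", comment)       # remove URLs
    text = re.sub(r"[^0-9A-Za-z'\s]", " ", text)        # remove non-alphanumeric symbols
    tokens = word_tokenize(text)

    # sentiment tag from the compound VADER score (thresholds are assumptions)
    score = vader.polarity_scores(comment)["compound"]
    sentiment = "positive" if score > 0.2 else "negative" if score < -0.2 else "neutral"

    # duplicate named entities so entity-based topics carry more weight
    entities = []
    for node in ne_chunk(pos_tag(tokens)):
        if hasattr(node, "label"):                       # subtrees are named entities
            entities.append("_".join(w for w, _ in node.leaves()).upper())
    bag = entities * 2

    # keep content words by POS prefix and lemmatise them
    for word, tag in pos_tag(tokens):
        if tag.startswith(KEEP_TAGS):
            bag.append(lemmatizer.lemmatize(word.lower()))
    return bag, sentiment

print(preprocess("The original article I read was in Der Spiegel."))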
  • 20. Solution Overview elastic search system text processing topic modelling diversification • Top k graphs • Topic Tagging • Topic Extraction • Text Normalisation • Named Entity Tagging • Text Representation • Sentiment Tagging (thread+comments)
  • 21. Clustering / Topic Modelling. • There are many clustering and topic modelling techniques: k-means, hierarchical clustering, frequent set clustering, LDA, pLSA. • Challenges in modelling topics for Reddit comments (and the techniques they affect): comments are short (k-means, hierarchical clustering, LDA); the number of topics is unpredictable (LDA, pLSA); topical clusters are hard to interpret (LDA); sentences are ungrammatical and fragmentary, so relations from collapsed typed dependencies cannot be accurately extracted.
  • 22. Clustering / Topic Modelling. Dirichlet Multinomial Mixture Model (DMM). The probability of a document is the sum of the total probability over all mixture components: P(d) = \sum_{k=1}^{K} P(d \mid z = k)\, P(z = k), where K is the number of mixture components (clusters). The model [41] assumes that the words in a document are generated independently once the document's cluster label k is known, and that the probability of a word is independent of its position within the document. Each mixture component (cluster) is a multinomial distribution over words with a Dirichlet prior: P(w \mid z = k, \Phi) = \phi_{k,w} with \sum_{w=1}^{V} \phi_{k,w} = 1 and P(\Phi \mid \vec{\beta}) = \mathrm{Dir}(\vec{\phi}_k \mid \vec{\beta}). The cluster weights are likewise multinomial with a Dirichlet prior: P(z = k \mid \Theta) = \theta_k with \sum_{k=1}^{K} \theta_k = 1 and P(\Theta \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} \mid \vec{\alpha}). Generative process: 1. select a mixture component (cluster) k; 2. the selected mixture component (cluster) k generates document d.
  • 23. Clustering / Topic Modelling. DMM notation (from the graphical model): V = number of words in the vocabulary; D = number of documents in the corpus; \bar{L} = average length of documents; \vec{d} = documents in the corpus; \vec{z} = cluster labels of each document; I = number of iterations; m_z = number of documents in cluster z; n_z = number of words in cluster z; n_z^w = number of occurrences of word w in cluster z; N_d = number of words in document d; N_d^w = number of occurrences of word w in document d.
  • 24. Topic Modelling. Gibbs Sampling for the Dirichlet Multinomial Mixture Model (GSDMM): "A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering", Jianhua Yin, Tsinghua University, Beijing, China. The paper introduces a collapsed Gibbs sampling algorithm for DMM [59]: documents are randomly assigned to K clusters initially, and the statistics \vec{z}, m_z, n_z and n_z^w are recorded. In each iteration, a cluster is sampled for every document d according to
    p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} (n_{z,\neg d}^w + \beta + j - 1)}{\prod_{i=1}^{N_d} (n_{z,\neg d} + V\beta + i - 1)}
  • 25. Topic Modelling. The subscript \neg d denotes the statistics of cluster z computed without document d: the first factor favours clusters that already hold many documents, the second favours clusters whose documents share many words with d.
  • 26. Topic Modelling. When each word is assumed to appear at most once per document, the word-fit factor simplifies to \prod_{w \in d} (n_{z,\neg d}^w + \beta); the form above is the general multi-occurrence version (equation 4 in the paper).
  • 27. Topic Modelling. Collapsed Gibbs Sampling Algorithm [9]: in each iteration, a cluster is sampled for every document according to the conditional above, until the clusters are stable. \alpha relates to cluster popularity (set at 0.1 for our task); \beta relates to the attraction of clusters with similar word interests (set at 0.2 for our task).
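The sampler is compact enough to sketch directly. The following is a minimal, illustrative implementation of the collapsed Gibbs sampler for DMM using the conditional above; the hyper-parameter defaults, the toy corpus and all names are assumptions, not the thesis code, and no numerical safeguards (e.g. log-space computation for long documents) are included.

import random
from collections import Counter, defaultdict

def gsdmm(docs, K=40, alpha=0.1, beta=0.2, iters=30, seed=0):
    """docs: list of token lists. Returns one cluster label per document."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    D = len(docs)

    z = [0] * D                                     # cluster label of each document
    m_z = [0] * K                                   # documents in cluster z
    n_z = [0] * K                                   # words in cluster z
    n_zw = [defaultdict(int) for _ in range(K)]     # occurrences of word w in cluster z

    def move(d, k, sign):
        m_z[k] += sign
        n_z[k] += sign * len(docs[d])
        for w in docs[d]:
            n_zw[k][w] += sign

    for d in range(D):                              # random initialisation
        z[d] = rng.randrange(K)
        move(d, z[d], +1)

    for _ in range(iters):
        for d in range(D):
            move(d, z[d], -1)                       # statistics without document d
            counts = Counter(docs[d])
            weights = []
            for k in range(K):
                p = (m_z[k] + alpha) / (D - 1 + K * alpha)    # cluster popularity
                for w, c in counts.items():                   # word fit, multi-occurrence form
                    for j in range(1, c + 1):
                        p *= n_zw[k][w] + beta + j - 1
                for i in range(1, len(docs[d]) + 1):
                    p /= n_z[k] + V * beta + i - 1
                weights.append(p)
            r = rng.random() * sum(weights)
            for k, wgt in enumerate(weights):                 # sample the new cluster
                r -= wgt
                if r <= 0:
                    break
            z[d] = k
            move(d, k, +1)
    return z

toy = [["cuba", "embargo", "obama"], ["cuba", "castro", "cigar"],
       ["cigar", "tobacco", "cuba"], ["obama", "congress", "republican"]]
print(gsdmm(toy, K=4, iters=20))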
  • 28. Topic Modelling. The Dirichlet Multinomial Mixture model (DMM) with the collapsed Gibbs sampling algorithm: • the number of clusters is inferred automatically; • it balances the completeness and homogeneity of the clusters; • it converges fast; • it copes with the sparsity and high dimensionality of short texts; • the representative words of each cluster (similar to pLSA and LDA) are the most frequent words in that cluster. We obtain excellent results: the majority of the topic clusters can be easily interpreted from their representative words. For the thread "Germany won the World Cup in Brazil", some comments are in Portuguese and German, and our topic modelling approach clusters the comments by language, giving one cluster of German comments and one of Portuguese comments.
  • 29. Topic Modelling Example. Query: December 17 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba. From the 2691 comments linked to the top 10 threads, 27 topical clusters are extracted (diagram: threads at ranks 1 to K mapped to Topic 1 ... Topic k).
  • 30. Topic Modeling • Obama announces historic overhaul of relations; Cuba releases American • Cuba Wants Off U.S. Terrorism List Before Restoring Normal Ties • Most Americans Support Renewed U.S.-Cuba Relations • Raul Castro: US Must Return Guantanamo for Normal Relations • Russian foreign minister praises new U.S.-Cuba relations • U.S. Approves Ferry Service Between Florida and Cuba • US, Cuba restore full diplomatic relations after 54 years • President Barack Obama announced Wednesday that the U.S. and Cuba will reopen their embassies in Havana and Washington, heralding a "new chapter" in relations after a half-century of hostility. • Raul Castro: U.S. must return Guantanamo for normal relations • U.S. Takes Cuba off Terror List, Paving the Way for Normal Ties Although the Top 10 threads are all about Cuba and the US Topic Modelling
  • 31. Clustering (Topic Modelling). 27 topical clusters extracted from 2691 comments for the query; more relevant topics in the linked comments are discovered. Columns: Topic Index | Number of Comments | Topic Words (top 10 most frequent words):
    8  | 18  | war libya utf partagas haiti 69i god somalia isil pakistan
    12 | 7   | cuban statistic government cuba un mean have independent number ha
    11 | 21  | mexico cuba gulf gitmo america gtmo navy panamacanal small control
    10 | 22  | ftfy nixon cheney nato un germany lincoln still facebook republican
    13 | 57  | tropico isi terrorist just know order have people cia drone
    38 | 218 | cuba america cia germany turkey have soviet japan castro war
    15 | 101 | russia ukraine america cuba russian crimea have american eu state
    14 | 240 | cuban cigar cuba have tobacco people nicaragua so just dominican
    17 | 10  | southafrica angola cuba south africa mozambique death get un leonardcohen
    18 | 155 | cuba america canada country american mexico list china ha saudi
    30 | 6   | nonmobile pollo jadehelm15 feedback mobile please counter bot non link
    37 | 530 | have just re people think american thing up that so
    32 | 416 | cuba cuban guantanamo american government have us castro lease country
  • 32. Clustering (Topic Modelling), continued. Columns: Topic Index | Number of Comments | Topic Words (top 10 most frequent words, left to right):
    25 | 94  | castro usa cuba florida soviet cuban nuke don venezuela wanted
    26 | 1   | nigelthornberry
    27 | 79  | gitmo guantanamo iraq wto iran naval base obama american china
    46 | 729 | cuba cuban american have people obama relation country castro
    45 | 8   | abbott texas voter alec id voteridlaw name paulweyrich heritageinstitute co
    42 | 43  | lincoln washington roosevelt congress had mandela term unitedstate newburgh
    41 | 3   | michigan ohio toledo nwo upper had state won peninsula bet
    1  | 165 | obama congress republican democrat have clinton bernie that bush iowa
    0  | 5   | republican cost want higher dc job aca highly tax people
    3  | 23  | texas woman mexico healthcare republican mmr ha have rate mean
    5  | 21  | unsc cuba us un padron iraq cigar uncharter ha charter
    7  | 32  | cuba us cuban spanish latinamerica law treaty american spanishempire america
    6  | 30  | turkey armenian turk have armenia israel havana just people up
    9  | 4   | erdogan ataturk turk hitler mhp chp kurd turkey kurdish election
  • 33. Solution Overview elastic search system text processing topic modelling diversification • Top k threads • Topic Tagging • Topic Extraction • Text Normalisation • Named Entity Tagging • Text Representation • Sentiment Tagging • Sainte Laguë Method • Comments Tree Decomposition
  • 34. Diversification - Sainte Laguë Method. Label the comments from each thread with their topic cluster; each thread then has n topic clusters of comments (diagram: topic 1, topic 2, topic 3).
  • 35. Diversification - Sainte Laguë Method. Label every comment with a sentiment label, or an emotional label if emotion modelling works: positive, neutral or negative.
  • 36. Diversification - Sainte Laguë Method. (Diagram: the comments of each ranked thread split into topic-sentiment clusters.)
  • 38. Diversification - Sainte Laguë Method. We propose the Sainte-Laguë (SL) method to diversify the search result by retrieving the representative comments proportionally from L_{r_j}. The SL method [38] is a highest-quotient method for allocating seats in party-list proportional representation, used in many voting systems. After all the comments have been tallied, successive quotients are computed for each cluster: quotient = V / (2S + 1) (equation 11), where V is the total number of comments in the cluster and S is the number of 'seats' that cluster has been allocated so far, initially 0 for all clusters. Whichever cluster has the highest quotient gets the next 'seat' allocated, and its quotient is recalculated given its new 'seat' total. The process is repeated until all 'seats' have been allocated. The number of 'seats' is a hyper-parameter and can be set according to users' interests.
  • 39. Diversification - Sainte Laguë Method. Example: five comments are expected to be retrieved (the number of 'seats' is 5). The denominators in the first row are computed as 2S + 1 for S = 0, 1, 2, ...; the quotients marked with '*' represent allocated 'seats'.
    cluster          | /1  | /3     | /5 | seats
    topic A positive | 50* | 16.67* | 10 | 2
    topic A neutral  | 40* | 13.33* | 8  | 2
    topic A negative | 30* | 10     | 6  | 1
    So 2 comments are retrieved from cluster 'topic A positive', 2 from 'topic A neutral' and 1 from 'topic A negative'.
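As an illustration, here is a minimal sketch of the highest-quotient allocation that reproduces the example above; the function name and the dict-based interface are assumptions for the sketch, not the thesis implementation.

def sainte_lague(cluster_sizes, seats):
    """cluster_sizes: {cluster: number of comments}. Returns allocated 'seats' per cluster."""
    allocated = {c: 0 for c in cluster_sizes}
    for _ in range(seats):
        # quotient = V / (2S + 1); the cluster with the largest quotient wins the next seat
        winner = max(cluster_sizes,
                     key=lambda c: cluster_sizes[c] / (2 * allocated[c] + 1))
        allocated[winner] += 1
    return allocated

print(sainte_lague({"topic A positive": 50, "topic A neutral": 40, "topic A negative": 30}, 5))
# {'topic A positive': 2, 'topic A neutral': 2, 'topic A negative': 1}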
  • 40. Diversification - Sainte Laguë Method. We apply the SL method to the clusters of all ranks; comments with a higher user score are selected first. retrieval number = min(γ · N_{cl,r_j}, |C_{i,r_j}|), where N_{r_j} is the number of retrieved comments at rank r_j (r_j ∈ R), γ is a positive constant controlling the retrieval scale, and |C_{i,r_j}| is the number of comments in cluster C_i at rank r_j. The representative comments are then retrieved from each topic-sentiment cluster of all ranks proportionally.
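Read literally, the retrieval-scale rule above can be sketched as follows; the function and field names are illustrative assumptions.

def retrieval_number(gamma, n_rj, cluster_size):
    # min(gamma * N_rj, |C_i,rj|): capped by the retrieval scale and by the cluster size
    return min(int(gamma * n_rj), cluster_size)

def select_from_cluster(comments, gamma, n_rj):
    # comments with a higher user score are selected first
    k = retrieval_number(gamma, n_rj, len(comments))
    return sorted(comments, key=lambda c: c["score"], reverse=True)[:k]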
  • 42. (Diagram: the pseudo search result at ranks 1, 2 and 3 is reduced to a diversified search result.)
  • 43. Comments Tree Decomposition. A thread's comments form a tree. • Levels (shown with different colours) represent the coherence: • a comment at a lower level is a reply to the one at the higher level; • comments at the same level are independent.
  • 44. Comments Tree Decomposition. Set the decomposition level at 1.
  • 45. Comments Tree Decomposition. Enumerate the paths from level 0 down to level m (diagram: comments at levels 0 to 4).
  • 46.-49. Comments Tree Decomposition. (Diagrams: the enumerated paths form sub-tree 1, sub-tree 2, sub-tree 3 and sub-tree 4.)
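One plausible reading of the decomposition sketched in slides 44-49 is to enumerate, below the chosen decomposition level, every path down to a leaf as one sub-tree; the Comment class and the exact rule here are assumptions for illustration, not the thesis implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    text: str
    score: int = 0
    topic: int = -1
    children: List["Comment"] = field(default_factory=list)

def decompose(root: Comment, level: int = 1):
    """Return the sub-trees, read off as root-to-leaf paths below the decomposition level."""
    # collect the comments sitting at the decomposition level
    frontier, depth = [root], 0
    while depth < level and frontier:
        frontier = [c for node in frontier for c in node.children]
        depth += 1

    def paths(node, prefix):
        prefix = prefix + [node]
        if not node.children:                     # a leaf closes one sub-tree (path)
            yield prefix
        for child in node.children:
            yield from paths(child, prefix)

    return [p for start in frontier for p in paths(start, [])]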
  • 50. Comments Tree Decomposition. Example. Query: December 17 2014 – U.S. President Barack Obama announces the resumption of normal relations between the U.S. and Cuba. When the decomposition level is set at 5, the number of sub-trees for each retrieved thread is as follows:
    thread                                      | 1 | 2  | 3  | 4 | 5  | 6   | 7   | 8 | 9   | 10
    number of trees (per thread)                | 1 | 8  | 8  | 3 | 15 | 49  | 90  | 2 | 127 | 11
    number of decomposed sub-trees (per thread) | 1 | 19 | 17 | 3 | 35 | 165 | 342 | 2 | 625 | 11
  • 51. Comments Tree Decomposition. Select one sub-tree according to the sub-tree score (diagram: Sub-Tree 1 vs. Sub-Tree 2).
  • 52. Comments Tree Decomposition. • Comment Score: each comment has a score given by users; the sub-tree score is the sum of the user scores of the comments in the sub-tree. • Linguistic Features: score the sub-tree by the diversity of the linguistic features of the comments in the sub-tree; the linguistic features we propose are NP words (words that can potentially form noun phrases), named entities and bigrams. • Number of Topics: the diversity of the topic tags of the comments in the sub-tree.
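Minimal sketches of the three sub-tree scores above, assuming a sub-tree is a list of comment objects carrying text, score and topic attributes (as in the decomposition sketch); the bigram extraction is deliberately simplified and only illustrative.

def comment_score(subtree):
    return sum(c.score for c in subtree)          # sum of the user scores in the sub-tree

def topic_diversity(subtree):
    return len({c.topic for c in subtree})        # number of distinct topic tags

def bigram_diversity(subtree):
    bigrams = set()
    for c in subtree:
        toks = c.text.lower().split()
        bigrams.update(zip(toks, toks[1:]))       # diversity of one linguistic feature
    return len(bigrams)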
  • 53. Experiment Setup. Data: 26,669,242 Reddit comments and 845,004 threads from the years 2008-2015. Sub-reddits: worldnews / politics. Queries: 50 news summaries from Wikinews, 2011-2014. Ranking: Elasticsearch with Okapi BM-25 scoring; we choose the top 10 threads and their linked comments, which yields 4330.7 retrieved comments per query on average.
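For orientation, the retrieval step might look roughly like the following with the Python Elasticsearch client; the index name, field name and client call style are assumptions (and version-dependent), not the thesis configuration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = ("December 17 2014 - U.S. President Barack Obama announces the "
         "resumption of normal relations between the U.S. and Cuba.")
resp = es.search(index="reddit_threads",              # assumed index name
                 query={"match": {"title": query}},   # BM-25 is Elasticsearch's default similarity
                 size=10)                             # top 10 threads
top_thread_ids = [hit["_id"] for hit in resp["hits"]["hits"]]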
  • 54. Experiment Evaluation. • We use Cumulative Gain (CG) to measure diversity; CG can also penalise redundancy. • Charles L. A. Clarke claims [7] that CG at rank k can be used directly as a diversity evaluation measure. • Al-Maskari [8] provides evidence that CG correlates better with user satisfaction than Normalised Discounted Cumulative Gain (nDCG).
  • 55. Experiment Evaluation. CG[k] = \sum_{j=1}^{k} G[j], with G[k] = \sum_{i=1}^{m} J(d_k, i)\,(1 - \alpha)^{r_{i,k-1}} and r_{i,k-1} = \sum_{j=1}^{k-1} J(d_j, i), where J(d_j, i) = 1 if comment d_j contains nugget n_i and 0 otherwise. • r_{i,k-1} is the number of comments d_j ranked up to position k-1 that contain nugget n_i. • \alpha is a constant between 0 and 1 that reflects the assessor error; we set \alpha to 0.5 for our experiment.
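A direct transcription of the gain computation above; comments are represented by the set of nuggets (topic-sentiment tags) they contain, which is assumed to be given.

def cumulative_gain(ranked_comments, alpha=0.5, k=10):
    """ranked_comments: list of nugget sets (topic-sentiment tags), in rank order."""
    seen = {}                                  # nugget n_i -> r_{i,k-1}, occurrences so far
    cg = 0.0
    for nuggets in ranked_comments[:k]:
        gain = 0.0                             # G[j] for this position
        for n in nuggets:
            gain += (1 - alpha) ** seen.get(n, 0)   # J(d_j, i) * (1 - alpha)^{r_{i,j-1}}
            seen[n] = seen.get(n, 0) + 1
        cg += gain                             # CG[k] accumulates the gains
    return cg

print(cumulative_gain([{"cuba+neutral"}, {"cuba+neutral"}, {"cigar+positive"}]))  # 2.5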
  • 56. Experiment Evaluation: Sainte-Laguë Method
Table 3. Sainte-Laguë method experiment
  retrieval result             CG              retrieved percent
  diversified result with SL   71.80 ± 44.31   16.60%
  pseudo search result         50.37 ± 27.29   100%
• The SL method shows that the diversified search results achieve a substantial diversity improvement while using, on average, only 16.60% of the comments from the pseudo search result.
• The increase in diversity is expected because comments are retrieved directly from the topic-sentiment clusters with proportionality. The SL method proves effective, at the expense of coherence in the discussion.
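For illustration, a generic Sainte-Laguë apportionment over topic-sentiment clusters might look like the sketch below. This is a textbook divisor-sequence implementation and may differ in detail from the thesis's SL method (for example, how the 2.5 parameter determines the per-thread quota of comments).

```python
# Generic Sainte-Laguë apportionment sketch: distribute a fixed number of
# comment slots over clusters proportionally to their sizes.
import heapq

def sainte_lague_allocate(cluster_sizes, seats):
    """Allocate `seats` slots using the Sainte-Laguë divisors 1, 3, 5, ..."""
    allocation = [0] * len(cluster_sizes)
    # max-heap of quotients (store negatives because heapq is a min-heap)
    heap = [(-size / 1.0, i) for i, size in enumerate(cluster_sizes)]
    heapq.heapify(heap)
    for _ in range(seats):
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        divisor = 2 * allocation[i] + 1   # next odd divisor for this cluster
        heapq.heappush(heap, (-cluster_sizes[i] / divisor, i))
    return allocation

# e.g. 4 topic-sentiment clusters with 120, 60, 15, 5 comments and 10 slots:
print(sainte_lague_allocate([120, 60, 15, 5], 10))   # -> [6, 3, 1, 0]
```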
  • 57. Experiment Evaluation: Comment Tree Decomposition (CTD) Method
• CTD also demonstrates its effectiveness at reducing redundancy and improving the diversity of the pseudo search results while using fewer comments.
• The small-scale comment trees also maintain the coherence and conversational style of the discussion.
Table 4. Experiment with the CTD method
  retrieval result         CG              retrieved percent
  CTD comment score        27.11 ± 11.19   70.51%
  CTD NP words             27.31 ± 10.81   70.67%
  CTD named entities       27.54 ± 11.39   59.17%
  CTD bigrams              26.60 ± 10.63   70.77%
  CTD number of topics     28.45 ± 11.72   73.26%
  pseudo search result     26.38 ± 9.8     100%
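A highly simplified sketch of decomposing a thread's comment tree into candidate sub-trees and ranking them by a pluggable score (here the sum of Reddit comment scores, echoing the "CTD comment score" variant). The thesis's actual decomposition and scoring criteria (NP words, named entities, bigrams, number of topics) are richer, so treat this only as an assumption-laden illustration.

```python
# Toy comment-tree decomposition: every comment induces a sub-tree rooted at
# itself; sub-trees are scored by a pluggable criterion and the best are kept.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Comment:
    cid: str
    score: int                      # e.g. Reddit up-vote score
    children: List["Comment"] = field(default_factory=list)

def subtrees(root: Comment) -> List[Comment]:
    out = [root]
    for child in root.children:
        out.extend(subtrees(child))
    return out

def rank_subtrees(root: Comment, score_fn: Callable[[Comment], float], top_n: int = 3):
    """Score each sub-tree (sum of score_fn over its nodes) and keep the top_n."""
    def tree_score(node: Comment) -> float:
        return score_fn(node) + sum(tree_score(c) for c in node.children)
    return sorted(subtrees(root), key=tree_score, reverse=True)[:top_n]

# Toy thread: one root comment with two reply branches.
tree = Comment("c1", 10, [Comment("c2", 4, [Comment("c4", 1)]), Comment("c3", 7)])
print([t.cid for t in rank_subtrees(tree, lambda c: c.score)])   # ['c1', 'c3', 'c2']
```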
  • 58. Conclusion • We proposed novel methods to distill diverse, interpretable topics from the pseudo search results using topic models combined with effective text processing. • We studied the characteristics of Reddit comments. • We introduced two diversification methods, the Sainte-Laguë (SL) Method and the Comment Tree Decomposition (CTD) Method, to reduce redundancy and diversify the returned results. • According to the experimental results, both methods prove to be effective diversification techniques: the SL method treats comments as independent entities, while CTD preserves the conversational style of the discussions.
  • 59. References
[1] R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong. Diversifying Search Results. WSDM 2009.
[2] R. L. T. Santos, C. Macdonald, I. Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010.
[3] V. Dang and W. B. Croft. Diversity by Proportionality: An Election-based Approach to Search Result Diversification. SIGIR 2012.
[4] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
[5] V. Dang and W. B. Croft. Term Level Search Result Diversification. SIGIR 2013.
[6] C. Zhai, W. W. Cohen, J. Lafferty. Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.