Finding Missing Tweets using Topic Structure and Browsing Time

1/19
Finding Missing Tweets using Topic Structure and Browsing Time
Finding Missing Tweets
using Topic Structure and Browsing Time
Yu Suzuki†
, Hiromitsu Ohara‡
, Akiyo Nadamoto‡
†
Nara Institute of Science and Technology, Japan
‡ Konan University, Japan
5. December, 2017

2/19
Introduction
Introduction
From Social Network Services (SNSs), there are massive volumes of
messages.
Users are not always on-line.
Users miss important information on SNSs.
c.f.) A function on twitter “While you were away.” The structure of
summarization is ﬂat.
Users need to understand in a short time about the topics while the
users are off-line.
A mechanism of summarizing the tweets is useful.
We believe that when we summarize the tweets as a tree structure, the
users can easily understand the topics.
Summarize Tweets Using Topic Structure and Browsing Time

3/19
Introduction
Why we consider topic structure?
missing tweets topic sub topic
Today’s baseball game is exciting! baseball game
Yesterday I went to baseball stadium baseball place
I’m at Salzburg! travel austria
I’m at baseball stadium! baseball place
· · · · · · · · ·
Tweets with minority topics are ignored if we summarize missing tweets.
Missing tweets are mainly related to “baseball.”
Only one tweet is related to “travel.”
If these tweets are summarized without using topics, the tweet about travel
may not be appeared at the summary.
We visualize this topics of tweets as a tree structure.
First, the users see top-level topics, such as “baseball” and “travel.”
if the users are interested in “baseball,” the users browse “game” and “place.”
Users do not miss a tweet about travel.
How to construct the topic structure?

4/19
Introduction
Our contribution
1 Generate topic structures of tweets using the Wikipedia category tree
and browsing time
We use Wikipedia category as a knowledge to construct tree structure.
We use browsing time as a tweets which users miss.
2 Visualize the topic structure of tweets using a network graph
We implement our method using Web application.
3 Conﬁrm using real dataset that our proposed method is effective for
commonly known topics
Our method is effective if there are many information about the theme.
Wikipedia only have articles about commonly known topics.

5/19
Our Proposed Method
Overview
2. Generate a Topic Graph
Wikipedia Category Tree
Tweets
1. Clustering of Tweets
C0 = Ichiro C1 = Masahiro
C3 = Human ➡ delete
too wide to cover topics
Ichiro Masahiro
MLB playerSportsJapanese
Topic node:
a parent node of
Tweet clusters
3. Visualization
Ichiro Masahiro
Japanese MLB Player
Tweet list
Now three of the greatest
hitters in Major League history
in one dugout with the Marlins.
Barry Bonds, Ichiro and Don
Kelly. amazing.
Joe Girardi discusses
Masahiro Tanaka pitching on
extended rest after Tuesday
night's 9-0 victory.
Baseball
Sports
Basketball
Mariners
MLB
Players
Japan
Players
Abstract node:
a parent node of
topic nodes
Topic Graph
tweets
correspond to
category
about Ichiro
about Masahiro

6/19
Our Proposed Method
Steps
Overview
1 Extracting missing tweet: Extracting which tweets are submitted
during user’s browsing time and it is before and after.
2 Clustering Tweets into Categories and extracting topics: Using
Repeated Bisection as clustering tools, we divide a set of tweets into
clusters and extract topics in each cluster.
3 Generate a topic graph: Using the topics of tweets and the Wikipedia
category tree, we generate a topic graph of the tweets.
4 Classify topics Classify the topics which are nodes of the topic graph
as known topics and unknown topics.
5 Visualization of topic graphs: We visualize the topic graph and the
corresponding tweets using our implemented Web user interface.

7/19
Our Proposed Method
0. Extraction of missing tweets
0. Extraction of missing tweets
We extract tweets which users have not browse.
We assume that the browsing time is given.
Browsing time may be available if we construct twitter client applications.

8/19
Our Proposed Method
1. Clustering tweets
Tweets
1. Clustering of Tweets
C0 = Ichiro C1 = Masahiro
C3 = Human ➡ delete
too wide to cover topics
We use repeated-bisection for clustering tweets.
In our experiment, repeated-bisection is the most effective method for
clustering short texts.
Similar to k-means.
We remove noise clusters.
We calculate the cosine similarity between each two texts in a cluster.
We remove the nodes if the similarity is beyond the threshold.

9/19
Our Proposed Method
Repeated bisection
Given a set of tweets T, we extract a feature vector for each tweet. First, we
divide a tweet into the terms using morphological analysis or POS tagger.
Then, we select noun and unknown terms as feature terms. The reason of
using unknown terms is that these terms consist of slang and newly invented
words which are not recognized by the morphological analysis. To clean the
feature terms, we select the terms which are included in more than two
tweets. Feature vector f(ti ) of tweet ti (ti ∈ T) is deﬁned as follows.
f(ti ) = [tf(ti , w1) · idf(w1), tf(ti , w2) · idf(w2), · · · ,
tf(ti , wm) · idf(wm)] (1)
tf(ti , wj ) =



1 if wk appears at ti
more than once
0 else
(2)
idf(wj ) = − log
df(wj )
|T|
(3)
where wj is a term in T, |T| is the number of tweets in T, tf(ti , wj ) indicates
whether wj appears at ti or not, df(wj ) is the number of tweets which have wj ,
and idf(wj ) is an IDF (Inverted Document Frequency) value of wj where a
document is a tweet.

10/19
Our Proposed Method
2. Topic graph
2. Generate a topic graph
2. Generate a Topic Graph
Ichiro Masahiro
MLB playerSportsJapanese
Topic node:
a parent node of
Tweet clusters
Abstract node:
a parent node of
topic nodes
tweets
correspond to
category
1 Generate a topic node corresponds to a tweet.
2 Generate a semantic node, which corresponds to a topic node.
3 Merge multiple nodes into simple structure of nodes.

11/19
Our Proposed Method
2.1 Generate a topic node
Generate a topic node
Topic node: An Wikipedia article corresponds to a cluster.
Method
1 Repeated bisection method outputs keywords for each category, with related
degrees between keywords and categories.
2 We retrieve articles in Wikipedia, and ﬁnd the most relevant category.
Many categories have their articles, then the categories are also
candidates of topic nodes.
Example
A category has keywords such that {(“baseball , 1), (“player , 0.5)}.
There are two Wikipedia articles wp and wq:
The title of wp is “baseball team” and wq is “baseball player.”
Calculate scores for each article:
wq = 1 + 0 = 1 , and wq = 1 + 0.5 = 1.5.
We select an article wq, “baseball player,” as a topic node.

12/19
Our Proposed Method
2.2 Generate a semantic node
Generate a semantic node
Semantic node: The categories which correspond to the topic node on
the Wikipedia.
Method
1 Get category names using Wikipedia.
2 Prune unsuitable categories from semantic nodes using black list.
Person born in 19xx, Stub, A list of xx, . . .
Example
Category c0 is tagged by “Ichiro Suzuki, ”
An article “Ichiro Suzuki” has two categories “Yankees Players” and
“Baseball Players.”
“Yankee players” and “Baseball players” are considered as semantic node.
Ichiro Suzuki Kenta Maeda
Yankees Player Baseball Player Baseball PlayerDodgers Player

13/19
Our Proposed Method
2.3 Merge multiple nodes
Merge Multiple Nodes
Yankees Player Baseball Player Baseball PlayerDodgers Player
Figure: Example of two network graphs
Yankees Player Baseball Player Dodgers Player
Figure: Two nodes are merged if
two graphs share the common
nodes.
Yankees Player Baseball Player Dodgers Player
Sportspeople
Figure: If a leaf node and not leaf node correspond to the same article, these nodes
are merged.

14/19
Our Proposed Method
Visualization
Visualize topic nodes and semantic nodes
3. Visualization
Ichiro Masahiro
Japanese MLB Player
Tweet list
Now three of the greatest
hitters in Major League history
in one dugout with the Marlins.
Barry Bonds, Ichiro and Don
Kelly. amazing.
Joe Girardi discusses
Masahiro Tanaka pitching on
extended rest after Tuesday
night's 9-0 victory.
Topic Graph
about Ichiro
about Masahiro

15/19
Experiments
Experimental Setup
Experimental Setup
Aim of our experiment:
To conﬁrm that our method is effective or not.
Which themes of tweets are appropriate for applying our proposed method.
Evaluation Measure: Precision ratio
We (the second author of our paper) manually select an appropriate
categories for each tweet.
We calculate precision ratio for each category.
precision =
The number of accurately categorized tweets
The number of tweets in the category
Dataset
Category: Politics, Music, Computer, Sports, and Animation/Games (ﬁve
categories)
Tweets: We prepared 2,000 tweets for each category. We use Twitter search
API.

16/19
Experiments
Experimental Setup
Procedure of the experiment
1 Clustering 2,000 tweets for each theme, and extracting topics of each
cluster
2 Generate the topic graph using our proposed method
3 Give clusters and their corresponding Wikipedia article titles to the
observers.
Observers are hired using crowdsourcing (Crowdworks).
4 Observers evaluate whether the article titles are appropriate or not for
representing the clusters using the following ﬁve degrees (5:
appropriate, 4: almost appropriate, 3: cannot say, 2: almost
inappropriate, 1: inappropriate).
5 Summarize the observer’s evaluations, and analyze whether our
proposed method has good accuracy or not

17/19
Experiments
Experimental Results
Experimetal Results
1.0-2.0 2.0-3.0 3.0-4.0 4.0-5.0
25
0
5
10
15
20
Average of evaluation scores
Numberofevaluationscores
Politics
Music
Computer
Sports
Video Games
Table: Numbers of
evaluation scores for
respective bins.
Theme # obsv. Prec.
Politics 8 0.72
Music 11 0.56
Computer 5 0.44
Sports 5 0.42
Animation 4 0.52
& Games
Our method is useful for tweets about politics.
There are many technical terms about politics
Many articles are on the WIkipedia.
Our method is not effective for computer, sports.
There are wide variety of topics.
Less number of articles are on the Wikipedia.

18/19
Experiments
Experimental Results
Merging Multiple topics
ϰϵ͗zĂŬƵůƚ
:ĂƉĂŶĞƐĞ ďĞǀĞƌĂŐĞƐ
ϯϰ͗^ŽĨƚĂŶŬ
ŽŵƉĂŶŝĞƐ ůŝƐƚĞĚ ŽŶ ƚŚĞ WŝŶŬ ^ŚĞĞƚƐ
ŽŵƉĂŶŝĞƐ ďĂƐĞĚ ŝŶ dŽŬǇŽ
One example of a topic/semantic graph
Black node means topic node, and gray node means semantic node.
There is a topic node about “Yakult” and “Sofrbank.”
Yakult: A Manufacturer of drinks
Softbank: A Carrer of Cell phone
Both two companies are based in Tokyo.
We can connect two nodes using our proposed merging nodes of
multiple topics.

19/19
Conclusion
Conclusion
We proposed a method for automatically extracting user’s missing tweets
based on topic granularity and missing time of browsing user.
We extract missing tweet based on the missing time.
We propose a method for mapping a set of extracted missing tweets to the
Wikipedia category tree by considering topic structure granularity.
We conﬁrmed that our proposed method is effective for “politics,” but not
effective for “computer” and “sports.”
Future Work
We should consider resources other than Wikipedia as a knowledge base.
Wikipedia is not always suitable for personal topics.
We should consider synonyms.
We should compare the other methods with our method.
We should do a usability test of Web user interface.

Finding Missing Tweets using Topic Structure and Browsing Time

Recommended

Recommended

More Related Content

Similar to Finding Missing Tweets using Topic Structure and Browsing Time

Similar to Finding Missing Tweets using Topic Structure and Browsing Time (20)

Recently uploaded

Recently uploaded (20)

Finding Missing Tweets using Topic Structure and Browsing Time