Classification Method for Shared Information on Twitter Without Text Data
Seigo Baba, The University of Tokyo, Japan
Fujio Toriumi, Takeshi Sakaki, Kosuke Shinoda, Kazuhiro Kazama, Satoshi Kurihara, Itsuki Noda
The 3rd International Workshop on Social Web for Disaster Management (SWDM'15), held with WWW'15 (May 2015, Florence, Italy)
1
Contents
• Introduction
• Proposed Tweet Clustering Method
• Subjective Experiments
• Linguistic Similarities in Clusters
• Conclusions
2
Information in Disaster Situations
• Local information must be collected
– For Victims
• Shelter location
• Tsunami, ...
– For Rescuers
• Donating money
• Volunteer activities, ...
4
How to Collect Information in Disaster Situations?
• From mass media?
– General and public information only
– Not personalized
• From social media?
– They perform well
• [10 Mendoza], [11 Miyabe], [10 Sakaki]
– In particular, Twitter is useful
• We also focus on Twitter
5
Classification of Tweets Is Required
• A lot of tweets
– 5,000 tweets were posted per second during the 2011 Great East Japan Earthquake (Official Twitter Blog, Japan)
• Collecting appropriate tweets is difficult
• Classification of tweets is required!
6
Weakness in Classification Using Text Mining
7
「Shut off the gas」
「My head hurts」
「Wear shoes」
「Good morning !」
「A head office」
「Protect your head」
:
Group
Cluster①
「My head hurts」
「A head office」
「Protect your head」
Cluster②
:
• Are they topic-similar?
Focusing on Retweets
• RT (retweet): suggests that a user is interested in a tweet [13 Toriumi]
8
「Shut off the gas」
「My head hurts」
「Wear shoes」
「Good morning !」
「A head office」
「Protect your head」
:
Group
Cluster①
「Shut off the gas」
「Wear shoes」
「Protect your head」
Cluster②
:
Interest
Purpose of this study
9
Propose a novel tweet classification method
focusing on retweets
An Outline of the Proposed Method
• Calculate the similarity between tweets $t_i$ and $t_j$ → Construct the retweet network → Apply network clustering
[Figure: Tweet1, Tweet2 and Tweet3 with example similarity values 0.15 and 0.03]
11
The Similarity of Retweet Users
• Similar tweets are retweeted by similar users
• Two tweets whose similarity of retweet users is high may share a topic
– $O_{ij} = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}$
– $U_i$: the set of users who retweeted tweet $T_i$
[Figure: users connected to the tweets T1, T2, T3, T4, T5, ... that they retweeted]
12
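As a minimal sketch of the retweet-user similarity defined on this slide, the following Python function computes the Jaccard coefficient of two sets of retweeting users. The function name and the user IDs are hypothetical, and returning 0.0 when both sets are empty is an assumption, since the slides do not cover that case.

```python
def retweet_user_similarity(users_i, users_j):
    """Jaccard coefficient O_ij of two sets of retweeting users."""
    union = users_i | users_j
    if not union:
        # Assumption: two tweets without any retweeters have similarity 0.
        return 0.0
    return len(users_i & users_j) / len(union)


# Hypothetical user IDs: two of the four distinct users retweeted both tweets.
u1 = {"alice", "bob", "carol"}
u2 = {"bob", "carol", "dave"}
print(retweet_user_similarity(u1, u2))  # 0.5
```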
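Construct Retweet Network
• Connect two tweets that satisfy $O_{ij} > th$
– $th = 0.05$
– The similarities were calculated for all pairs of tweets
– Nodes in the same connected component may be topic-similar to each other
[Figure: tweets T1 to T4 with pairwise similarities 0.06, 0.1 and 0.01; with $th = 0.05$ the edge with similarity 0.01 is dropped, leaving T4 disconnected from T1, T2 and T3]
14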
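The network construction can be sketched in plain Python as follows. This is an illustration rather than the authors' implementation: the function names and the toy retweeter sets are hypothetical, and the depth-first search for connected components is just one straightforward way to recover the components the slide mentions.

```python
from itertools import combinations


def build_retweet_network(retweeters, th=0.05):
    """Return the edges (t_i, t_j) whose retweet-user similarity exceeds th."""
    edges = []
    for ti, tj in combinations(sorted(retweeters), 2):
        ui, uj = retweeters[ti], retweeters[tj]
        union = ui | uj
        o_ij = len(ui & uj) / len(union) if union else 0.0
        if o_ij > th:
            edges.append((ti, tj))
    return edges


def connected_components(nodes, edges):
    """Group nodes into connected components with a depth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical data: T1 to T3 share retweeters, T4 shares none.
retweeters = {
    "T1": {"a", "b", "c"},
    "T2": {"b", "c", "d"},
    "T3": {"c", "d", "e"},
    "T4": {"z"},
}
edges = build_retweet_network(retweeters)
components = connected_components(retweeters, edges)
```

Because every pair of tweets is compared, the cost of this sketch grows quadratically with the number of tweets.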
Data
• Tweets retweeted more than 100 times from March 5 to 24, 2011
– The Great East Japan Earthquake occurred on March 11
– 34,860 tweets
15
Retweet network
16
Network Clustering
• Large components are assumed to contain various topics
• Apply a clustering method based on the Newman method [04 Newman] to the retweet network
– To extract clusters that contain similar tweets
17
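The slides do not say which variant of the Newman method was used or how it was adapted. As a rough illustration only, the sketch below implements greedy modularity maximisation in the spirit of Newman's fast algorithm: every node starts as its own community, and the pair of connected communities whose merge increases modularity the most is merged repeatedly. Stopping as soon as no merge increases modularity is a simplification, the function name is hypothetical, and node IDs are assumed to be mutually comparable with no self-loops.

```python
from collections import defaultdict


def greedy_modularity_clusters(edges):
    """Merge communities greedily while a merge still increases modularity."""
    m = len(edges)
    members = {}            # community id -> set of nodes
    e = defaultdict(dict)   # e[i][j]: half the fraction of edges between i and j
    a = defaultdict(float)  # a[i]: fraction of edge ends attached to i
    for u, v in edges:
        members.setdefault(u, {u})
        members.setdefault(v, {v})
        e[u][v] = e[u].get(v, 0.0) + 1 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1 / (2 * m)
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
    while True:
        best_gain, best_pair = 1e-12, None
        for i in e:
            for j, e_ij in e[i].items():
                gain = 2 * (e_ij - a[i] * a[j])  # modularity change of merging i, j
                if i < j and gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            return list(members.values())
        i, j = best_pair
        members[i] |= members.pop(j)
        # Re-attach the neighbours of j to the merged community i.
        for k, e_jk in e.pop(j).items():
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + e_jk
                e[k][i] = e[k].get(i, 0.0) + e_jk
        a[i] += a.pop(j)


# Hypothetical network: two triangles joined by the single edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_clusters = greedy_modularity_clusters(toy_edges)
```

On this toy network the two triangles end up as separate clusters, which mirrors the idea of splitting a large component into groups of similar tweets.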
Clustering Result
• 11,494 tweets → 2,001 clusters
• The following slides show some clusters
18
Result Example 1
• Cluster about shelters
「The Oura cafeteria on the Ueno Campus of the Tokyo University of the Arts is open. You can spend the night there.」
「[a quick report] Okumakodo is open! It looks like it has some blankets http://twitpic.com/48f6y2」
「Are you all right? [The Tokyo Bunka Kaikan just opened. It's getting dark and cold, so if you are around Ueno Station, please go there.]」
19
Result Example 2
• Cluster about advice for victims
「If you are evacuating with a baby, wrap the baby in a blanket and carry it in a tote bag. No baby buggies! #jishin」
「[Please spread] If you use Twitter by mobile phone, turn off your icons to conserve battery life.」
20
Proposed Method's Validity
• Conduct a subjective experiment to test the proposed method's validity
– Are tweets in the same cluster similar to each other?
• The experiment consists of two-choice questions
22
Example of a Question in the Subjective Experiment
• Statement tweet: 「The site gives information about power plants and rolling blackouts」
• Question: which tweet is more topic-similar to the statement tweet?
– Choice tweet A: 「Twitter is a source of information」
– Choice tweet B: 「Yahoo! Map shows the area of the rolling blackouts」
23
How to Make Questions?
• Choice tweets consist of two tweets
– Inner tweet
• Belongs to the same cluster as the statement tweet
– Outer tweet
• Belongs to a different cluster from the statement tweet
[Figure: two clusters; the statement tweet and the inner tweet lie in one cluster, and the outer tweet lies in the other]
24
How to Make Questions?
• Two clusters are selected randomly
[Figure: several clusters of tweets]
25
How to Make Questions?
• A statement tweet is selected randomly
[Figure: several clusters of tweets, with one tweet marked as the statement tweet]
27
How to Make Questions?
• The inner tweet and the outer tweet are selected randomly
[Figure: the inner tweet lies in the statement tweet's cluster, and the outer tweet lies in another cluster]
28
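The question-making steps can be sketched as follows. The function name, the dictionary layout and the toy tweet IDs are hypothetical. Two details are assumptions not stated on the slides: the statement tweet's cluster is required to hold at least two tweets so that a distinct inner tweet exists, and the two choices are shuffled so that their order does not reveal the inner tweet.

```python
import random


def make_question(clusters, rng):
    """Build one two-choice question from a list of clusters (lists of tweets)."""
    # Assumption: the statement's cluster needs at least two tweets so that a
    # distinct inner tweet exists.
    home_idx = rng.choice([i for i, c in enumerate(clusters) if len(c) >= 2])
    other_idx = rng.choice([i for i in range(len(clusters)) if i != home_idx])
    statement, inner = rng.sample(clusters[home_idx], 2)
    outer = rng.choice(clusters[other_idx])
    choices = [inner, outer]
    rng.shuffle(choices)  # assumption: hide which choice is the inner tweet
    return {"statement": statement, "choices": choices,
            "inner": inner, "outer": outer}


# Hypothetical clusters of tweet IDs.
clusters = [["s1", "s2", "s3"], ["a1", "a2"], ["x1"]]
question = make_question(clusters, random.Random(0))
```

Passing a seeded `random.Random` keeps the generated questions reproducible.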
Example of a Question in the Subjective Experiment
• Statement tweet: 「The site gives information about power plants and rolling blackouts」
– Inner tweet: 「Yahoo! Map shows the area of the rolling blackouts」
– Outer tweet: 「Twitter is a source of information」
29
Examinees and Questions
• 100 questions were selected randomly
– Each examinee answered 50 of them
• Fourteen examinees
– Seven examinees answered each question
– If more than four examinees select the inner tweet, the result is labeled 'Correct'
30
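The labeling rule can be sketched as below. The function names, the 'Incorrect' label and the vote counts are hypothetical. The cutoff is kept as a parameter: `min_votes=5` follows the slide's literal wording "more than four", which is an assumption about how the rule was applied.

```python
def label_question(inner_votes, min_votes):
    """Label a question 'Correct' when enough examinees chose the inner tweet."""
    return "Correct" if inner_votes >= min_votes else "Incorrect"


def correct_rate(votes_per_question, min_votes):
    """Fraction of questions labeled 'Correct'."""
    labels = [label_question(v, min_votes) for v in votes_per_question]
    return labels.count("Correct") / len(labels)


# Hypothetical counts of examinees (out of seven) who chose the inner tweet.
votes = [7, 6, 5, 3, 7]
rate = correct_rate(votes, min_votes=5)
print(rate)  # 0.8
```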
Subjective Experiment Result
• 89% of all the questions were labeled correct!
31
Subjective Experiment Result
• The similarities are obvious
– More than six examinees selected the inner tweet in 77% of the questions
32
The Validity Was Confirmed
• We confirmed the validity of the proposed method
– A very high proportion of the clusters contain tweets that are mutually similar
– The similarity of the tweets within each cluster is obvious
33
Can Classification Based on Text Mining Group Them?
• Cluster about advice for victims
「This tweet was posted by a volunteer center. Yesterday, more than 1000 people read it and learned about dangerous areas and shortages. What should we do? http://t.co/4JpWlXt #jishin」
「RT [please spread] Check that your car has a jack for changing tires. They are useful for rescuing victims from rubble. #jishin #jisin」
• Some clusters have little linguistic similarity
– Such clusters are difficult to form by using text mining
35
Linguistic Similarities in Clusters
• A quantitative assessment of linguistic similarity is required
• Apply the vector space model
– Calculate the linguistic similarity between two documents based on TF-IDF
• In this study, document = tweet
36
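A minimal vector-space sketch is shown below. Several choices are assumptions, because the slides do not specify them: whitespace tokenisation (the original Japanese tweets would need a morphological analyser), raw term counts weighted by log(N/df), and cosine similarity between the resulting vectors. The function names and the example documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf * idf} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["protect your head", "my head hurts", "wear shoes"]
vectors = tfidf_vectors(docs)
sim_head = cosine(vectors[0], vectors[1])   # the two tweets share "head"
sim_none = cosine(vectors[0], vectors[2])   # no shared terms
```

Here "protect your head" and "my head hurts" receive a non-zero similarity only because they share a word, while "protect your head" and "wear shoes" score zero despite both being advice, which is the weakness the slides point out.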
Apply Vector Space Model
• Calculate the linguistic similarity of every pair of tweets (${}_{34{,}860}C_2$ pairs)
– Including both linked and unlinked pairs
– To obtain reference values
37
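To give a sense of scale, the number of tweet pairs can be computed directly. This short sketch uses only the tweet count from the Data slide; the variable names are illustrative.

```python
import math

n_tweets = 34_860
n_pairs = math.comb(n_tweets, 2)  # n * (n - 1) / 2 unordered pairs
print(n_pairs)  # 607592370
```

That is roughly 608 million similarity calculations.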
Reference Values
• The result of the calculation over all pairs
– Average = 0.0156
• When the similarity between two tweets is under that average (0.0156), their linguistic similarity is no higher than that of a random pair
38
Linguistic Similarities in Each Cluster
• The linguistic similarity of each cluster was also calculated
– Defined as the average linguistic similarity over all pairs of tweets that belong to the cluster
• Example: a cluster of Tweet1, Tweet2 and Tweet3 has ${}_3C_2 = 3$ pairs, with similarities 0.5, 0.3 and 0.1
– Linguistic similarity in the cluster: $\frac{0.5 + 0.3 + 0.1}{3} = 0.3$
39
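The per-cluster score can be sketched as the mean pairwise similarity. The function name is hypothetical, the assignment of the slide's three values to particular pairs is assumed, and returning 0.0 for a cluster with fewer than two tweets is an assumption.

```python
from itertools import combinations


def cluster_linguistic_similarity(tweets, similarity):
    """Average similarity over all unordered pairs of tweets in a cluster."""
    pairs = list(combinations(tweets, 2))
    if not pairs:
        return 0.0  # assumption: undefined for fewer than two tweets
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Values from the slide's example; which pair has which value is assumed.
pair_similarity = {
    ("Tweet1", "Tweet2"): 0.5,
    ("Tweet1", "Tweet3"): 0.3,
    ("Tweet2", "Tweet3"): 0.1,
}
score = cluster_linguistic_similarity(
    ["Tweet1", "Tweet2", "Tweet3"],
    lambda a, b: pair_similarity[(a, b)],
)
print(round(score, 4))  # 0.3
```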
Linguistic Similarities in Each Cluster
• 8.25% of all clusters are under 0.0156
– The linguistic similarity of these clusters is as low as that of randomly selected tweets
– Such clusters are difficult to form by using text mining!
40
Example of Clusters with Low Linguistic Similarities 1
• Cluster about life in a shelter
– Linguistic similarity is 0.0108
「I've experienced two big earthquakes. I spent a few nights in a car and saw many senior citizens who seemed to be suffering from economy class syndrome from remaining in the same posture for a long time. If you have to spend too much time in a car or a cramped shelter, don't forget to stretch your legs.」
「If children are shaking or suffering from fear, hug and comfort them.」
41
Example of Clusters with Low Linguistic Similarities 2
• Cluster about advice for victims
– Linguistic similarity is 0.0052
「RT [Summarize the information] Open the door, Cook some rice, Place baggage in an entrance, Buy water, Snacks and a towel, Blankets, Wear shoes ....」
「My friend who survived the Great Hanshin Earthquake evacuated his house in pajamas. So tonight, sleep in clothes just in case you have to leave quickly.」
42
Conclusions
• We proposed a novel method for classifying tweets that focuses on retweets and does not use text mining
• Most of the obtained clusters contain local information that is very useful in disaster situations
44
Conclusions
• A subjective experiment confirmed the validity of our method
– Nodes are similar to each other in 89% of the clusters
– The similarities are obvious
• Clusters obtained by our method are topic-similar, even if they are not linguistically similar
45
Future Work
• Apply a soft clustering method to the retweet network
– Our proposed method assigns each tweet to exactly one cluster
– A tweet cannot belong to multiple clusters
[Figure: "Tsunami" and "Shelter" fall under information for victims, and "Donating money" and "Volunteer" under information for rescuers; "Donating supplies" (marked "?") is relevant to both groups]
46
Future Work
• Apply a soft clustering method to the retweet network
– When soft clustering is applied to the retweet network, a tweet can belong to multiple clusters
[Figure: "Donating supplies" belongs to both information for victims and information for rescuers]
47
Future Work
• Reduce the amount of computation
– Information must be provided quickly in disaster situations
48
Thank you!
• If you have any good ideas for our study, please e-mail me.
baba@crimson.q.t.u-tokyo.ac.jp
49


Editor's Notes

  • #2 I'm Seigo Baba, a student at the University of Tokyo. Today I'll talk about a classification method for shared information on Twitter without text data.
  • #3 Here are today's contents.
  • #4 I'll start with the introduction.
  • #5 In a disaster situation, local information must be collected. Victims need information ~~ and so on. Rescuers need information about ~ and so on.
  • #6 How do we collect such local information in a disaster situation? From mass media? Maybe not. Mass media focus only on general and public information, which is not personalized. Social media, however, are attracting a great deal of attention because they can provide such localized information. In particular, many reports argue that Twitter, one of the most influential social media, is useful for sharing information during disasters. We also treat Twitter as a source of local information.
  • #7 There are a lot of tweets. For example, five thousand tweets were posted per second in ~. So collecting appropriate tweets is difficult. That is to say, classification of tweets is required.
  • #8 Previous work on extracting information from Twitter is based on text mining. However, in some cases text mining has difficulty extracting information. For example, five tweets ~~; if you apply text mining to them, you will get that cluster. Are they topic-similar? Absolutely not.
  • #9 How about focusing on retweets? A retweet means that the user who retweeted the tweet is interested in it. If a user retweeted these tweets, it is assumed that he or she has a common interest in them. So classification based on retweets may group these tweets. The obtained cluster contains advice for victims immediately after a disaster.
  • #10 Now, the purpose of this study is to propose a novel tweet classification method focusing on retweets.
  • #11 Next, the proposed tweet clustering method.
  • #12 This is an outline of the proposed method. First, calculate the similarity of retweet users between two tweets $t_{i}$ and $t_{j}$ for all pairs of tweets. Second, connect two tweets if their similarity exceeds a threshold value, to construct the retweet network. Finally, apply a network clustering method to the retweet network obtained above.
  • #13 (When many users retweet two tweets, they probably have a common interest in them and the topics are similar. In other words, ) Similar tweets are retweeted by similar users. In other words, two tweets whose similarity of retweet users is high may share a topic. The similarity of retweet users between tweets $t_{i}$ and $t_{j}$ is defined as $O_{ij} = |U_{i} \cap U_{j}| / |U_{i} \cup U_{j}|$, where $U_{i}$ and $U_{j}$ are the sets of users who retweeted $t_{i}$ and $t_{j}$. This is the Jaccard coefficient. If these users retweeted these tweets
  • #15 Connect two tweets if their similarity exceeds a threshold value, to construct the retweet network. We employed 0.05 as the threshold. It is assumed that nodes in the same component are topic-similar to each other and that separated nodes are not. So T1, T2 and T3 are topic-similar, and T4 is not topic-similar to them.
  • #16 This is the data used in this paper. We use the log data of tweets written in Japanese that were posted and officially retweeted during the 20 days from March 5 to 24, 2011. This period includes the Great East Japan Earthquake, which occurred on March 11, 2011. We selected tweets that were retweeted more than 100 times to focus on information that was spread and shared. The number of such tweets is 34,860.
  • #17 This figure shows the constructed retweet network. (Each node represents a tweet, and each edge represents a link between tweets whose similarity of retweet users is over the threshold.) The sizes of the communities differ. Communities with a few nodes are shown in the lower part, and communities with many nodes are shown in the upper part. Small components may have fewer topics. How about large components?
  • #18 It is assumed that large communities have various topics. We applied a clustering method based on the Newman method to the entire retweet network to extract clusters that contain similar tweets.
  • #19 We found 2,001 clusters after applying the clustering method. The following slides ~
  • #20 The tweets were originally written in Japanese, but the samples in this presentation are translated into English. This cluster has information about shelters. This tweet says that ~ This tweet says that ~ This tweet says that ~ They are topic-similar local information.
  • #21 This cluster has advice for victims. This tweet says that ~ This tweet says that ~ This tweet says that ~ They are topic-similar local information.
  • #23 The proposed method's validity ~ So we conducted subjective ~
  • #24 Here is an example of a question. This is the statement tweet, and these are the choice tweets. You would probably select this tweet (Yahoo), whose topic is more similar to that of the statement tweet.
  • #25 Choice tweets consist of two tweets, an inner tweet and an outer tweet. The inner tweet is the tweet that belongs ~ The outer tweet is the tweet that ~ To make questions, two clusters are selected randomly.
  • #30 Here is an example of a question. This is the statement tweet, and these are the choice tweets. You would probably select this tweet (Yahoo), whose topic is more similar to that of the statement tweet.
  • #31 Fourteen people participated in this experiment as examinees, and each question was answered by seven examinees. We randomly selected 100 questions from all the tweets, and each examinee answered 50 of them.
  • #32 This figure shows the relationship between the number of corresponding examinees and the rate of questions. 89% of all the questions were correct.
  • #33 When four or five examinees selected the inner tweet, the similarity of the nodes in the cluster is not obvious. However, more than six examinees selected the inner tweet in 77% of the questions. From this result, we conclude that the similarities of the nodes in each cluster are obvious.
  • #34 We confirmed the validity of the proposed method: the proportion of clusters whose nodes are mutually similar is very high, and the similarities of the nodes in each cluster are obvious.
  • #36 Some clusters have little linguistic similarity. This cluster is about advice for victims. These tweets are topic-similar but not linguistically similar.
  • #37 A quantitative assessment of linguistic similarity is needed. We applied a vector space model that calculates the linguistic similarity between two documents based on TF-IDF. In this study, a document is a tweet.
  • #38 We calculated the linguistic similarity of every pair of tweets, including both linked and unlinked pairs, to obtain reference values.
  • #39 Their average was 0.0156 and their standard deviation was 0.0218. When the similarity between two tweets is under the sum of the two values (0.0374), their linguistic similarity is no higher than that of a random pair.
  • #40 We also calculated the linguistic similarity in each cluster. The linguistic similarity in a cluster is defined as the average similarity over all pairs of nodes that belong to the cluster; for example, for a cluster that contains four nodes, it is the average over the ${}_{4}C_{2}$ pairs.
  • #41 19.1% of all clusters are under 0.0374. This means that some of the clusters obtained by the proposed method group tweets whose linguistic similarities are as low as those of randomly selected tweets. In other words, the proposed method identifies clusters with low linguistic similarity but high topic similarity.
  • #42 This is an example of a cluster with low linguistic similarity. These tweets give advice about life in a shelter.
  • #43 This cluster is about advice for victims. These tweets are not linguistically similar but are topic-similar.
  • #47 For example, there is a tweet about donating supplies. It is assumed that both victims and rescuers need that tweet. However, our proposed method assigns this tweet to only one of the two groups.