Finding Missing Tweets using Topic Structure and Browsing Time
Yu Suzuki†, Hiromitsu Ohara‡, Akiyo Nadamoto‡
† Nara Institute of Science and Technology, Japan
‡ Konan University, Japan
5 December 2017
Introduction
Social network services (SNSs) generate massive volumes of messages.
   Users are not always online.
   Users miss important information on SNSs.
   cf. Twitter's “While you were away” feature, whose summary has a flat structure.
Users need to grasp, in a short time, the topics that were posted while they were offline.
   A mechanism for summarizing the tweets is useful.
   We believe that users can understand the topics easily when the tweets are summarized as a tree structure.
Summarize tweets using topic structure and browsing time.
Why do we consider topic structure?
Missing tweets                        | Topic    | Subtopic
Today’s baseball game is exciting!    | baseball | game
Yesterday I went to baseball stadium  | baseball | place
I’m at Salzburg!                      | travel   | austria
I’m at baseball stadium!              | baseball | place
· · ·                                 | · · ·    | · · ·
Tweets on minority topics tend to be dropped when missing tweets are summarized.
   Missing tweets are mainly related to “baseball.”
   Only one tweet is related to “travel.”
   If these tweets are summarized without using topics, the tweet about travel may not appear in the summary.
We visualize the topics of the tweets as a tree structure.
   First, users see the top-level topics, such as “baseball” and “travel.”
   If users are interested in “baseball,” they browse “game” and “place.”
   Users do not miss the tweet about travel.
How do we construct the topic structure?
Our contributions
1. Generate topic structures of tweets using the Wikipedia category tree and browsing time.
   We use the Wikipedia categories as the knowledge base for constructing the tree structure.
   We use browsing time to determine which tweets users missed.
2. Visualize the topic structure of tweets using a network graph.
   We implement our method as a Web application.
3. Confirm, using a real dataset, that our proposed method is effective for commonly known topics.
   Our method is effective when plenty of information about the theme is available.
   Wikipedia only has articles about commonly known topics.
Our Proposed Method
Overview
Figure: Overview of the proposed method.
1. Clustering of tweets: tweets are grouped into clusters such as C0 = Ichiro and C1 = Masahiro; a cluster such as C3 = Human is deleted because it is too wide to cover topics.
2. Generate a topic graph: using the Wikipedia category tree (e.g., Sports, Baseball, Basketball, MLB, Mariners, Players, Japan), tweets are mapped to categories. A topic node is a parent node of tweet clusters; an abstract node is a parent node of topic nodes.
3. Visualization: the topic graph (e.g., Ichiro and Masahiro under Japanese MLB Player) is shown together with the list of tweets about each topic.
Steps
Overview
1. Extract missing tweets: determine which tweets were posted during the user’s browsing time and which were posted before or after it.
2. Cluster tweets and extract topics: using repeated bisection as the clustering tool, we divide the set of tweets into clusters and extract the topics of each cluster.
3. Generate a topic graph: using the topics of the tweets and the Wikipedia category tree, we generate a topic graph of the tweets.
4. Classify topics: classify the topics, which are the nodes of the topic graph, as known topics or unknown topics.
5. Visualize the topic graph: we visualize the topic graph and the corresponding tweets in our Web user interface.
0. Extraction of missing tweets
We extract the tweets that the user has not browsed.
   We assume that the browsing time is given.
   Browsing time can be obtained if we build our own Twitter client application.
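As a minimal sketch of this step, assume the browsing time is given as a list of (start, end) sessions and every tweet carries a posting timestamp; the tweets posted outside every session are then treated as missing. The names `Tweet`, `BrowsingSession`, and `extract_missing_tweets` are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tweet:
    text: str
    posted_at: datetime


# Assumption: a browsing session is a (start, end) pair of timestamps.
BrowsingSession = tuple[datetime, datetime]


def extract_missing_tweets(tweets: list[Tweet],
                           sessions: list[BrowsingSession]) -> list[Tweet]:
    """Return the tweets posted outside every browsing session."""
    def was_browsing(moment: datetime) -> bool:
        return any(start <= moment <= end for start, end in sessions)

    return [t for t in tweets if not was_browsing(t.posted_at)]


# Toy data: one 30-minute session and three tweets.
sessions = [(datetime(2017, 12, 5, 9, 0), datetime(2017, 12, 5, 9, 30))]
timeline = [
    Tweet("Today's baseball game is exciting!", datetime(2017, 12, 5, 8, 0)),
    Tweet("I'm at baseball stadium!", datetime(2017, 12, 5, 9, 10)),
    Tweet("I'm at Salzburg!", datetime(2017, 12, 5, 12, 0)),
]
missing = extract_missing_tweets(timeline, sessions)
print([t.text for t in missing])
```

Only the tweet posted at 9:10 falls inside the session, so the other two are reported as missing.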
1. Clustering tweets
Figure: Clustering of tweets. Tweets are grouped into clusters such as C0 = Ichiro and C1 = Masahiro; C3 = Human is deleted because it is too wide to cover topics.
We use repeated bisection for clustering tweets.
   In our experiments, repeated bisection was the most effective method for clustering short texts.
   It is similar to k-means.
We remove noise clusters.
   We calculate the cosine similarity between every pair of texts in a cluster.
   We remove a cluster when its similarity does not meet the threshold.
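The noise-removal step could look like the following sketch. It assumes that a cluster is discarded when the mean pairwise cosine similarity of its texts falls below the threshold; the slide does not say how the pairwise values are aggregated or on which side of the threshold a cluster is removed, so both choices are assumptions. The function names and the sparse `dict` vectors are hypothetical.

```python
import math
from itertools import combinations

Vector = dict[str, float]  # sparse feature vector: term -> weight


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(cluster: list[Vector]) -> float:
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a cluster with a single text is treated as coherent
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def remove_noise_clusters(clusters: dict[str, list[Vector]],
                          threshold: float) -> dict[str, list[Vector]]:
    """Keep only clusters whose texts are similar enough to each other."""
    return {name: vectors for name, vectors in clusters.items()
            if mean_pairwise_similarity(vectors) >= threshold}


# Toy clusters: C0 shares a term, C3 shares nothing (a "too wide" cluster).
clusters = {
    "C0": [{"ichiro": 1.0, "marlins": 1.0}, {"ichiro": 1.0, "hitter": 1.0}],
    "C3": [{"human": 1.0, "weather": 1.0}, {"lunch": 1.0, "train": 1.0}],
}
kept = remove_noise_clusters(clusters, threshold=0.3)
print(sorted(kept))
```

With these toy vectors the two texts in C0 have a similarity of about 0.5 and those in C3 have 0, so only C0 survives a threshold of 0.3.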
Repeated bisection
Given a set of tweets $T$, we extract a feature vector for each tweet. First, we split each tweet into terms using morphological analysis or a POS tagger. Then, we select nouns and unknown terms as feature terms. We include unknown terms because they consist of slang and newly coined words that the morphological analyzer does not recognize. To clean the feature terms, we keep only the terms that are included in more than two tweets. The feature vector $f(t_i)$ of tweet $t_i$ ($t_i \in T$) is defined as follows.

\[
f(t_i) = \bigl[\mathrm{tf}(t_i, w_1)\cdot\mathrm{idf}(w_1),\ \mathrm{tf}(t_i, w_2)\cdot\mathrm{idf}(w_2),\ \ldots,\ \mathrm{tf}(t_i, w_m)\cdot\mathrm{idf}(w_m)\bigr] \tag{1}
\]

\[
\mathrm{tf}(t_i, w_j) =
\begin{cases}
1 & \text{if } w_j \text{ appears in } t_i \text{ at least once} \\
0 & \text{otherwise}
\end{cases} \tag{2}
\]

\[
\mathrm{idf}(w_j) = -\log \frac{\mathrm{df}(w_j)}{|T|} \tag{3}
\]

where $w_j$ is a term in $T$, $|T|$ is the number of tweets in $T$, $\mathrm{tf}(t_i, w_j)$ indicates whether $w_j$ appears in $t_i$, $\mathrm{df}(w_j)$ is the number of tweets that contain $w_j$, and $\mathrm{idf}(w_j)$ is the IDF (inverse document frequency) value of $w_j$, where each tweet is treated as a document.
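The feature-vector definition can be sketched as follows, under the assumption that each tweet has already been reduced to its feature terms (nouns and unknown terms) by a morphological analyzer, which is not reproduced here. The function name `build_feature_vectors` and the toy tweets are hypothetical; the term filter and the binary tf and idf values follow the definitions given above.

```python
import math


def build_feature_vectors(tweets: list[list[str]], min_tweets: int = 3):
    """Build binary-tf * idf vectors; each tweet is a list of feature terms."""
    total = len(tweets)
    term_sets = [set(terms) for terms in tweets]

    # df(w): number of tweets that contain term w
    df: dict[str, int] = {}
    for terms in term_sets:
        for term in terms:
            df[term] = df.get(term, 0) + 1

    # Keep only terms included in more than two tweets (min_tweets = 3).
    vocabulary = sorted(term for term, count in df.items() if count >= min_tweets)

    # idf(w) = -log(df(w) / |T|)
    idf = {term: -math.log(df[term] / total) for term in vocabulary}

    # tf is 1 if the term occurs in the tweet and 0 otherwise.
    vectors = [
        [(1.0 if term in terms else 0.0) * idf[term] for term in vocabulary]
        for terms in term_sets
    ]
    return vocabulary, vectors


example_tweets = [
    ["baseball", "game"],
    ["baseball", "stadium"],
    ["baseball", "stadium", "ichiro"],
    ["salzburg", "stadium"],
]
vocabulary, vectors = build_feature_vectors(example_tweets)
print(vocabulary)
print(vectors)
```

In this toy input only "baseball" and "stadium" occur in more than two tweets, so every vector has two components, each either 0 or -log(3/4).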
2. Topic graph
2. Generate a topic graph
Figure: Generating a topic graph. Tweets correspond to categories; a topic node is a parent node of tweet clusters, and an abstract node is a parent node of topic nodes. Node labels in the example: Ichiro, Masahiro, MLB player, Sports, Japanese.
1. Generate a topic node, which corresponds to a tweet cluster.
2. Generate a semantic node, which corresponds to a topic node.
3. Merge multiple nodes into a simpler node structure.
2.1 Generate a topic node
Topic node: a Wikipedia article that corresponds to a cluster.
Method
1. The repeated bisection method outputs keywords for each category, together with the degree of relatedness between each keyword and the category.
2. We search the Wikipedia articles and find the one most relevant to the category.
   Many Wikipedia categories have their own articles, so these categories are also candidates for topic nodes.
Example
   A category has the keywords {(“baseball”, 1), (“player”, 0.5)}.
   There are two Wikipedia articles, wp and wq:
   the title of wp is “baseball team” and the title of wq is “baseball player.”
   Calculate a score for each article:
   wp = 1 + 0 = 1, and wq = 1 + 0.5 = 1.5.
   We select the article wq, “baseball player,” as the topic node.
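The worked example can be reproduced with a few lines of code. This sketch assumes that an article's score is the sum of the weights of the cluster keywords that occur as whole words in its title; the slide does not spell out how keywords are matched against articles, and the function names are hypothetical.

```python
def score_article(title: str, keywords: dict[str, float]) -> float:
    """Sum the weights of the cluster keywords that occur in the title."""
    title_terms = set(title.lower().split())
    return sum(weight for term, weight in keywords.items() if term in title_terms)


def select_topic_node(titles: list[str], keywords: dict[str, float]) -> str:
    """Pick the article title with the highest keyword score."""
    return max(titles, key=lambda title: score_article(title, keywords))


keywords = {"baseball": 1.0, "player": 0.5}
titles = ["baseball team", "baseball player"]
scores = {title: score_article(title, keywords) for title in titles}
print(scores)
print(select_topic_node(titles, keywords))
```

The scores come out as 1.0 for "baseball team" and 1.5 for "baseball player," so the latter is selected, matching the example.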
2.2 Generate a semantic node
Semantic node: a Wikipedia category to which the article of the topic node belongs.
Method
1. Get the category names of the article from Wikipedia.
2. Prune unsuitable categories from the semantic nodes using a blacklist.
   e.g., “Person born in 19xx,” “Stub,” “A list of xx,” ...
Example
   Category c0 is tagged with “Ichiro Suzuki.”
   The article “Ichiro Suzuki” has two categories, “Yankees Players” and “Baseball Players.”
   “Yankees Players” and “Baseball Players” are treated as semantic nodes.
Figure: Ichiro Suzuki is linked to Yankees Player and Baseball Player; Kenta Maeda is linked to Baseball Player and Dodgers Player.
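A minimal sketch of the two method steps, assuming the article-to-category lookup is already available as a dictionary (a real system would query Wikipedia for it). The blacklist patterns are hypothetical regular expressions modeled on the examples above, and the two extra categories in the toy data are made up to show the pruning.

```python
import re

# Hypothetical blacklist patterns modeled on the examples on the slide.
BLACKLIST_PATTERNS = [
    re.compile(r"^Person born in \d{4}$"),
    re.compile(r"\bStub\b", re.IGNORECASE),
    re.compile(r"^A list of "),
]


def semantic_nodes(article: str, categories_of: dict[str, list[str]]) -> list[str]:
    """Return the article's categories that survive blacklist pruning."""
    return [
        category
        for category in categories_of.get(article, [])
        if not any(pattern.search(category) for pattern in BLACKLIST_PATTERNS)
    ]


# Toy stand-in for a Wikipedia category lookup.
categories_of = {
    "Ichiro Suzuki": [
        "Yankees Players",
        "Baseball Players",
        "Person born in 1973",  # made-up entry to exercise the blacklist
        "Baseball stub",        # made-up entry to exercise the blacklist
    ],
}
print(semantic_nodes("Ichiro Suzuki", categories_of))
```

Only "Yankees Players" and "Baseball Players" remain, which are the two semantic nodes of the example.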
2.3 Merge multiple nodes
Figure: Example of two network graphs. Ichiro Suzuki is linked to Yankees Player and Baseball Player; Kenta Maeda is linked to Baseball Player and Dodgers Player.
Figure: Two graphs are merged if they share a common node. After merging, Ichiro Suzuki and Kenta Maeda share a single Baseball Player node.
Figure: If a leaf node and a non-leaf node correspond to the same article, these nodes are merged. The example graph additionally contains a Sportspeople node.
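The merging rules in the figures can be sketched by identifying nodes by their labels: graphs that share a label are unioned, and a label that is a leaf in one graph and an inner node in another automatically becomes one node. The representation (node label mapped to the labels of its parents) and the function names are assumptions, and the edge from Baseball Player to Sportspeople is added only for illustration.

```python
Graph = dict[str, set[str]]  # node label -> labels of its parent nodes


def graph_nodes(graph: Graph) -> set[str]:
    """All labels in a graph, whether they appear as a child or as a parent."""
    nodes = set(graph)
    for parents in graph.values():
        nodes |= parents
    return nodes


def merge_graphs(graphs: list[Graph]) -> list[Graph]:
    """Merge every group of graphs that share at least one node label."""
    merged: list[Graph] = []
    for graph in graphs:
        current = {node: set(parents) for node, parents in graph.items()}
        remaining = []
        for other in merged:
            if graph_nodes(current) & graph_nodes(other):
                # Shared label found: fold the other graph into the current one.
                for node, parents in other.items():
                    current.setdefault(node, set()).update(parents)
            else:
                remaining.append(other)
        merged = remaining + [current]
    return merged


ichiro = {"Ichiro Suzuki": {"Yankees Player", "Baseball Player"}}
maeda = {"Kenta Maeda": {"Baseball Player", "Dodgers Player"}}
upper = {"Baseball Player": {"Sportspeople"}}  # illustrative assumption

result = merge_graphs([ichiro, maeda, upper])
print(len(result))
print(result[0])
```

All three inputs share the label Baseball Player, so the result is a single graph in which both players point to the same Baseball Player node.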
Visualization
Visualize topic nodes and semantic nodes
Figure: Visualization. The topic graph (Ichiro and Masahiro under Japanese MLB Player) is displayed next to the tweet list, which shows the tweets about each topic.
   About Ichiro: “Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.”
   About Masahiro: “Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.”
Experiments
Experimental Setup
Aim of our experiment:
   To confirm whether our method is effective.
   To identify which themes of tweets are appropriate for our proposed method.
Evaluation measure: precision ratio
   We (the second author of our paper) manually selected an appropriate category for each tweet.
   We calculate the precision ratio for each category.
   \[
   \text{precision} = \frac{\text{number of accurately categorized tweets}}{\text{number of tweets in the category}}
   \]
Dataset
   Categories: Politics, Music, Computer, Sports, and Animation/Games (five categories)
   Tweets: we prepared 2,000 tweets for each category using the Twitter search API.
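The precision ratio defined on this slide can be computed as in the following sketch, which compares the category assigned by the method with the manually selected one. The function name and the six toy labels are hypothetical and are not data from the experiment.

```python
def precision_per_category(assigned: list[str], gold: list[str]) -> dict[str, float]:
    """precision = accurately categorized tweets / tweets in the category."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for predicted, expected in zip(assigned, gold):
        totals[predicted] = totals.get(predicted, 0) + 1
        if predicted == expected:
            correct[predicted] = correct.get(predicted, 0) + 1
    return {category: correct.get(category, 0) / count
            for category, count in totals.items()}


# Toy labels: category given by the method vs. category chosen manually.
assigned = ["Politics", "Politics", "Politics", "Politics", "Sports", "Sports"]
gold = ["Politics", "Politics", "Politics", "Music", "Sports", "Music"]
print(precision_per_category(assigned, gold))
```

Here three of the four tweets placed in Politics and one of the two placed in Sports are correct, giving 0.75 and 0.5.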
Procedure of the experiment
1. Cluster the 2,000 tweets of each theme and extract the topics of each cluster.
2. Generate the topic graph using our proposed method.
3. Give the clusters and their corresponding Wikipedia article titles to the observers.
   Observers were hired through a crowdsourcing service (Crowdworks).
4. Observers evaluate whether the article titles appropriately represent the clusters on the following five-point scale (5: appropriate, 4: almost appropriate, 3: cannot say, 2: almost inappropriate, 1: inappropriate).
5. Summarize the observers’ evaluations and analyze whether our proposed method achieves good accuracy.
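Step 5 could be sketched as follows: average the observers' 1-to-5 ratings for each cluster and count how many averages fall into each unit-wide bin. The half-open bin boundaries, the function names, and the sample ratings are assumptions made for illustration.

```python
from statistics import mean

# Assumption: four unit-wide bins over the 1-to-5 rating scale.
BINS = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]


def bin_label(average: float) -> str:
    """Map an average score in [1, 5] to its bin; 5.0 goes to the last bin."""
    for low, high in BINS:
        if low <= average < high or (high == 5.0 and average == 5.0):
            return f"{low:.1f}-{high:.1f}"
    raise ValueError(f"score out of range: {average}")


def summarize(ratings_per_cluster: dict[str, list[int]]) -> dict[str, int]:
    """Count the clusters whose average rating falls into each bin."""
    counts = {f"{low:.1f}-{high:.1f}": 0 for low, high in BINS}
    for ratings in ratings_per_cluster.values():
        counts[bin_label(mean(ratings))] += 1
    return counts


# Made-up ratings from three observers for three clusters.
ratings = {
    "cluster 0": [5, 4, 5],
    "cluster 1": [2, 3, 1],
    "cluster 2": [3, 3, 4],
}
print(summarize(ratings))
```

The three averages are roughly 4.67, 2.0, and 3.33, so the bins 4.0-5.0, 2.0-3.0, and 3.0-4.0 each receive one cluster.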
Experimental Results
Figure: Numbers of evaluation scores for the respective bins. x-axis: average of evaluation scores (bins 1.0-2.0, 2.0-3.0, 3.0-4.0, 4.0-5.0); y-axis: number of evaluation scores (0 to 25); series: Politics, Music, Computer, Sports, Video Games.

Theme             | # obsv. | Prec.
Politics          | 8       | 0.72
Music             | 11      | 0.56
Computer          | 5       | 0.44
Sports            | 5       | 0.42
Animation & Games | 4       | 0.52
Our method is useful for tweets about politics.
   There are many technical terms about politics.
   Many articles are on Wikipedia.
Our method is not effective for computer and sports.
   There is a wide variety of topics.
   Fewer articles are on Wikipedia.
Merging Multiple topics
Figure: One example of a topic/semantic graph. Node labels: “49: Yakult,” “Japanese beverages,” “34: Softbank,” “Companies listed on the Pink Sheets,” and “Companies based in Tokyo.”
Black nodes are topic nodes, and gray nodes are semantic nodes.
There are topic nodes for “Yakult” and “Softbank.”
   Yakult: a manufacturer of drinks
   Softbank: a cell phone carrier
Both companies are based in Tokyo.
Our proposed method for merging the nodes of multiple topics connects these two nodes.
Conclusion
We proposed a method for automatically extracting a user’s missing tweets based on topic granularity and the time during which the user was not browsing.
   We extract missing tweets based on the missing time.
   We proposed a method for mapping a set of extracted missing tweets to the Wikipedia category tree by considering the granularity of the topic structure.
We confirmed that our proposed method is effective for “politics” but not effective for “computer” and “sports.”
Future work
   We should consider resources other than Wikipedia as a knowledge base.
      Wikipedia is not always suitable for personal topics.
   We should consider synonyms.
   We should compare our method with other methods.
   We should conduct a usability test of the Web user interface.

Finding Missing Tweets using Topic Structure and Browsing Time

  • 1. 1/19 Finding Missing Tweets using Topic Structure and Browsing Time Finding Missing Tweets using Topic Structure and Browsing Time Yu Suzuki† , Hiromitsu Ohara‡ , Akiyo Nadamoto‡ † Nara Institute of Science and Technology, Japan ‡ Konan University, Japan 5. December, 2017
  • 2. 2/19 Finding Missing Tweets using Topic Structure and Browsing Time Introduction Introduction From Social Network Services (SNSs), there are massive volumes of messages. Users are not always on-line. Users miss important information on SNSs. c.f.) A function on twitter “While you were away.” The structure of summarization is flat. Users need to understand in a short time about the topics while the users are off-line. A mechanism of summarizing the tweets is useful. We believe that when we summarize the tweets as a tree structure, the users can easily understand the topics. Summarize Tweets Using Topic Structure and Browsing Time
  • 3. 3/19 Finding Missing Tweets using Topic Structure and Browsing Time Introduction Why we consider topic structure? missing tweets topic sub topic Today’s baseball game is exciting! baseball game Yesterday I went to baseball stadium baseball place I’m at Salzburg! travel austria I’m at baseball stadium! baseball place · · · · · · · · · Tweets with minority topics are ignored if we summarize missing tweets. Missing tweets are mainly related to “baseball.” Only one tweet is related to “travel.” If these tweets are summarized without using topics, the tweet about travel may not be appeared at the summary. We visualize this topics of tweets as a tree structure. First, the users see top-level topics, such as “baseball” and “travel.” if the users are interested in “baseball,” the users browse “game” and “place.” Users do not miss a tweet about travel. How to construct the topic structure?
  • 4. 4/19 Finding Missing Tweets using Topic Structure and Browsing Time Introduction Our contribution 1 Generate topic structures of tweets using the Wikipedia category tree and browsing time We use Wikipedia category as a knowledge to construct tree structure. We use browsing time as a tweets which users miss. 2 Visualize the topic structure of tweets using a network graph We implement our method using Web application. 3 Confirm using real dataset that our proposed method is effective for commonly known topics Our method is effective if there are many information about the theme. Wikipedia only have articles about commonly known topics.
  • 5. 5/19 Finding Missing Tweets using Topic Structure and Browsing Time Our Proposed Method Overview 2. Generate a Topic Graph Wikipedia Category Tree Tweets 1. Clustering of Tweets C0 = Ichiro C1 = Masahiro C3 = Human ➡ delete too wide to cover topics Ichiro Masahiro MLB playerSportsJapanese Topic node: a parent node of Tweet clusters 3. Visualization Ichiro Masahiro Japanese MLB Player Tweet list Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing. Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory. Baseball Sports Basketball Mariners MLB Players Japan Players Abstract node: a parent node of topic nodes Topic Graph tweets correspond to category about Ichiro about Masahiro
  • 6. Our Proposed Method: Steps Overview
    1 Extract missing tweets: Identify which tweets were submitted during the user’s browsing time and which were submitted before or after it.
    2 Cluster tweets into categories and extract topics: Using Repeated Bisection as the clustering tool, we divide a set of tweets into clusters and extract the topics of each cluster.
    3 Generate a topic graph: Using the topics of the tweets and the Wikipedia category tree, we generate a topic graph of the tweets.
    4 Classify topics: We classify the topics, which are the nodes of the topic graph, as known topics or unknown topics.
    5 Visualize topic graphs: We visualize the topic graph and the corresponding tweets using our Web user interface.
  • 7. Our Proposed Method: 0. Extraction of missing tweets
    We extract the tweets that the user has not browsed.
    We assume that the browsing time is given.
    Browsing time may be available if we build a Twitter client application.
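A minimal sketch of this extraction step. The slide does not say how browsing time is represented, so this assumes it is a list of (start, end) intervals and that a tweet counts as missed when it was posted outside every interval. `Tweet`, `missing_tweets`, and the interval format are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tweet:
    """Hypothetical tweet record: text plus posting time."""
    text: str
    posted_at: datetime


def missing_tweets(tweets, browsing_intervals):
    """Return the tweets posted while the user was not browsing.

    Assumption: browsing_intervals is a list of (start, end) datetime
    pairs, and a tweet posted inside any interval was seen by the user.
    """
    def was_browsing(moment):
        return any(start <= moment <= end for start, end in browsing_intervals)

    return [tweet for tweet in tweets if not was_browsing(tweet.posted_at)]
```

A real client would also need to decide how to treat tweets that were on screen but never scrolled into view; this sketch ignores that case.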
  • 8. Our Proposed Method: 1. Clustering tweets
    [Figure: Tweets ➡ 1. Clustering of Tweets. C0 = Ichiro, C1 = Masahiro, C3 = Human ➡ deleted (too wide to cover topics).]
    We use repeated bisection for clustering tweets.
      In our experiments, repeated bisection was the most effective method for clustering short texts.
      It is similar to k-means.
    We remove noise clusters.
      We calculate the cosine similarity between every pair of texts in a cluster.
      We remove the nodes whose similarity is beyond the threshold.
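The noise-cluster filter could be sketched as follows. The slide does not state how the pairwise similarities are aggregated or on which side of the threshold a cluster is removed, so this assumes the mean pairwise cosine similarity is used and that a cluster is treated as noise when that mean is below the threshold. The function names and the default threshold of 0.1 are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(cluster):
    """Average cosine similarity over every pair of tweets in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        # A cluster with fewer than two tweets has no pairs; keep it (assumption).
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def remove_noise_clusters(clusters, threshold=0.1):
    """Keep clusters whose tweets are, on average, similar enough to each other."""
    return [c for c in clusters if mean_pairwise_similarity(c) >= threshold]
```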
  • 9. Our Proposed Method: 1. Clustering tweets: Repeated bisection
    Given a set of tweets T, we extract a feature vector for each tweet.
    First, we divide a tweet into terms using morphological analysis or a POS tagger.
    Then, we select nouns and unknown terms as feature terms. We use unknown terms because they include slang and newly coined words that the morphological analyzer does not recognize.
    To clean the feature terms, we select the terms which are included in more than two tweets.
    The feature vector f(t_i) of tweet t_i (t_i ∈ T) is defined as follows.
    $$f(t_i) = [\,\mathrm{tf}(t_i, w_1) \cdot \mathrm{idf}(w_1),\ \mathrm{tf}(t_i, w_2) \cdot \mathrm{idf}(w_2),\ \cdots,\ \mathrm{tf}(t_i, w_m) \cdot \mathrm{idf}(w_m)\,] \tag{1}$$
    $$\mathrm{tf}(t_i, w_j) = \begin{cases} 1 & \text{if } w_j \text{ appears in } t_i \text{ at least once} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
    $$\mathrm{idf}(w_j) = -\log \frac{\mathrm{df}(w_j)}{|T|} \tag{3}$$
    where w_j is a term in T, |T| is the number of tweets in T, tf(t_i, w_j) indicates whether w_j appears in t_i or not, df(w_j) is the number of tweets that contain w_j, and idf(w_j) is the IDF (Inverse Document Frequency) value of w_j, where a document is a tweet.
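Equations (1) to (3) can be illustrated with a short sketch. It assumes the tweets have already been tokenized and reduced to nouns and unknown terms (that step depends on the morphological analyzer and is not shown), and it reads "more than two tweets" as a document frequency of at least 3. `build_feature_vectors` and `min_df` are hypothetical names.

```python
import math


def build_feature_vectors(tokenized_tweets, min_df=3):
    """Build binary-TF x IDF vectors for a list of tokenized tweets.

    tokenized_tweets: list of term lists, one per tweet (assumed to be
    already filtered to nouns and unknown terms).
    Returns (vocabulary, vectors), where vectors[i][j] is
    tf(t_i, w_j) * idf(w_j) with tf in {0, 1}.
    """
    total = len(tokenized_tweets)
    term_sets = [set(terms) for terms in tokenized_tweets]

    # df(w): number of tweets that contain term w.
    df = {}
    for terms in term_sets:
        for term in terms:
            df[term] = df.get(term, 0) + 1

    # Keep only the terms that occur in more than two tweets.
    vocabulary = sorted(term for term, count in df.items() if count >= min_df)

    # idf(w) = -log(df(w) / |T|)
    idf = {term: -math.log(df[term] / total) for term in vocabulary}

    vectors = [
        [idf[term] if term in terms else 0.0 for term in vocabulary]
        for terms in term_sets
    ]
    return vocabulary, vectors
```

Note that a term appearing in every tweet gets an IDF of 0 under Equation (3), so it contributes nothing to any vector.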
  • 10. Our Proposed Method: 2. Generate a topic graph
    [Figure: 2. Generate a Topic Graph. Tweets correspond to categories (Ichiro, Masahiro, MLB player, Sports, Japanese). Topic node: a parent node of tweet clusters. Abstract node: a parent node of topic nodes.]
    1 Generate a topic node, which corresponds to the tweets.
    2 Generate a semantic node, which corresponds to a topic node.
    3 Merge multiple nodes into a simple structure of nodes.
  • 11. Our Proposed Method: 2.1 Generate a topic node
    Topic node: a Wikipedia article that corresponds to a cluster.
    Method
    1 The repeated bisection method outputs keywords for each category, together with the degree of relatedness between each keyword and the category.
    2 We retrieve articles in Wikipedia and find the most relevant one.
      Many categories have their own articles, so categories are also candidates for topic nodes.
    Example
    A category has the keywords {(“baseball”, 1), (“player”, 0.5)}.
    There are two Wikipedia articles, w_p and w_q: the title of w_p is “baseball team” and the title of w_q is “baseball player.”
    Calculate a score for each article: w_p = 1 + 0 = 1 and w_q = 1 + 0.5 = 1.5.
    We select the article w_q, “baseball player,” as the topic node.
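The scoring in the example above can be sketched as follows, assuming an article's score is the sum of the weights of the cluster keywords that appear as words in its title. The slide only gives the worked example, so this matching rule is an assumption, and `score_article` and `select_topic_node` are hypothetical names.

```python
def score_article(title, keywords):
    """Sum the weights of the keywords that occur as words in the title.

    keywords: {keyword: relatedness weight} for one cluster.
    """
    title_terms = set(title.lower().split())
    return sum(
        weight for term, weight in keywords.items() if term.lower() in title_terms
    )


def select_topic_node(titles, keywords):
    """Pick the candidate article title with the highest score."""
    return max(titles, key=lambda title: score_article(title, keywords))


keywords = {"baseball": 1.0, "player": 0.5}
candidates = ["baseball team", "baseball player"]
best = select_topic_node(candidates, keywords)
```

With the keywords from the slide, “baseball team” scores 1.0 and “baseball player” scores 1.5, so the latter is selected.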
  • 12. Our Proposed Method: 2.2 Generate a semantic node
    Semantic node: a Wikipedia category that corresponds to the topic node.
    Method
    1 Get the category names from Wikipedia.
    2 Prune unsuitable categories from the semantic nodes using a blacklist.
      Person born in 19xx, Stub, A list of xx, . . .
    Example
    Category c0 is tagged with “Ichiro Suzuki.”
    The article “Ichiro Suzuki” has two categories, “Yankees Players” and “Baseball Players.”
    “Yankees Players” and “Baseball Players” are considered semantic nodes.
    [Figure: two graphs. Ichiro Suzuki is linked to Yankees Player and Baseball Player; Kenta Maeda is linked to Baseball Player and Dodgers Player.]
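A minimal sketch of the blacklist pruning step. The three regular expressions below are illustrative guesses based on the blacklist examples on the slide ("Person born in 19xx", "Stub", "A list of xx"); the actual blacklist is not given, and `BLACKLIST_PATTERNS` and `semantic_nodes` are hypothetical names.

```python
import re

# Hypothetical patterns modelled on the slide's blacklist examples.
BLACKLIST_PATTERNS = [
    re.compile(r"^Person born in \d{4}$", re.IGNORECASE),
    re.compile(r"\bStub\b", re.IGNORECASE),
    re.compile(r"^A list of\b", re.IGNORECASE),
]


def semantic_nodes(categories):
    """Return the categories that do not match any blacklist pattern."""
    return [
        category
        for category in categories
        if not any(pattern.search(category) for pattern in BLACKLIST_PATTERNS)
    ]


categories = [
    "Yankees Players",
    "Baseball Players",
    "Person born in 1973",
    "A list of MLB players",
]
kept = semantic_nodes(categories)
```

Here only “Yankees Players” and “Baseball Players” survive, matching the example on the slide.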
  • 13. Our Proposed Method: 2.3 Merge multiple nodes
    [Figure: Example of two network graphs. One contains Ichiro Suzuki, Yankees Player, and Baseball Player; the other contains Kenta Maeda, Baseball Player, and Dodgers Player.]
    [Figure: Two nodes are merged if two graphs share a common node. Ichiro Suzuki and Kenta Maeda now share a single Baseball Player node.]
    [Figure: If a leaf node and a non-leaf node correspond to the same article, these nodes are merged. The graph also contains a Sportspeople node.]
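The merging step can be sketched with graphs stored as dictionaries that map each node label to the set of its parent labels. This sketch assumes two nodes are the same whenever their labels are equal, so shared nodes, and leaf and non-leaf nodes for the same article, collapse automatically when the dictionaries are combined. `merge_graphs` and `children_of` are hypothetical names.

```python
def merge_graphs(*graphs):
    """Merge graphs given as {node: set of parent nodes} dictionaries.

    Nodes with the same label are treated as one node, so their parent
    sets are combined.
    """
    merged = {}
    for graph in graphs:
        for node, parents in graph.items():
            merged.setdefault(node, set()).update(parents)
            for parent in parents:
                # Make sure every parent also exists as a node.
                merged.setdefault(parent, set())
    return merged


def children_of(graph, target):
    """Return the nodes that list target as one of their parents."""
    return {node for node, parents in graph.items() if target in parents}


ichiro = {"Ichiro Suzuki": {"Yankees Player", "Baseball Player"}}
maeda = {"Kenta Maeda": {"Baseball Player", "Dodgers Player"}}
merged = merge_graphs(ichiro, maeda)
```

After merging, the single “Baseball Player” node has both “Ichiro Suzuki” and “Kenta Maeda” as children, as in the second figure.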
  • 14. Our Proposed Method: Visualization: Visualize topic nodes and semantic nodes
    [Figure: 3. Visualization. A topic graph (Ichiro, Masahiro, Japanese MLB Player) shown next to a tweet list.]
    Tweet list
      About Ichiro: “Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.”
      About Masahiro: “Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.”
  • 15. Experiments: Experimental Setup
    Aim of our experiment:
      To confirm whether our method is effective.
      To find which themes of tweets are appropriate for applying our proposed method.
    Evaluation measure: precision ratio
      The second author of the paper manually selected an appropriate category for each tweet.
      We calculate the precision ratio for each category.
      $$\text{precision} = \frac{\text{number of accurately categorized tweets}}{\text{number of tweets in the category}}$$
    Dataset
      Categories: Politics, Music, Computer, Sports, and Animation/Games (five categories)
      Tweets: We prepared 2,000 tweets for each category, collected with the Twitter search API.
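The precision ratio above can be computed per category with a few lines. This sketch assumes the method's output and the manual labels are both available as dictionaries from a tweet ID to a category name; `precision_by_category` and the sample data are hypothetical and are not results from the experiment.

```python
from collections import defaultdict


def precision_by_category(predicted, gold):
    """Compute precision for each predicted category.

    predicted: {tweet_id: category assigned by the method}
    gold:      {tweet_id: category selected manually}
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for tweet_id, category in predicted.items():
        total[category] += 1
        if gold.get(tweet_id) == category:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data for illustration only.
predicted = {1: "Politics", 2: "Politics", 3: "Music", 4: "Politics"}
gold = {1: "Politics", 2: "Music", 3: "Music", 4: "Politics"}
scores = precision_by_category(predicted, gold)
```

In this toy example two of the three tweets placed in “Politics” are correct, so its precision is 2/3, and “Music” has a precision of 1.0.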
  • 16. Experiments: Experimental Setup: Procedure of the experiment
    1 Cluster the 2,000 tweets of each theme, and extract the topics of each cluster.
    2 Generate the topic graph using our proposed method.
    3 Give the clusters and their corresponding Wikipedia article titles to the observers.
      Observers are hired through crowdsourcing (Crowdworks).
    4 Observers evaluate whether the article titles are appropriate for representing the clusters on the following five-point scale (5: appropriate, 4: almost appropriate, 3: cannot say, 2: almost inappropriate, 1: inappropriate).
    5 Summarize the observers’ evaluations, and analyze whether our proposed method has good accuracy.
  • 17. Experiments: Experimental Results
    [Figure: histogram. x-axis: average of evaluation scores (bins 1.0-2.0, 2.0-3.0, 3.0-4.0, 4.0-5.0); y-axis: number of evaluation scores (0 to 25); series: Politics, Music, Computer, Sports, Video Games.]
    Table: Numbers of evaluation scores for respective bins.
      Theme             | # observers | Precision
      Politics          | 8           | 0.72
      Music             | 11          | 0.56
      Computer          | 5           | 0.44
      Sports            | 5           | 0.42
      Animation & Games | 4           | 0.52
    Our method is useful for tweets about politics.
      There are many technical terms about politics.
      Many articles are on Wikipedia.
    Our method is not effective for computer and sports.
      There is a wide variety of topics.
      Fewer articles are on Wikipedia.
  • 18. Experiments: Experimental Results: Merging multiple topics
    [Figure: graph with the nodes “49: Yakult,” “Japanese beverages,” “34: Softbank,” “Companies listed on the Pink Sheets,” and “Companies based in Tokyo.”]
    One example of a topic/semantic graph.
    A black node is a topic node, and a gray node is a semantic node.
    There are topic nodes for “Yakult” and “Softbank.”
      Yakult: a manufacturer of drinks
      Softbank: a cell phone carrier
    Both companies are based in Tokyo.
    We can connect the two nodes using our proposed method for merging the nodes of multiple topics.
  • 19. Conclusion
    We proposed a method for automatically extracting a user’s missing tweets based on topic granularity and the time during which the user was not browsing.
      We extract missing tweets based on the missing time.
      We propose a method for mapping a set of extracted missing tweets to the Wikipedia category tree by considering the granularity of the topic structure.
    We confirmed that our proposed method is effective for “politics,” but not effective for “computer” and “sports.”
    Future Work
      We should consider resources other than Wikipedia as a knowledge base. Wikipedia is not always suitable for personal topics.
      We should consider synonyms.
      We should compare our method with other methods.
      We should run a usability test of the Web user interface.