Finding Missing Tweets using Topic Structure and Browsing Time

  1. Finding Missing Tweets using Topic Structure and Browsing Time. Yu Suzuki†, Hiromitsu Ohara‡, Akiyo Nadamoto‡ († Nara Institute of Science and Technology, Japan; ‡ Konan University, Japan). 5 December 2017.
  2. Introduction. Social network services (SNSs) carry massive volumes of messages. Users are not always online, so they miss important information on SNSs (cf. Twitter’s “While you were away” function, whose summary has a flat structure). Users need to understand, in a short time, the topics that came up while they were offline, so a mechanism for summarizing tweets is useful. We believe that when tweets are summarized as a tree structure, users can easily understand the topics. Goal: summarize tweets using topic structure and browsing time.
  3. Introduction: Why do we consider topic structure?
     Missing tweet | Topic | Subtopic
     “Today’s baseball game is exciting!” | baseball | game
     “Yesterday I went to baseball stadium” | baseball | place
     “I’m at Salzburg!” | travel | austria
     “I’m at baseball stadium!” | baseball | place
     · · · | · · · | · · ·
     Tweets with minority topics are ignored if we summarize missing tweets. Here the missing tweets are mainly related to “baseball,” and only one tweet is related to “travel.” If these tweets are summarized without using topics, the tweet about travel may not appear in the summary. We visualize the topics of the tweets as a tree structure. First, users see top-level topics such as “baseball” and “travel.” If they are interested in “baseball,” they browse “game” and “place.” Users do not miss the tweet about travel. The open question is how to construct the topic structure.
  4. Introduction: Our contribution
     1. Generate topic structures of tweets using the Wikipedia category tree and browsing time. We use the Wikipedia category tree as the knowledge source for constructing the tree structure, and we use browsing time to identify the tweets that users missed.
     2. Visualize the topic structure of tweets as a network graph. We implemented our method as a Web application.
     3. Confirm on a real dataset that our proposed method is effective for commonly known topics. Our method is effective when there is much information about the theme; Wikipedia only has articles about commonly known topics.
  5. Our Proposed Method: Overview. [Figure: pipeline overview in three stages. (1) Clustering of tweets: tweets are grouped into clusters such as C0 = Ichiro and C1 = Masahiro; the cluster C3 = Human is deleted because it is too wide to cover topics. (2) Generate a topic graph from the clusters and the Wikipedia category tree (example categories: Baseball, Sports, Basketball, Mariners, MLB Players, Japan Players): a topic node (Ichiro, Masahiro) is a parent node of tweet clusters, an abstract node (MLB player, Sports, Japanese) is a parent node of topic nodes, and tweets correspond to categories. (3) Visualization: the topic graph is shown next to a tweet list, e.g., “Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.” (about Ichiro) and “Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.” (about Masahiro).]
  6. Our Proposed Method: Steps overview
     1. Extract missing tweets: determine which tweets were submitted during the user’s browsing time and which were submitted before or after it.
     2. Cluster tweets into categories and extract topics: using repeated bisection as the clustering tool, we divide a set of tweets into clusters and extract the topics of each cluster.
     3. Generate a topic graph: using the topics of the tweets and the Wikipedia category tree, we generate a topic graph of the tweets.
     4. Classify topics: classify the topics, which are the nodes of the topic graph, as known topics or unknown topics.
     5. Visualize the topic graph: we visualize the topic graph and the corresponding tweets using our Web user interface.
  7. Our Proposed Method: 0. Extraction of missing tweets. We extract the tweets that the user has not browsed. We assume that the browsing time is given. Browsing time may be available if we build a Twitter client application.
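The extraction step can be illustrated with a minimal Python sketch. It assumes, for illustration only, that browsing time is given as a list of (start, end) intervals and that a tweet counts as missed when its timestamp falls outside every interval; the slides do not specify the data format, and the function name `extract_missing_tweets` is hypothetical.

```python
from datetime import datetime

def extract_missing_tweets(tweets, browsing_sessions):
    """Return tweets posted outside every browsing interval.

    tweets: list of (timestamp, text) pairs (assumed format).
    browsing_sessions: list of (start, end) datetime pairs (assumed format).
    """
    def was_browsing(timestamp):
        # True if the timestamp falls inside any browsing interval.
        return any(start <= timestamp <= end for start, end in browsing_sessions)

    return [(ts, text) for ts, text in tweets if not was_browsing(ts)]


sessions = [(datetime(2017, 12, 5, 9, 0), datetime(2017, 12, 5, 9, 30))]
timeline = [
    (datetime(2017, 12, 5, 9, 10), "I'm at baseball stadium!"),
    (datetime(2017, 12, 5, 13, 0), "I'm at Salzburg!"),
]
missed = extract_missing_tweets(timeline, sessions)
```

A real client would probably track which tweets were actually rendered on screen; the interval test is only the simplest reading of "browsing time is given."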
  8. Our Proposed Method: 1. Clustering tweets. [Figure: clustering of tweets into C0 = Ichiro, C1 = Masahiro, and C3 = Human; C3 is deleted because it is too wide to cover topics.] We use repeated bisection for clustering tweets. In our experiment, repeated bisection was the most effective method for clustering short texts. It is similar to k-means. We remove noise clusters: we calculate the cosine similarity between every pair of texts in a cluster, and we remove the nodes if the similarity is beyond the threshold.
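The noise-cluster check can be sketched in Python as follows. The slide does not say how the pairwise similarities are aggregated or in which direction the threshold applies, so this sketch assumes that a cluster is treated as noise when its mean pairwise cosine similarity falls below a threshold. The function names and the threshold value 0.2 are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs of vectors in one cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single-tweet cluster is kept
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def drop_noise_clusters(clusters, threshold=0.2):
    """Keep only clusters whose mean pairwise similarity reaches the threshold."""
    return {
        name: vectors
        for name, vectors in clusters.items()
        if mean_pairwise_similarity(vectors) >= threshold
    }


clusters = {
    "tight": [[1.0, 0.0], [1.0, 0.1]],   # nearly parallel vectors
    "loose": [[1.0, 0.0], [0.0, 1.0]],   # orthogonal vectors
}
kept = drop_noise_clusters(clusters)
```

Under this assumed reading, a cluster like C3 = Human, whose tweets share few terms, would score low and be dropped.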
  9. Our Proposed Method: 1. Clustering tweets (repeated bisection). Given a set of tweets $T$, we extract a feature vector for each tweet. First, we divide a tweet into terms using morphological analysis or a POS tagger. Then, we select nouns and unknown terms as feature terms. We use unknown terms because they consist of slang and newly invented words that the morphological analyzer does not recognize. To clean the feature terms, we select the terms that are included in more than two tweets. The feature vector $f(t_i)$ of tweet $t_i$ ($t_i \in T$) is defined as follows.
     $$f(t_i) = [\,\mathrm{tf}(t_i, w_1)\cdot\mathrm{idf}(w_1),\ \mathrm{tf}(t_i, w_2)\cdot\mathrm{idf}(w_2),\ \ldots,\ \mathrm{tf}(t_i, w_m)\cdot\mathrm{idf}(w_m)\,] \tag{1}$$
     $$\mathrm{tf}(t_i, w_j) = \begin{cases} 1 & \text{if } w_j \text{ appears in } t_i \text{ at least once} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
     $$\mathrm{idf}(w_j) = -\log \frac{\mathrm{df}(w_j)}{|T|} \tag{3}$$
     where $w_j$ is a term in $T$, $|T|$ is the number of tweets in $T$, $\mathrm{tf}(t_i, w_j)$ indicates whether $w_j$ appears in $t_i$ or not, $\mathrm{df}(w_j)$ is the number of tweets that contain $w_j$, and $\mathrm{idf}(w_j)$ is the IDF (inverse document frequency) value of $w_j$, where a document is a tweet.
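The feature vector of Eqs. (1) to (3) can be sketched in Python as follows. The sketch assumes the tweets are already tokenized into feature terms (morphological analysis is out of scope here), reads tf as a binary presence indicator as described in the text, uses the natural logarithm, and interprets "more than two tweets" as a document frequency of at least three. The function name `build_feature_vectors` and the parameter `min_tweets` are hypothetical.

```python
import math
from collections import Counter

def build_feature_vectors(tokenized_tweets, min_tweets=3):
    """Return (vocabulary, vectors) using binary tf and idf = -log(df / |T|)."""
    n_tweets = len(tokenized_tweets)
    # df(w): number of tweets containing w; each term counts once per tweet.
    df = Counter(term for tweet in tokenized_tweets for term in set(tweet))
    # Keep only terms that appear in more than two tweets (assumed: df >= 3).
    vocabulary = sorted(w for w, count in df.items() if count >= min_tweets)
    idf = {w: -math.log(df[w] / n_tweets) for w in vocabulary}

    vectors = []
    for tweet in tokenized_tweets:
        present = set(tweet)
        # Binary tf: the component is idf(w) if w occurs in the tweet, else 0.
        vectors.append([idf[w] if w in present else 0.0 for w in vocabulary])
    return vocabulary, vectors


tweets = [
    ["baseball", "game"],
    ["baseball", "stadium"],
    ["baseball", "stadium"],
    ["travel", "stadium"],
]
vocabulary, vectors = build_feature_vectors(tweets)
```

Note that a term occurring in every tweet gets idf = 0 under Eq. (3), so it contributes nothing to the vector.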
  10. Our Proposed Method: 2. Generate a topic graph. [Figure: topic graph in which a topic node (Ichiro, Masahiro) is a parent node of tweet clusters and an abstract node (MLB player, Sports, Japanese) is a parent node of topic nodes; tweets correspond to categories.]
      1. Generate a topic node, which corresponds to a tweet.
      2. Generate a semantic node, which corresponds to a topic node.
      3. Merge multiple nodes into a simple structure of nodes.
  11. Our Proposed Method: 2.1 Generate a topic node. Topic node: a Wikipedia article that corresponds to a cluster.
      Method:
      1. The repeated bisection method outputs keywords for each category, with degrees of relatedness between keywords and categories.
      2. We retrieve articles in Wikipedia and find the most relevant category. Many categories have their own articles, so the categories are also candidates for topic nodes.
      Example: A category has the keywords {(“baseball”, 1), (“player”, 0.5)}. There are two Wikipedia articles, wp and wq: the title of wp is “baseball team” and the title of wq is “baseball player.” We calculate a score for each article: wp = 1 + 0 = 1 and wq = 1 + 0.5 = 1.5. We select article wq, “baseball player,” as the topic node.
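The scoring in the example above can be sketched in Python. The sketch assumes that an article's score is the sum of the weights of the keywords that occur as whole words in its title, which reproduces the slide's numbers for “baseball team” and “baseball player”; how the authors actually match keywords to titles is not stated. The function names are hypothetical.

```python
def score_title(title, keyword_weights):
    """Sum the weights of the keywords that occur as words in the title."""
    words = title.lower().split()
    return sum(weight for keyword, weight in keyword_weights.items() if keyword in words)

def select_topic_node(titles, keyword_weights):
    """Return the candidate title with the highest keyword score."""
    return max(titles, key=lambda title: score_title(title, keyword_weights))


keywords = {"baseball": 1.0, "player": 0.5}
candidates = ["baseball team", "baseball player"]
topic_node = select_topic_node(candidates, keywords)  # "baseball player"
```

Ties are resolved in favor of the first candidate, which is a property of `max` rather than anything the slides prescribe.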
  12. Our Proposed Method: 2.2 Generate a semantic node. Semantic node: the Wikipedia categories that correspond to the topic node.
      Method:
      1. Get category names from Wikipedia.
      2. Prune unsuitable categories from the semantic nodes using a blacklist (“Person born in 19xx,” “Stub,” “A list of xx,” ...).
      Example: Category c0 is tagged “Ichiro Suzuki.” The article “Ichiro Suzuki” has two categories, “Yankees Players” and “Baseball Players,” so “Yankees Players” and “Baseball Players” are considered semantic nodes.
      [Figure: Ichiro Suzuki is linked to Yankees Player and Baseball Player; Kenta Maeda is linked to Baseball Player and Dodgers Player.]
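The blacklist pruning step can be sketched in Python as follows. The three regular expressions are illustrative assumptions modeled on the slide's examples (“Person born in 19xx,” “Stub,” “A list of xx”); the authors' actual blacklist is not given, and the names `BLACKLIST_PATTERNS` and `prune_categories` are hypothetical.

```python
import re

# Hypothetical patterns modeled on the blacklist examples in the slide.
BLACKLIST_PATTERNS = [
    re.compile(r"^Person born in \d{4}$", re.IGNORECASE),
    re.compile(r"\bstubs?\b", re.IGNORECASE),
    re.compile(r"^(A )?list of ", re.IGNORECASE),
]

def prune_categories(categories):
    """Drop category names that match any blacklist pattern, keeping order."""
    return [
        category
        for category in categories
        if not any(pattern.search(category) for pattern in BLACKLIST_PATTERNS)
    ]


categories = [
    "Yankees Players",
    "Baseball Players",
    "Person born in 1973",
    "Baseball stubs",
    "A list of MLB players",
]
semantic_nodes = prune_categories(categories)
```

Pattern-based rules like these generalize better than a fixed list of names, since Wikipedia maintenance categories follow a small number of naming schemes.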
  13. Our Proposed Method: 2.3 Merge multiple nodes.
      [Figure: Example of two network graphs. Ichiro Suzuki is linked to Yankees Player and Baseball Player; Kenta Maeda is linked to Baseball Player and Dodgers Player.]
      [Figure: Two nodes are merged if the two graphs share common nodes. Ichiro Suzuki and Kenta Maeda are now linked through a single Baseball Player node, next to Yankees Player and Dodgers Player.]
      [Figure: If a leaf node and a non-leaf node correspond to the same article, these nodes are merged. The merged graph additionally contains a Sportspeople node.]
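The merge illustrated in the figures can be sketched in Python. The sketch assumes each graph is a dictionary mapping a node name to the set of its parent node names, and that two nodes denote the same Wikipedia article or category exactly when their names are equal. Under that assumption, taking the union of the dictionaries merges shared nodes automatically, including the case where a leaf in one graph is a non-leaf in another. The function names are hypothetical.

```python
def merge_graphs(*graphs):
    """Union several child -> parents graphs; equal names become one node."""
    merged = {}
    for graph in graphs:
        for child, parents in graph.items():
            merged.setdefault(child, set()).update(parents)
    return merged

def all_nodes(graph):
    """Return every node name that appears as a child or as a parent."""
    nodes = set(graph)
    for parents in graph.values():
        nodes |= parents
    return nodes


ichiro_graph = {"Ichiro Suzuki": {"Yankees Player", "Baseball Player"}}
maeda_graph = {"Kenta Maeda": {"Baseball Player", "Dodgers Player"}}
merged = merge_graphs(ichiro_graph, maeda_graph)
shared = merged["Ichiro Suzuki"] & merged["Kenta Maeda"]
```

After the merge, the two topic nodes are connected through the single shared parent, which is what lets a user move from one topic to a related one in the visualization.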
  14. Our Proposed Method: 3. Visualization. Visualize topic nodes and semantic nodes. [Figure: the topic graph (Ichiro, Masahiro, Japanese, MLB Player) shown next to a tweet list, e.g., “Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.” (about Ichiro) and “Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.” (about Masahiro).]
  15. Experiments: Experimental setup.
      Aim of our experiment: to confirm whether our method is effective, and to find which themes of tweets are appropriate for applying our proposed method.
      Evaluation measure: precision. We (the second author of our paper) manually selected an appropriate category for each tweet, and we calculate the precision for each category:
      $$\text{precision} = \frac{\text{number of accurately categorized tweets}}{\text{number of tweets in the category}}$$
      Dataset: five categories (Politics, Music, Computer, Sports, and Animation/Games). We prepared 2,000 tweets for each category using the Twitter search API.
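The per-category precision defined above can be computed with a short Python sketch. It assumes two dictionaries keyed by a tweet identifier: the category assigned by the method and the manually selected category. Both the data layout and the function name `precision_per_category` are hypothetical.

```python
from collections import defaultdict

def precision_per_category(assigned, gold):
    """Return category -> (correctly categorized tweets / tweets in the category)."""
    in_category = defaultdict(int)
    correct = defaultdict(int)
    for tweet_id, category in assigned.items():
        in_category[category] += 1
        # A tweet is accurately categorized if the manual label agrees.
        if gold.get(tweet_id) == category:
            correct[category] += 1
    return {category: correct[category] / in_category[category] for category in in_category}


assigned = {1: "Politics", 2: "Politics", 3: "Music", 4: "Politics"}
gold = {1: "Politics", 2: "Music", 3: "Music", 4: "Politics"}
scores = precision_per_category(assigned, gold)
```

Categories to which the method assigns no tweets do not appear in the result, which avoids a division by zero.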
  16. Experiments: Procedure of the experiment
      1. Cluster the 2,000 tweets for each theme, and extract the topics of each cluster.
      2. Generate the topic graph using our proposed method.
      3. Give the clusters and their corresponding Wikipedia article titles to the observers. Observers are hired through crowdsourcing (Crowdworks).
      4. Observers evaluate whether the article titles are appropriate for representing the clusters, on a five-point scale (5: appropriate, 4: almost appropriate, 3: cannot say, 2: almost inappropriate, 1: inappropriate).
      5. Summarize the observers’ evaluations, and analyze whether our proposed method has good accuracy.
  17. Experiments: Experimental results.
      [Figure: histogram of the number of evaluation scores (y-axis, 0 to 25) per bin of the average of evaluation scores (x-axis: 1.0-2.0, 2.0-3.0, 3.0-4.0, 4.0-5.0), for Politics, Music, Computer, Sports, and Video Games.]
      Table: Numbers of evaluation scores for respective bins.
      Theme | # obsv. | Prec.
      Politics | 8 | 0.72
      Music | 11 | 0.56
      Computer | 5 | 0.44
      Sports | 5 | 0.42
      Animation & Games | 4 | 0.52
      Our method is useful for tweets about politics: there are many technical terms about politics, and many articles are on Wikipedia. Our method is not effective for computer and sports: there is a wide variety of topics, and fewer articles are on Wikipedia.
  18. Experiments: Merging multiple topics. [Figure: one example of a topic/semantic graph, with the nodes “49: Yakult,” “Japanese beverages,” “34: Softbank,” “Companies listed on the Pink Sheets,” and “Companies based in Tokyo.”] Black nodes are topic nodes, and gray nodes are semantic nodes. There are topic nodes for “Yakult” and “Softbank.” Yakult is a manufacturer of drinks, and Softbank is a cell phone carrier. Both companies are based in Tokyo, so we can connect the two nodes using our proposed merging of nodes from multiple topics.
  19. Conclusion. We proposed a method for automatically extracting a user’s missing tweets based on topic granularity and the time during which the user was not browsing. We extract missing tweets based on the missing time. We proposed a method for mapping a set of extracted missing tweets to the Wikipedia category tree by considering topic structure granularity. We confirmed that our proposed method is effective for “politics,” but not effective for “computer” and “sports.” Future work: we should consider resources other than Wikipedia as a knowledge base, because Wikipedia is not always suitable for personal topics. We should consider synonyms. We should compare our method with other methods. We should run a usability test of the Web user interface.