Session 19: Social Media II
担当: デンソーアイティーラボラトリ 山本
【ICDE2013勉強会】
資料中の図は論文を引用しております。
13年6月29日土曜日
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
発表論文
} (1) A Unified Model for Stable and Temporal Topic Detection fr...
} 【やりたいこと】
Stable TopicとTemporal Topic考慮した上でのトピック抽出
} Stable Topic及びTemporal Topicの定義
} Stable Topic :いつも誰かがそのテーマについて言及...
【アプローチ】
}
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
A Unified Model for Stable and Temporal Topic Detection fro...
} special smoothing
} burst word対策
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
A Unified Model for Stable and Te...
} 結果
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
A Unified Model for Stable and Temporal Topic Detection from
Soc...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
Crowdsourced Enumeration Queries
7
・既存のDBを利用した検索
データベースの中に
存在するデータがすべて...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
Crowdsourced Enumeration Queries
} 内容
} クラウドソーシングの検索タスクに対する回答集合数の推定....
} 推定関数は種を単位とするか,個体間のダイバージェンスを考慮する
か,均等度を考慮に入れるか等によっていろいろあり。
} CHAO 84 estimator
} CHAO92 estimator(今回利用したもの)
} sample ...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
Crowdsourced Enumeration Queries
} chao92を利用して国数を推定してみた。
} 原因 worker...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
Crowdsourced Enumeration Queries
} ストリーカーの存在
} 完全な回答を提示する回答者も存在(ストリー...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
Crowdsourced Enumeration Queries
} Chao 92を改良した
Streaker-tolerant Est...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 3000万件のURL中、39%のurlが一つのみのtag(図1)
} 投稿数を...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 3000万件のURL中、39%のurlが一つのみのtag(図1)
} 投稿数を...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 3000万件のURL中、39%のurlが一つのみのtag(図1)
} 投稿数を...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 3000万件のURL中、39%のurlが一つのみのtag(図1)
} 投稿数を...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 3000万件のURL中、39%のurlが一つのみのtag(図1)
} 投稿数を...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 限られた予算において、インセンティブ配分を最適
化することによって品質の向上の最...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 限られた予算において、インセンティブ配分を最適
化することによって品質の向上の最...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 限られた予算において、インセンティブ配分を最適
化することによって品質の向上の最...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 限られた予算において、インセンティブ配分を最適
化することによって品質の向上の最...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
} 限られた予算において、インセンティブ配分を最適
化することによって品質の向上の最...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
¨ 以下の5種類を提案
¨ Free Choice (FC)
¨ Round R...
Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ)
On Incentive-based Tagging
¨ Free Choice: 50% posts are over-tagging,...
Upcoming SlideShare
Loading in …5
×

ICDE2013勉強会 Session 19: Social Media II

2,474
-1

Published on

ICDE2013勉強会 で利用したスライドです。

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
2,474
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

ICDE2013勉強会 Session 19: Social Media II

  1. 1. Session 19: Social Media II 担当: デンソーアイティーラボラトリ 山本 【ICDE2013勉強会】 資料中の図は論文を引用しております。 13年6月29日土曜日
  2. 2. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) 発表論文 } (1) A Unified Model for Stable and Temporal Topic Detection from Social Media Data } ソーシャルメディアにおけるコンテンツが一時的なトピックかそれ とも恒久的なトピックかを考慮した上で判定 } (2) Crowdsourced Enumeration Queries (Best Paper) } クラウドソーシングの検索タスクに対する回答集合数 (母集団)の推定. } 生物統計学における固有種数の推定手法を応用(CHAO92) } (3) On Incentive-based Tagging } tag情報の品質をインセンティブをワーカー与えることによって 向上させる。 2 13年6月29日土曜日
  3. 3. } 【やりたいこと】 Stable TopicとTemporal Topic考慮した上でのトピック抽出 } Stable Topic及びTemporal Topicの定義 } Stable Topic :いつも誰かがそのテーマについて言及している } Temporal Topic: 時系列上でみて、急激にそのテーマについて言及 する回数が激増・激減するようなテーマ。通常は実生活のイベント が影響 Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) A Unified Model for Stable and Temporal Topic Detection from Social Media Data 3 important and useful to distinguish temporal topics from stable topics in social media. However, such a discrimination is very challenging because the user-generated texts in social media are very short in length and thus lack useful linguistic features for precise analysis using traditional approaches. In this paper, we propose a novel solution to detect both stable and temporal topics simultaneously from social media data. Specifically, a unified user-temporal mixture model is proposed to distinguish temporal topics from stable topics. To improve this model’s performance, we design a regularization framework that exploits prior spatial information in a social network, as well as a burst-weighted smoothing scheme that exploits temporal prior information in the time dimension. We conduct extensive experiments to evaluate our proposal on two real data sets obtained from Del.icio.us and Twitter. The experimental results verify that our mixture model is able to distinguish temporal topics from stable topics in a single detection process. Our mixture model enhanced with the spatial regularization and the burst-weighted smoothing scheme significantly outperforms competitor approaches, in terms of topic detection accuracy and discrimination in stable and temporal topics. I. INTRODUCTION User-generated contents (UGC) in Web 2.0 are valuable resources capturing people’s interests, thoughts and actions. Such contents cover a wide variety of topics that present online and offline lives. For example, the microblog services gather many short but quickly-updated texts that contain both temporal and stable topics. Such topics form a huge and rich repository of various kinds of interesting information. Stable topics are often on users’ regular interests and their daily routine discussions, which usually evolve at a rather slow speed. The extraction of such stable topics enables us to personalize the results and to improve the result relevance in many applications such as computational advertising, content targeting, personal recommendation and web search. In contrast, temporal topics are on popular real-life events or hot spots. In many circumstances, temporal topics, e.g., breaking events in the real world, bring about popular discus- sion and wide diffusion on the Internet, where social networks further boost the discussion and diffusion. Take Twitter, the most popular microblog service, as an example. Many social events can be discovered in Twitter’s posts (tweets), such illustrated in Figure 1. We can tell the difference between them from the temporal distributions and the description keywords. A temporal topic has its text related to a certain event like “Independence Day celebration” in a certain period of time, and its popularity goes through a sharp increase at the occurring time of the event. A stable topic has its description on user’s regular interest like “Pet Adoption” and its temporal distribution exhibits no sharp, spike-like fluctuation. Fig. 1. Stable and Temporal Topics in Twitter It is important and useful to distinguish the temporal topics from the stable topics since they convey different kinds of information. However, temporal topics are discussed with less urgent themes in the background, and therefore temporal topics are deeply mixed with stable topics in social media. As a result, it is a challenging problem to detect and differenti- ate temporal and stable topics from large amounts of user- generated social media data. Research on traditional topic detection and tracking employs on-line incremental clustering [1] or retrospective off-line clus- tering [25] for documents and extracts representative features for clusters as a summary of the events. These methods are suitable for conventional web pages where most documents are long, rich in keywords, and related to certain popular events.13年6月29日土曜日
  4. 4. 【アプローチ】 } Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) A Unified Model for Stable and Temporal Topic Detection from Social Media Data 4 niques for topic detection from social media data [12]. IV. A MIXTURE MODEL FOR DETECTING STABLE AND TEMPORAL TOPICS In this section, we propose a user-temporal mixture topic model that integrates user and temporal features, followed by an EM-based algorithm for inferring model parameters. A. User-Temporal Model SYMBOL DESCRIPTION u, t, w user, time stamp, keyword U, T, W set of users, time stamps and keywords M[u, t, w] frequency of w used by u within time stamp t λU , λT parameter controlling the branch selection θi stable topic indexed by i θj temporal topic indexed by j ΘU , ΘT stable and temporal topic set TABLE I NOTATIONS In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize theme reflected from regular postings according to the stable interest of a user or a community. While temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θ we pay particular attention to its user u who generates Whether a keyword a temporal topic is dec contributions by the us For instance, if many certain period t, w wo with higher probability topics. Thus, keywords clustered into tempora to that of their keywor The topics generated individually. Both typ during the learning pr can filter out the stable branch. It also helps re disturbance from break B. Estimation of Mode Given an observati procedure of our model of generating the obser whole document collec 2, where p(w|u, t) is d L(C) = U The goal of parame 2. As this equation c Maximum Likelihood mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set. p(w|u, t) = λU θi∈ΘU p(θi|u)p(w|θi)+λT θj ∈ΘT p(θj|t)p(w|θj) (1) For the user branch, a stable topic is chosen according to the interest of a particular user. For the time branch, a temporal topic is generated according to the time stamp of a post, which means the post belongs to the topics that are popular for a short period of time around that time stamp. Temporal topics have their distribution on the time dimension, which indicates its popularity probabilities. The time period during which a temporal topic has its highest probability is its popularity period. In our setting, the user interest is assumed to be stable through time, and we ignore the possible slight evolution of user interest. maximizati ing the so- depends on In our m p(θi|u), p( and θj. For The detail temporal m E-step: where B(w ・user uがword w を時間 t に言及する確率 要はstableなトピックは人に依存、テンポラルなトピックは時間に依存 ze s. gs le al l, ne e. le es to that of their keywords. The topics generated in the two branches are not estimated individually. Both types of topics interact with each other during the learning procedure. This two-branch assumption can filter out the stable components from burst topics by stable branch. It also helps refine the quality of stable topics without disturbance from breaking events as time elapses. B. Estimation of Model Parameters Given an observation matrix M(U, T, W), the learning procedure of our model is to estimate the maximum probability of generating the observed samples. The log-likelihood of the whole document collection C by our approach is in Equation 2, where p(w|u, t) is defined according to Equation 1. L(C) = U T W M[u, t, w] log p(w|u, t) (2) The goal of parameter estimation is to maximize Equation 2. As this equation cannot be solved directly by applying Maximum Likelihood Estimation (MLE), we apply an EM approach instead. In an expectation (E) step of the EM ・user-time-associated document collection Cにおけるlog-likelihood E-Mアルゴリズムを利用すれば、 In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize theme reflected from regular postings according to the stable interest of a user or a community. While temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θi we pay particular attention to its user u who generates it. For a temporal topic θj we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set. p(w|u, t) = λU θi∈ΘU p(θi|u)p(w|θi)+λT θj ∈ΘT p(θj|t)p(w|θj) (1) For the user branch, a stable topic is chosen according to the pr of wh 2, 2. M ap ap va m in de p( an Th tem In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize theme reflected from regular postings according to the stable interest of a user or a community. While temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θi we pay particular attention to its user u who generates it. For a temporal topic θj we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set. p(w|u, t) = λU θi∈ΘU p(θi|u)p(w|θi)+λT θj ∈ΘT p(θj|t)p(w|θj) (1) For the user branch, a stable topic is chosen according to the pr of w 2, 2. M ap ap va m in de p( an Th te In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize theme reflected from regular postings according to the stable interest of a user or a community. While temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θi we pay particular attention to its user u who generates it. For a temporal topic θj we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set. p(w|u, t) = λU θi∈ΘU p(θi|u)p(w|θi)+λT θj ∈ΘT p(θj|t)p(w|θj) (1) For the user branch, a stable topic is chosen according to the procedure of of generating whole docum 2, where p(w L(C The goal o 2. As this e Maximum L approach ins approach, po variables bas maximization ing the so-ca depends on th In our mo p(θi|u), p(θj and θj. For si The detailed temporal mo E-step: In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize theme reflected from regular postings according to the stable interest of a user or a community. While temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θi we pay particular attention to its user u who generates it. For a temporal topic θj we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set. p(w|u, t) = λU θi∈ΘU p(θi|u)p(w|θi)+λT θj ∈ΘT p(θj|t)p(w|θj) (1) For the user branch, a stable topic is chosen according to the procedure of of generating whole docum 2, where p(w L(C The goal o 2. As this e Maximum L approach ins approach, po variables bas maximization ing the so-ca depends on th In our mo p(θi|u), p(θj and θj. For si The detailed temporal mo E-step: が求まる stableなトピック テンポラルなトピック 13年6月29日土曜日
  5. 5. } special smoothing } burst word対策 Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) A Unified Model for Stable and Temporal Topic Detection from Social Media Data 5 pected complete data : T) (18) al regularization, we (θj|t) in the M-step. opted to estimate all anced model with the njoys a similar form rly, just as the spatial izer R(C, T) is non- d decreases R(C, T) ) +1 and R(C, T) into lution for ψ (2) n+1, and (θj|t + 1) (m) n+1 (19) and p(θi|u)n+1 re- poral regularization egularization meter γ; 9); temporal topic. An example of two kinds of words is shown in Figure 2. Three burst words “mj”, “moonwalk” and “michaeljackson” have their distribution curves with sharp spikes. We can see that although the trends of these words do not always synchronize, they all go through a drastic increase and reach peaks in July 2009. The bursts in their curve are ignited by a real life event, i.e., Michael Jackson’s death. An effective topic model should capture these words into one topic. On the other hand, abstract words like “news” and “world” maintains high occurrences throughout the year in Figure 2 but they convey little information. Although they are relevant to the event in July, they also have relationships to many other topics. For example, word “news” could be used to represent various different news. However, such abstract words shadow the spikes of more meaningful words. The high occurrences of such abstract words during the burst period of the burst words may overwhelm the latter and render them unnoticed. Fig. 2. Normalized Word Frequency Distribution on “Michael Jackson’s Death” in 2009 To boost interesting temporal topics, we propose a smooth- ing technique that merges correlated words into one temporal Bursty Degreeを計測 (Yao at al ICDE’2010) して補正をかける。 友達間で同一のトピックに対して盛り 上がっているときは、補正をかける。 13年6月29日土曜日
  6. 6. } 結果 Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) A Unified Model for Stable and Temporal Topic Detection from Social Media Data 6 ds: tes ric ing ted rds flat zed nce the ults ure his ure al- the nd, ics by The tor ral hod DA of est Twitter data set by hiring 3 volunteers as annotators. For each topic, we extracted keywords with the highest probabilities to represent its content. Each topic was labeled by two different annotators, and if they disagreed a third annotator was introduced. Three exclusive labels were provided to indicate the quality of temporal topic detection. • Excellent: a nicely presented temporal topic • Good: a topic containing bursty features • Poor: a topic without obvious bursty features Excellent Good Poor EUTB 42.5% 32.5% 25% TOT 10% 40% 50% Individual Detection 20% 37.5% 42.5% TimeUserLDA 29.5% 38% 32.5% Twitter-LDA 13.5% 39% 47.5% TABLE III COMPARISON ON TEMPORAL TOPIC QUALITY The labeling results are summarized in Table III. Up to 75% of the temporal topics detected by EUTB were labeled as “Ex- cellent” or “Good”, and 42.5% were regarded as “Excellent”. Among all competitors, TimeUserLDA performs best. 67.5% of the detected temporal topics were judged as “Excellent” or “Good”, and 29.5% were regarded as “Excellent”. Other competitors got merely or slightly more than 50% of their detected topics labeled as “Excellent” or “Good”. In particular, the competitors got significantly less “Excellent” labels. These results demonstrate that our proposed user-temporal mixture PLSA on slices Individual Detection TOT model EUTB TimeUserLDA latest michaeljackson news michaeljackson news headline july world jackson jackson news breaking breaking mj michael investigative news jackson moonwalk michaeljackson michaeljackson headline michaeljackson death death event investigative death news investigative TABLE IV TOPIC “MICHAEL JACKSON” DETECTED BY DIFFERENT APPROACHES T77 T78 T87 T89 T60 T71 2009.1.12-2009.1.31 2009.6.15-2009.6.27 2009.4.24-2009.5.6 2009.5.27-2009.6.6 2009.1.24-2009.1.27 2009.1.1-2009.1.6 obama 0.144 moon 0.090 flu 0.158 google 0.061 droid 0.125 2008 0.099 inauguration 0.106 space 0.068 swineflu 0.124 googlewave 0.059 go 0.113 webcomics 0.046 ユーザインタビューによる テストの結果、提案手法(EUTB) によるトッピック抽出は 評価が高かった。 13年6月29日土曜日
  7. 7. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries 7 ・既存のDBを利用した検索 データベースの中に 存在するデータがすべて。 (closed -world assumption) ・クラウドソースを利用した検索 データはweb/頭の中に存在 母集団の数がわからん。。 based on the CROWD annotations and optional fre tations of columns and tables in the schema. Fig an example HTML-based UI that would be pre worker for the following crowd table definition: CREATE CROWD TABLE ice_cream_flavor { name VARCHAR PRIMARY KEY } Although CrowdDB supports alternate user inte showing previously received answers), this pape a pure form of the “getting it all” question. alternative UIs is the subject of future work. During query processing, the system automat one or more HITs using the AMT web service A 13年6月29日土曜日
  8. 8. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries } 内容 } クラウドソーシングの検索タスクに対する回答集合数の推定. } 生物統計学における固有種数の推定手法を応用(CHAO92) } 固有種数の推定手法とは? } ある特定地域の個体数を調べ、種の種類や密度を推定。 } 同種法を用いて類推した例としては、例えば地球上の恐竜の種類の 推定、等が有名(図)。 } 8 Estimating the diversity of dinosaurs (Steve C. Wang and Peter Dodson ) 13年6月29日土曜日
  9. 9. } 推定関数は種を単位とするか,個体間のダイバージェンスを考慮する か,均等度を考慮に入れるか等によっていろいろあり。 } CHAO 84 estimator } CHAO92 estimator(今回利用したもの) } sample coverageという概念を利用 Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries 9 hao84 Estimator hao develops a simple estimator for species rich- based solely on the number of rare species found ple: ˆNchao84 = c + f2 1 2f2 d that it actually is a lower bound, but it per- l on her test data sets. She also found that the works best when there are relatively rare species, ten the case in real species estimation scenarios. hao92 Estimator hao develops another estimator based on the no- ple coverage. The sample coverage C is the sum babilities pi of the observed classes. However, underlying distribution p1...pN is unknown, this om the Good-Turing estimator[20] is used: ˆC = 1 f1/n 92 estimator attempts to explicitly characterize orate the skew of the underlying distribution us- e cient of variance (CV), denoted , a metric e used to describe the variance in a probability for the ice cream flavors. In this paper we clea ified workers’ answers manually; other work h techniques for crowd-based verification [24, 1, 1 Figure 5(a-c) shows the average cardinality es time, i.e., for increasing numbers of HITs, for th UN countries, and ice cream flavors using th estimators. Error bars can be computed using v mators provided in [8, 6], however we omit the readability. The horizontal line indicates the t ity if it is known. Below each graph, a tab “f1-ratio” and the actual number of received u over time. We define f1-ratio as f1/ P i fi, th the singletons as compared to the overall rec items. Recall that the presence of singletons is dicator that there are more undetected items; are relatively few singletons, we have likely app plateau of the SAC. The f1-ratio can be used tion of whether or not the sample size is su cie cardinality estimation. Since estimators use the quencies of f1 compared to the other fi’s, a will make it more di cult for the estimators Also note that the ratio between the unique it predicted cardinality is the completeness estim c: 観測された種の数 f1:一度のみ観測された種の数 f2:二度観測された種の数 sed solely on the number of rare species found ˆNchao84 = c + f2 1 2f2 hat it actually is a lower bound, but it per- n her test data sets. She also found that the ks best when there are relatively rare species, the case in real species estimation scenarios. o92 Estimator develops another estimator based on the no- coverage. The sample coverage C is the sum bilities pi of the observed classes. However, erlying distribution p1...pN is unknown, this the Good-Turing estimator[20] is used: ˆC = 1 f1/n estimator attempts to explicitly characterize te the skew of the underlying distribution us- cient of variance (CV), denoted , a metric sed to describe the variance in a probability 8]; we can use the CV to compare the skew ass distributions. The CV is defined as the ation divided by the mean. Given the pi’s at describe the probability of the ith class be- with mean ¯p = P i pi/N = 1/N, the CV is ⇥P ⇤ techniques for crowd-based verification Figure 5(a-c) shows the average cardi time, i.e., for increasing numbers of HIT UN countries, and ice cream flavors u estimators. Error bars can be computed mators provided in [8, 6], however we o readability. The horizontal line indicat ity if it is known. Below each graph “f1-ratio” and the actual number of re over time. We define f1-ratio as f1/ P the singletons as compared to the ove items. Recall that the presence of sing dicator that there are more undetecte are relatively few singletons, we have li plateau of the SAC. The f1-ratio can b tion of whether or not the sample size i cardinality estimation. Since estimator quencies of f1 compared to the other will make it more di cult for the esti Also note that the ratio between the u predicted cardinality is the completene 3.3.1 US States For the US states (Figure 5(a)), all fairly well; Chao92 remains closer to Chao84. The estimates are stable at the true value even earlier. Note this C: sample coverage (観測された種の確率piの和) since the underlying distribution p1...pN is unknown, this estimate from the Good-Turing estimator[20] is used: ˆC = 1 f1/n The Chao92 estimator attempts to explicitly characterize and incorporate the skew of the underlying distribution us- ing the coe cient of variance (CV), denoted , a metric that can be used to describe the variance in a probability distribution [8]; we can use the CV to compare the skew of di↵erent class distributions. The CV is defined as the standard deviation divided by the mean. Given the pi’s (p1 · · · pN ) that describe the probability of the ith class be- ing selected, with mean ¯p = P i pi/N = 1/N, the CV is expressed as = ⇥P i(pi ¯p)2 /N ⇤1/2 / ¯p [8]. A higher CV indicates higher variance amongst the pi’s, while a CV of 0 indicates that each item is equally likely. The true CV cannot be calculated without knowledge of the pi’s, so Chao92 uses an estimate, ˆ. ˆ2 = max ( c ˆC X i i(i 1)fi n(n 1) 1, 0 ) (2) The estimator that uses the coe cient of variance is below; note that if ˆ2 = 0 (i.e., indicating a uniform distribution), the estimator reduces to c/ ˆC ˆNchao92 = c ˆC + n(1 ˆC) ˆC ˆ2 plateau of the SAC. The f1- tion of whether or not the sa cardinality estimation. Sinc quencies of f1 compared to will make it more di cult Also note that the ratio bet predicted cardinality is the 3.3.1 US States For the US states (Figur fairly well; Chao92 remains Chao84. The estimates are the true value even earlier. all fifty states are acquired ( may be be surprising that t as well as it does, as one m would be more commonly a few explanations for this age coe cient of variance 0.53; in [8], Chao notes tha reasonable for  0.5. F typically do not submit th samples drawn without rep tribution will result in a les original. We discuss sampli in Section 4. Individual wor di↵erent skewed distribution states before those in the m orporate the skew of the underlying distribution us- coe cient of variance (CV), denoted , a metric n be used to describe the variance in a probability ution [8]; we can use the CV to compare the skew rent class distributions. The CV is defined as the rd deviation divided by the mean. Given the pi’s pN ) that describe the probability of the ith class be- ected, with mean ¯p = P i pi/N = 1/N, the CV is ed as = ⇥P i(pi ¯p)2 /N ⇤1/2 / ¯p [8]. A higher CV es higher variance amongst the pi’s, while a CV of 0 es that each item is equally likely. true CV cannot be calculated without knowledge of , so Chao92 uses an estimate, ˆ. ˆ2 = max ( c ˆC X i i(i 1)fi n(n 1) 1, 0 ) (2) timator that uses the coe cient of variance is below; at if ˆ2 = 0 (i.e., indicating a uniform distribution), mator reduces to c/ ˆC ˆNchao92 = c ˆC + n(1 ˆC) ˆC ˆ2 Experimental Results an over 25,000 HITs on AMT to compare the perfor- Also note that the ratio betw predicted cardinality is the c 3.3.1 US States For the US states (Figure fairly well; Chao92 remains Chao84. The estimates are the true value even earlier. all fifty states are acquired (o may be be surprising that th as well as it does, as one mi would be more commonly c a few explanations for this age coe cient of variance ˆ 0.53; in [8], Chao notes that reasonable for  0.5. Fu typically do not submit the samples drawn without repla tribution will result in a less original. We discuss samplin in Section 4. Individual work di↵erent skewed distribution states before those in the mi 3.3.2 UN Countries その他  Abundance-based Coverage Estimator 等様々な手法が存在 13年6月29日土曜日
  10. 10. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries } chao92を利用して国数を推定してみた。 } 原因 worker behaviorに起因 10 200 400 600 800 050100150200250300 # answers chao92estimate actual expected Fig. 4. Estimated Cardinality (A, B, C, D, F, A, G, B, A, ….) … A B C D E F G H I J K... (A, B, G, H, F, I, A, E, E, K, ….) (a) Database Sampling (B) Crowd Based Sampling = sampling process with replacement = sampling process without replacement Worker Processes Worker ArrivalProcess A B C D E F G H I J K... A B C D E F G H I J K... A B C D E F G H I J K... Fig. 5. Sampling Process workers complete different amounts of work and arrive/depart from the experiment at different points in time. The next subsection formalizes a model of how answers arrive from the crowd in response to a set enumeration query, as well as a description of how crowd behaviors impact the sample of answers received. We then use simulation to demonstrate the principles of how these behaviors play off one another and thereby influence an estimation algorithm. B. A Model for Human Enumerations Species estimation algorithms assume a with-replacement sample from some unknown distribution describing item likeli- hoods (visualized in Figure 5(a)). The order in which elements 1) Sampling Without Replacement: When a worker submits multiple items for a set enumeration query, each answer is different from his previous ones. In other words, individuals are sampling without replacement from some underlying dis- tribution that describes the likelihood of selecting each answer. Of course, this behavior is beneficial with respect to the goal of acquiring all the items in the set, as low-probability items be- come more likely after the high-probability items have already been provided by that worker (we do not pay for duplicated work from a single worker). A negative side effect of workers sampling without replacement is that the estimation algorithm receives less information about the relative frequency of items, and thus the skew, of the underlying data distribution; having うまくいかない。。 (A, B, C, D, F, A, G, B, A, ….) … A B C D E F G H I J K... (A, B, G, H, F, I, A, E, E, K, ….) (a) Database Sampling (B) Crowd Based Sampling = sampling process with replacement = sampling process without replacement Worker Processes Worker ArrivalProcess A B C D E F G H I J K... A B C D E F G H I J K... A B C D E F G H I J K... Fig. 5. Sampling Process t s 1) Sampling Without Replacement: When a worker submits multiple items for a set enumeration query, each answer is different from his previous ones. In other words, individuals ・種推定においては、アイテム尺度が    未知の分布から標本が抽出される。 ・人間による列挙では、ある内在する  アイテム分布に基づき標本(回答)が抽出される。 13年6月29日土曜日
  11. 11. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries } ストリーカーの存在 } 完全な回答を提示する回答者も存在(ストリーカー) } 特徴としては、重複なしの標本抽出をする。その結果真値 よりも過大に推定されてしまう。 } 開始時に200アイテムすべてを回答するストリーカーを追 加して検証 11 (b) forms of skew (c) impact of streaker 500 1000 1500 2000 # answers ws=T, dd=T ws=F, dd=T ws=T, dd=F ws=F, dd=F 500 1000 1500 2000 0100200300400 # answers chao92estimate ws=T, dd=T ws=F, dd=T ws=T, dd=F ws=F, dd=F ation simulations illustrating the impact of worker behaviors13年6月29日土曜日
  12. 12. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) Crowdsourced Enumeration Queries } Chao 92を改良した Streaker-tolerant Estimatorという手法を開発 } ストリーカの許容範囲を設定 } Detect and remove outliners 12 (a) UN 1 200 400 600 800 0100200300 # answers chao92estimate Φorig = 0.14 Φnew = 0.087 (b) UN 2 200 400 600 800 0100200300 # answers chao92estimate Φorig = 0.11 Φnew = 0.099 (f) States 1 100 Φorig = 0.046 Φ = 0.053 (g) States 2 100 Φorig = 0.028 Φ = 0.024 (a) UN 1 200 400 600 800 0100200300 # answers chao92estimate Φorig = 0.14 Φnew = 0.087 (b) UN 2 200 400 600 800 0100200300 # answers chao92estimate Φorig = 0.11 Φnew = 0.099 (f) States 1 100 Φorig = 0.046 Φnew = 0.053 (g) States 2 100 Φorig = 0.028 Φnew = 0.02413年6月29日土曜日
  13. 13. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 3000万件のURL中、39%のurlが一つのみのtag(図1) } 投稿数を増やせばいいの?→品質がむしろ悪くなる (図2) 13 13年6月29日土曜日
  14. 14. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 3000万件のURL中、39%のurlが一つのみのtag(図1) } 投稿数を増やせばいいの?→品質がむしろ悪くなる (図2) 13 13年6月29日土曜日
  15. 15. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 3000万件のURL中、39%のurlが一つのみのtag(図1) } 投稿数を増やせばいいの?→品質がむしろ悪くなる (図2) 13 13年6月29日土曜日
  16. 16. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 3000万件のURL中、39%のurlが一つのみのtag(図1) } 投稿数を増やせばいいの?→品質がむしろ悪くなる (図2) 13 13年6月29日土曜日
  17. 17. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 3000万件のURL中、39%のurlが一つのみのtag(図1) } 投稿数を増やせばいいの?→品質がむしろ悪くなる (図2) 13 13年6月29日土曜日
  18. 18. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 限られた予算において、インセンティブ配分を最適 化することによって品質の向上の最大化 ワーカーに対してどのbudget を割り振るかが課題 13年6月29日土曜日
  19. 19. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 限られた予算において、インセンティブ配分を最適 化することによって品質の向上の最大化 Selected ワーカーに対してどのbudget を割り振るかが課題 13年6月29日土曜日
  20. 20. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 限られた予算において、インセンティブ配分を最適 化することによって品質の向上の最大化 Selected Quality Metric for Tag Data ワーカーに対してどのbudget を割り振るかが課題 13年6月29日土曜日
  21. 21. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 限られた予算において、インセンティブ配分を最適 化することによって品質の向上の最大化 Selected Quality Metric for Tag Data ワーカーに対してどのbudget を割り振るかが課題 13年6月29日土曜日
  22. 22. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging } 限られた予算において、インセンティブ配分を最適 化することによって品質の向上の最大化 Selected Quality Metric for Tag Data ワーカーに対してどのbudget を割り振るかが課題 13年6月29日土曜日
  23. 23. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging ¨ 以下の5種類を提案 ¨ Free Choice (FC) ¨ Round Robin (RR) ¨ Fewest Post First (FP) ¤ タグが付けられていないものを優先 ¨ Most Unstable First (MU) ¤ rfd(Relative Frequency Distribution)の値をみて最も不確 かなものを選択 ¨ Hybrid (FP-MU) ¨ 以上の手法をDP(theoretically optimal solution)と比較 13年6月29日土曜日
  24. 24. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) On Incentive-based Tagging ¨ Free Choice: 50% posts are over-tagging, wasted } 16 ¨ FP & FP-MU are close to optimal ¨ Budget = 1,000 ¤ 0.7% more posts comparing with initial no. ¤ 6.7% quality improvement ¨ Free Choice: 50% posts are over- tagging, wasted 13年6月29日土曜日
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×