ICDE 2013 Study Group, Session 19: Social Media II
1. Session 19: Social Media II
Presenter: Yamamoto (山本), Denso IT Laboratory
[ICDE 2013 Study Group]
Figures in these slides are quoted from the papers.
Saturday, June 29, 2013
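2. Session 19: Social Media II (Presenter: 山本光穂, Denso IT Laboratory)
Papers presented
• (1) A Unified Model for Stable and Temporal Topic Detection from Social Media Data
  ◦ Detects topics while taking into account whether social media content belongs to a short-lived topic or a lasting one.
• (2) Crowdsourced Enumeration Queries (Best Paper)
  ◦ Estimates the size of the answer set (the population) for a crowdsourced search task.
  ◦ Applies a species-richness estimation technique from biostatistics (CHAO92).
• (3) On Incentive-based Tagging
  ◦ Improves the quality of tag data by giving incentives to workers.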
3. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
• [Goal] Topic extraction that takes both stable topics and temporal topics into account
• Definitions of stable and temporal topics
  ◦ Stable topic: a theme that someone is always talking about
  ◦ Temporal topic: a theme whose number of mentions surges or plunges sharply over time, usually driven by real-life events
important and useful to distinguish temporal topics from stable
topics in social media. However, such a discrimination is very
challenging because the user-generated texts in social media are
very short in length and thus lack useful linguistic features for
precise analysis using traditional approaches.
In this paper, we propose a novel solution to detect both
stable and temporal topics simultaneously from social media data.
Specifically, a unified user-temporal mixture model is proposed
to distinguish temporal topics from stable topics. To improve this
model’s performance, we design a regularization framework that
exploits prior spatial information in a social network, as well
as a burst-weighted smoothing scheme that exploits temporal
prior information in the time dimension. We conduct extensive
experiments to evaluate our proposal on two real data sets
obtained from Del.icio.us and Twitter. The experimental results
verify that our mixture model is able to distinguish temporal
topics from stable topics in a single detection process. Our
mixture model enhanced with the spatial regularization and
the burst-weighted smoothing scheme significantly outperforms
competitor approaches, in terms of topic detection accuracy and
discrimination in stable and temporal topics.
I. INTRODUCTION
User-generated contents (UGC) in Web 2.0 are valuable
resources capturing people’s interests, thoughts and actions.
Such contents cover a wide variety of topics that present
online and offline lives. For example, the microblog services
gather many short but quickly-updated texts that contain both
temporal and stable topics. Such topics form a huge and rich
repository of various kinds of interesting information.
Stable topics are often on users’ regular interests and their
daily routine discussions, which usually evolve at a rather
slow speed. The extraction of such stable topics enables us to
personalize the results and to improve the result relevance in
many applications such as computational advertising, content
targeting, personal recommendation and web search.
In contrast, temporal topics are on popular real-life events
or hot spots. In many circumstances, temporal topics, e.g.,
breaking events in the real world, bring about popular discussion
and wide diffusion on the Internet, where social networks
further boost the discussion and diffusion. Take Twitter, the
most popular microblog service, as an example. Many social
events can be discovered in Twitter’s posts (tweets), such
illustrated in Figure 1. We can tell the difference between
them from the temporal distributions and the description
keywords. A temporal topic has its text related to a certain
event like “Independence Day celebration” in a certain period
of time, and its popularity goes through a sharp increase at the
occurring time of the event. A stable topic has its description
on user’s regular interest like “Pet Adoption” and its temporal
distribution exhibits no sharp, spike-like fluctuation.
Fig. 1. Stable and Temporal Topics in Twitter
It is important and useful to distinguish the temporal topics
from the stable topics since they convey different kinds of
information. However, temporal topics are discussed with less
urgent themes in the background, and therefore temporal topics
are deeply mixed with stable topics in social media. As a
result, it is a challenging problem to detect and differentiate
temporal and stable topics from large amounts of user-generated
social media data.
Research on traditional topic detection and tracking employs
on-line incremental clustering [1] or retrospective off-line clustering
[25] for documents and extracts representative features
for clusters as a summary of the events. These methods are
suitable for conventional web pages where most documents are
long, rich in keywords, and related to certain popular events.
4. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
• [Approach]
IV. A MIXTURE MODEL FOR DETECTING STABLE AND
TEMPORAL TOPICS
In this section, we propose a user-temporal mixture topic
model that integrates user and temporal features, followed by
an EM-based algorithm for inferring model parameters.
A. User-Temporal Model
TABLE I. NOTATIONS
Symbol        Description
u, t, w       user, time stamp, keyword
U, T, W       set of users, time stamps and keywords
M[u, t, w]    frequency of w used by u within time stamp t
λ_U, λ_T      parameters controlling the branch selection
θ_i           stable topic indexed by i
θ_j           temporal topic indexed by j
Θ_U, Θ_T      stable and temporal topic sets
In the user-temporal mixture model, we pre-categorize
topics into two types: stable topics and temporal topics.
Stable topics summarize theme reflected from regular postings
according to the stable interest of a user or a community. While
temporal topics capture the popular events or controversial
news igniting hot discussion in a certain period. In this model,
we aim to detect both temporal and stable topics in one
generating process. Table I lists the relevant notations we use.
The mixture model is represented in Equation 1. For a stable topic θi we pay particular attention to its user u who generates it. For a temporal topic θj we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λU and λT in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set.
$$p(w \mid u, t) = \lambda_U \sum_{\theta_i \in \Theta_U} p(\theta_i \mid u)\, p(w \mid \theta_i) + \lambda_T \sum_{\theta_j \in \Theta_T} p(\theta_j \mid t)\, p(w \mid \theta_j) \quad (1)$$
・Equation 1 is the probability that user u mentions word w at time t.
・In short, stable topics depend on the person, while temporal topics depend on the time.
For the user branch, a stable topic is chosen according to the interest of a particular user. For the time branch, a temporal topic is generated according to the time stamp of a post, which means the post belongs to the topics that are popular for a short period of time around that time stamp. Temporal topics have their distribution on the time dimension, which indicates its popularity probabilities. The time period during which a temporal topic has its highest probability is its popularity period. In our setting, the user interest is assumed to be stable through time, and we ignore the possible slight evolution of user interest.
[...] to that of their keywords.
The topics generated in the two branches are not estimated individually. Both types of topics interact with each other during the learning procedure. This two-branch assumption can filter out the stable components from burst topics by stable branch. It also helps refine the quality of stable topics without disturbance from breaking events as time elapses.
B. Estimation of Model Parameters
Given an observation matrix M(U, T, W), the learning procedure of our model is to estimate the maximum probability of generating the observed samples. The log-likelihood of the whole document collection C by our approach is in Equation 2, where p(w|u, t) is defined according to Equation 1.
$$L(C) = \sum_{u \in U} \sum_{t \in T} \sum_{w \in W} M[u, t, w] \log p(w \mid u, t) \quad (2)$$
The goal of parameter estimation is to maximize Equation 2. As this equation cannot be solved directly by applying Maximum Likelihood Estimation (MLE), we apply an EM approach instead. In an expectation (E) step of the EM [...]
・Equation 2 is the log-likelihood of the user-time-associated document collection C.
・Applying the EM algorithm yields both the stable topics and the temporal topics.
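As a rough illustration of Equations 1 and 2, the sketch below evaluates the two-branch mixture probability and the log-likelihood on a tiny hand-made example. It is not the authors' implementation: the function names, the list-based parameter layout, and all numbers are made up for illustration, and it assumes λU + λT = 1 with every conditional distribution already normalized. The EM updates themselves are not shown because the slides cut them off.

```python
import math

def mixture_prob(w, u, t, lam_u, lam_t,
                 p_topic_given_user, p_topic_given_time,
                 p_word_given_stable, p_word_given_temporal):
    """Equation 1: p(w|u,t) as a user branch plus a time branch."""
    # User branch: stable topics are chosen according to the user u.
    stable = sum(p_topic_given_user[u][i] * p_word_given_stable[i][w]
                 for i in range(len(p_word_given_stable)))
    # Time branch: temporal topics are chosen according to the time stamp t.
    temporal = sum(p_topic_given_time[t][j] * p_word_given_temporal[j][w]
                   for j in range(len(p_word_given_temporal)))
    return lam_u * stable + lam_t * temporal

def log_likelihood(M, **params):
    """Equation 2: sum of M[u,t,w] * log p(w|u,t) over observed triples."""
    return sum(count * math.log(mixture_prob(w, u, t, **params))
               for (u, t, w), count in M.items())

# Toy parameters (hypothetical): 2 users, 2 time stamps, 3 keywords,
# 2 stable topics and 1 temporal topic.
params = dict(
    lam_u=0.6, lam_t=0.4,
    p_topic_given_user=[[0.9, 0.1], [0.2, 0.8]],      # p(theta_i | u)
    p_topic_given_time=[[1.0], [1.0]],                # p(theta_j | t)
    p_word_given_stable=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    p_word_given_temporal=[[0.2, 0.5, 0.3]],
)

# Sparse observation counts M[u, t, w].
M = {(0, 0, 0): 3, (0, 1, 1): 1, (1, 0, 2): 2}

p_example = mixture_prob(0, 0, 0, **params)
ll = log_likelihood(M, **params)
```

Because both branches are proper distributions over keywords and the two coefficients sum to one, p(w|u, t) sums to one over w for every (u, t) pair, which is a quick sanity check on any parameter setting.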
5. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
• Special smoothing
• A countermeasure against burst words
An example of two kinds of words is shown in Figure 2.
Three burst words “mj”, “moonwalk” and “michaeljackson”
have their distribution curves with sharp spikes. We can
see that although the trends of these words do not always
synchronize, they all go through a drastic increase and reach
peaks in July 2009. The bursts in their curve are ignited by
a real life event, i.e., Michael Jackson’s death. An effective
topic model should capture these words into one topic.
On the other hand, abstract words like “news” and “world”
maintains high occurrences throughout the year in Figure 2
but they convey little information. Although they are relevant
to the event in July, they also have relationships to many other
topics. For example, word “news” could be used to represent
various different news. However, such abstract words shadow
the spikes of more meaningful words. The high occurrences of
such abstract words during the burst period of the burst words
may overwhelm the latter and render them unnoticed.
Fig. 2. Normalized Word Frequency Distribution on “Michael Jackson’s
Death” in 2009
To boost interesting temporal topics, we propose a smoothing
technique that merges correlated words into one temporal
・Measure the bursty degree (Yao et al., ICDE 2010) and apply a correction.
・When friends are actively discussing the same topic, apply a correction.
6. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
• Results
Twitter data set by hiring 3 volunteers as annotators. For each
topic, we extracted keywords with the highest probabilities
to represent its content. Each topic was labeled by two
different annotators, and if they disagreed a third annotator was
introduced. Three exclusive labels were provided to indicate
the quality of temporal topic detection.
• Excellent: a nicely presented temporal topic
• Good: a topic containing bursty features
• Poor: a topic without obvious bursty features
TABLE III. COMPARISON ON TEMPORAL TOPIC QUALITY
Method                  Excellent   Good    Poor
EUTB                    42.5%       32.5%   25%
TOT                     10%         40%     50%
Individual Detection    20%         37.5%   42.5%
TimeUserLDA             29.5%       38%     32.5%
Twitter-LDA             13.5%       39%     47.5%
The labeling results are summarized in Table III. Up to 75%
of the temporal topics detected by EUTB were labeled as “Excellent”
or “Good”, and 42.5% were regarded as “Excellent”.
Among all competitors, TimeUserLDA performs best. 67.5%
of the detected temporal topics were judged as “Excellent”
or “Good”, and 29.5% were regarded as “Excellent”. Other
competitors got merely or slightly more than 50% of their
detected topics labeled as “Excellent” or “Good”. In particular,
the competitors got significantly less “Excellent” labels. These
results demonstrate that our proposed user-temporal mixture
TABLE IV. TOPIC “MICHAEL JACKSON” DETECTED BY DIFFERENT APPROACHES
PLSA on slices    Individual Detection   TOT model        EUTB             TimeUserLDA
latest            michaeljackson         news             michaeljackson   news
headline          july                   world            jackson          jackson
news              breaking               breaking         mj               michael
investigative     news                   jackson          moonwalk         michaeljackson
michaeljackson    headline               michaeljackson   death            death
event             investigative          death            news             investigative

Topic   Period                 Top keywords (probability)
T77     2009.1.12-2009.1.31    obama 0.144, inauguration 0.106
T78     2009.6.15-2009.6.27    moon 0.090, space 0.068
T87     2009.4.24-2009.5.6     flu 0.158, swineflu 0.124
T89     2009.5.27-2009.6.6     google 0.061, googlewave 0.059
T60     2009.1.24-2009.1.27    droid 0.125, go 0.113
T71     2009.1.1-2009.1.6      2008 0.099, webcomics 0.046
・In the user-interview-based evaluation, the topics extracted by the proposed method (EUTB) were rated highly.
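The 75% and 67.5% figures quoted from the paper can be re-derived from the Table III rows. The short check below simply adds the "Excellent" and "Good" shares per method; the dictionary layout and the helper name are illustrative choices, while the numbers are copied from the table.

```python
# Table III rows as (Excellent, Good, Poor) percentages.
table3 = {
    "EUTB": (42.5, 32.5, 25.0),
    "TOT": (10.0, 40.0, 50.0),
    "Individual Detection": (20.0, 37.5, 42.5),
    "TimeUserLDA": (29.5, 38.0, 32.5),
    "Twitter-LDA": (13.5, 39.0, 47.5),
}

def excellent_or_good(row):
    """Share of topics labeled either Excellent or Good."""
    excellent, good, _poor = row
    return excellent + good

summary = {method: excellent_or_good(row) for method, row in table3.items()}

# Rank methods by the combined share, best first.
ranking = sorted(summary, key=summary.get, reverse=True)
```

Under this reading EUTB comes first and TimeUserLDA second, and the remaining methods sit between 50% and 57.5%, which matches the paper's remark that the other competitors got "merely or slightly more than 50%".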
7. Crowdsourced Enumeration Queries
・Search over an existing DB
  The data stored in the database is all there is (closed-world assumption).
・Search using crowdsourcing
  The data lives on the web and in people's heads, so the size of the population is unknown.
based on the CROWD annotations and optional fre
tations of columns and tables in the schema. Fig
an example HTML-based UI that would be pre
worker for the following crowd table definition:
CREATE CROWD TABLE ice_cream_flavor {
name VARCHAR PRIMARY KEY
}
Although CrowdDB supports alternate user inte
showing previously received answers), this pape
a pure form of the “getting it all” question.
alternative UIs is the subject of future work.
During query processing, the system automat
one or more HITs using the AMT web service A
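8. Crowdsourced Enumeration Queries
• Overview
  ◦ Estimates the size of the answer set for a crowdsourced search task.
  ◦ Applies a species-richness estimation technique from biostatistics (CHAO92).
• What is species-richness estimation?
  ◦ Survey the individuals found in a particular area and estimate the number of species and their density.
  ◦ A well-known application of the same technique is estimating how many dinosaur species existed on Earth (figure).
[Figure: "Estimating the diversity of dinosaurs" (Steve C. Wang and Peter Dodson)]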
9. Crowdsourced Enumeration Queries
• Many estimators exist, depending on whether they count in units of species, account for divergence between individuals, account for evenness, and so on.
  ◦ CHAO84 estimator
  ◦ CHAO92 estimator (the one used in this work)
    ◦ Uses the concept of sample coverage
Chao84 Estimator
Chao develops a simple estimator for species richness based solely on the number of rare species found in a sample:
$$\hat{N}_{chao84} = c + \frac{f_1^2}{2 f_2}$$
・c: number of species observed
・f1: number of species observed exactly once
・f2: number of species observed exactly twice
Chao found that it actually is a lower bound, but it performs well on her test data sets. She also found that the estimator works best when there are relatively rare species, which is often the case in real species estimation scenarios.
[Paper excerpt on the experiments, cut off at the right edge of the slide: Figure 5(a-c) shows the average cardinality estimates over time, i.e., for increasing numbers of HITs, for data sets including UN countries and ice cream flavors. The f1-ratio is defined as f_1 / \sum_i f_i and compares the singletons with the overall received items.]
Chao92 Estimator
・C: sample coverage (the sum of the probabilities p_i of the observed species)
Chao develops another estimator based on the notion of sample coverage. The sample coverage C is the sum of the probabilities p_i of the observed classes. However, since the underlying distribution p_1 ... p_N is unknown, this estimate from the Good-Turing estimator [20] is used:
$$\hat{C} = 1 - \frac{f_1}{n}$$
The Chao92 estimator attempts to explicitly characterize and incorporate the skew of the underlying distribution using the coefficient of variance (CV), denoted γ, a metric that can be used to describe the variance in a probability distribution [8]; we can use the CV to compare the skew of different class distributions. The CV is defined as the standard deviation divided by the mean. Given the p_i's (p_1 ... p_N) that describe the probability of the ith class being selected, with mean \bar{p} = \sum_i p_i / N = 1/N, the CV is expressed as
$$\gamma = \frac{\left[ \sum_i (p_i - \bar{p})^2 / N \right]^{1/2}}{\bar{p}}$$
[8]. A higher CV indicates higher variance amongst the p_i's, while a CV of 0 indicates that each item is equally likely.
The true CV cannot be calculated without knowledge of the p_i's, so Chao92 uses an estimate, \hat{\gamma}.
$$\hat{\gamma}^2 = \max\left\{ \frac{c}{\hat{C}} \cdot \frac{\sum_i i(i-1) f_i}{n(n-1)} - 1,\ 0 \right\} \quad (2)$$
The estimator that uses the coefficient of variance is below; note that if \hat{\gamma}^2 = 0 (i.e., indicating a uniform distribution), the estimator reduces to c / \hat{C}.
$$\hat{N}_{chao92} = \frac{c}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}} \hat{\gamma}^2$$
[Paper excerpt, cut off at the right edge of the slide: the experiments ran over 25,000 HITs on AMT. Section 3.3.1 "US States" reports that, for the US states (Figure 5(a)), the estimators perform fairly well.]
・Others: many further methods exist, such as the Abundance-based Coverage Estimator.
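The Chao84 and Chao92 formulas can be tried out on a small answer list. The following is a minimal sketch rather than the CrowdDB implementation: the function names are invented here, and the fallbacks for f2 = 0 and for zero coverage are assumptions added only to keep the example from dividing by zero.

```python
from collections import Counter

def frequency_counts(answers):
    """Return (n, c, f): sample size, distinct items, and f[i] = number
    of distinct items that were observed exactly i times."""
    per_item = Counter(answers)
    f = Counter(per_item.values())
    return len(answers), len(per_item), f

def chao84(answers):
    """N = c + f1^2 / (2 * f2)."""
    _n, c, f = frequency_counts(answers)
    if f[2] == 0:
        return float(c)  # assumption: fall back to the observed count
    return c + f[1] ** 2 / (2 * f[2])

def chao92(answers):
    """N = c / C + n * (1 - C) / C * gamma^2, with Good-Turing coverage C."""
    n, c, f = frequency_counts(answers)
    coverage = 1 - f[1] / n
    if coverage == 0:
        return float("inf")  # assumption: all singletons, no usable estimate
    skew_term = sum(i * (i - 1) * fi for i, fi in f.items()) / (n * (n - 1))
    gamma2 = max((c / coverage) * skew_term - 1, 0.0)
    return c / coverage + n * (1 - coverage) / coverage * gamma2

# Nine answers: A and D twice, B three times, C and E once each.
balanced = list("AABBBCDDE")
# Eight answers dominated by one item, so the estimated CV is positive.
skewed = list("AAAAAABC")

estimate_84 = chao84(balanced)
estimate_92 = chao92(balanced)
estimate_92_skewed = chao92(skewed)
```

For the balanced sample the estimated CV is clipped to zero, so Chao92 reduces to c / C = 45/7. For the skewed sample the CV term adds to the estimate, giving 148/21.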
10. Crowdsourced Enumeration Queries
• Tried estimating the number of countries with Chao92.
• Cause of the error: worker behavior
[Fig. 4. Estimated Cardinality. x-axis: # answers; y-axis: chao92 estimate; series: actual, expected]
[Fig. 5. Sampling Process. (a) Database Sampling, (b) Crowd Based Sampling; legend: sampling process with replacement, sampling process without replacement, worker processes, worker arrival process]
workers complete different amounts of work and arrive/depart
from the experiment at different points in time.
The next subsection formalizes a model of how answers
arrive from the crowd in response to a set enumeration query,
as well as a description of how crowd behaviors impact
the sample of answers received. We then use simulation to
demonstrate the principles of how these behaviors play off
one another and thereby influence an estimation algorithm.
B. A Model for Human Enumerations
Species estimation algorithms assume a with-replacement
sample from some unknown distribution describing item likelihoods
(visualized in Figure 5(a)). The order in which elements
1) Sampling Without Replacement: When a worker submits
multiple items for a set enumeration query, each answer is
different from his previous ones. In other words, individuals
are sampling without replacement from some underlying distribution
that describes the likelihood of selecting each answer.
Of course, this behavior is beneficial with respect to the goal of
acquiring all the items in the set, as low-probability items become
more likely after the high-probability items have already
been provided by that worker (we do not pay for duplicated
work from a single worker). A negative side effect of workers
sampling without replacement is that the estimation algorithm
receives less information about the relative frequency of items,
and thus the skew, of the underlying data distribution; having
・The estimate does not work well.
・In species estimation, samples are drawn from a distribution whose item likelihoods are unknown.
・In human enumeration, samples (answers) are drawn according to some underlying item distribution.
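The difference between the two sampling processes can be made concrete with a small simulation. This sketch is only an assumption-laden illustration of the idea in the paper excerpt, not the authors' simulator: the Zipf-like weights, the item names, the function names, and the worker sizes are all hypothetical.

```python
import random

def zipf_weights(n):
    """Hypothetical skewed popularity: weight 1/rank for rank 1..n."""
    return [1.0 / rank for rank in range(1, n + 1)]

def sample_with_replacement(items, weights, k, rng):
    """Database-style sampling: every draw is independent, repeats allowed."""
    return rng.choices(items, weights=weights, k=k)

def sample_without_replacement(items, weights, k, rng):
    """One worker: each answer differs from that worker's earlier answers."""
    pool = list(items)
    pool_weights = list(weights)
    answers = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=pool_weights, k=1)[0]
        answers.append(pool.pop(idx))
        pool_weights.pop(idx)
    return answers

def crowd_answers(items, weights, answers_per_worker, rng):
    """Concatenate the answers of several workers, in arrival order."""
    stream = []
    for k in answers_per_worker:
        stream.extend(sample_without_replacement(items, weights, k, rng))
    return stream

items = [f"item{i}" for i in range(20)]
weights = zipf_weights(len(items))
rng = random.Random(0)

db_sample = sample_with_replacement(items, weights, 30, rng)
# Two ordinary workers with 5 answers each, then one worker who lists everything.
crowd_sample = crowd_answers(items, weights, [5, 5, 20], rng)
last_worker = crowd_sample[10:]
```

A worker who enumerates the whole set contributes every item exactly once, including the rare ones, so the stream carries less information about relative item frequency than a with-replacement sample of the same length.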
11. Crowdsourced Enumeration Queries
• Existence of streakers
  ◦ Some respondents provide the complete answer set (streakers).
  ◦ Characteristically, they sample without replacement. As a result, the estimate ends up higher than the true value.
  ◦ Verified by adding a streaker who answers all 200 items at the start.
[Figure: [...] simulations illustrating the impact of worker behaviors. Panels: (b) forms of skew, (c) impact of streaker; x-axis: # answers; y-axis: chao92 estimate; series: ws=T/dd=T, ws=F/dd=T, ws=T/dd=F, ws=F/dd=F]
13. On Incentive-based Tagging
• Of 30 million URLs, 39% have only a single tag (Fig. 1).
• Is it enough to simply increase the number of posts? → Quality actually gets worse (Fig. 2).
18. On Incentive-based Tagging
• Maximize the quality improvement by optimizing how incentives are allocated under a limited budget.
[Figure labels: Selected; Quality Metric for Tag Data]
・The challenge is deciding how to allocate the budget to workers.
23. On Incentive-based Tagging
• The following five strategies are proposed:
  ◦ Free Choice (FC)
  ◦ Round Robin (RR)
  ◦ Fewest Post First (FP): prioritizes resources that have not been tagged yet
  ◦ Most Unstable First (MU): looks at the rfd (Relative Frequency Distribution) value and picks the most uncertain resource
  ◦ Hybrid (FP-MU)
• These strategies are compared against DP (the theoretically optimal solution).
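Two of the listed strategies are simple enough to sketch from their names and the one-line descriptions on the slide. The code below is an illustration under stated assumptions, not the paper's algorithm: it assumes each unit of budget buys exactly one additional post, that resources are identified by hypothetical URL labels, and that ties in Fewest Post First are broken by label order. Most Unstable First and the DP baseline are omitted because the slide does not define the rfd computation.

```python
import heapq

def round_robin(post_counts, budget):
    """Round Robin (RR): hand out tagging tasks to resources in turn."""
    counts = dict(post_counts)
    names = list(counts)
    for k in range(budget):
        counts[names[k % len(names)]] += 1
    return counts

def fewest_post_first(post_counts, budget):
    """Fewest Post First (FP): always pay for a post on the resource
    that currently has the fewest posts."""
    counts = dict(post_counts)
    heap = [(n, name) for name, n in counts.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        n, name = heapq.heappop(heap)
        counts[name] = n + 1
        heapq.heappush(heap, (n + 1, name))
    return counts

# Hypothetical starting state: current number of posts per URL.
posts = {"url_a": 0, "url_b": 1, "url_c": 5}

after_rr = round_robin(posts, budget=4)
after_fp = fewest_post_first(posts, budget=4)
```

With a budget of 4, Round Robin spends one unit on the already well-tagged url_c, while Fewest Post First spends everything on the two sparsely tagged URLs.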
24. On Incentive-based Tagging
• FP and FP-MU are close to optimal.
• Budget = 1,000
  ◦ 0.7% more posts compared with the initial number
  ◦ 6.7% quality improvement
• Free Choice: 50% of the posts are over-tagging and therefore wasted.