1. Best of both worlds:
human-machine
collaboration for object
annotation (CVPR2015)
Olga Russakovsky@1, Li-Jia Li@2, Li Fei-Fei@1
@1: Stanford University, @2: Snapchat(Yahoo! Labs)
Presenter: Seitaro Shinagawa (NAIST)
paper reading
※ All images are quoted from the paper.
2. Notes
• "bbox" is short for "bounding box"
• "TP" is the True Positive rate:
  TP = num(answered "yes" when "yes" is correct) / num(questions whose correct answer is "yes")
• "TN" is the True Negative rate:
  TN = num(answered "no" when "no" is correct) / num(questions whose correct answer is "no")
• reference numbers are the same as in the paper
[paper] http://ai.stanford.edu/~olga/papers/RussakovskyCVPR15.pdf
[supplements] http://ai.stanford.edu/~olga/papers/RussakovskyCVPR15_supp.pdf
[CVPR poster] http://ai.stanford.edu/~olga/posters/cvpr15-poster.pdf
[slides made by first author] http://ai.stanford.edu/~olga/slides/best_of_both_worlds_slides.pdf
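Given a set of worker responses, the TP/TN rates defined above can be computed directly; a minimal sketch (the function name and data layout are my own, not from the paper):

```python
def tp_tn_rates(records):
    """Compute TP/TN rates from (correct_answer, worker_answer) pairs.

    TP = fraction of questions whose correct answer is "yes"
         that the worker actually answered "yes";
    TN = the analogous fraction for "no".
    """
    yes_total = sum(1 for correct, _ in records if correct == "yes")
    no_total = sum(1 for correct, _ in records if correct == "no")
    yes_hits = sum(1 for c, a in records if c == "yes" and a == "yes")
    no_hits = sum(1 for c, a in records if c == "no" and a == "no")
    tp = yes_hits / yes_total if yes_total else 0.0
    tn = no_hits / no_total if no_total else 0.0
    return tp, tn

# 3 questions where "yes" is correct (2 answered correctly),
# 2 where "no" is correct (both answered correctly)
records = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
           ("no", "no"), ("no", "no")]
print(tp_tn_rates(records))  # (0.666..., 1.0)
```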
7. Related Work (1/3)
Recognition with humans in the loop
image classification [6,59,12]
image segmentation [26]
attribute-based classification [32,40,3]
image clustering [34]
image annotation [54,55,47]
human interaction [31]
object annotation in video [58]
[6,59,12,60] discuss the trade-off between human time and
annotation accuracy in human-machine collaboration
→ but they use only a single type of human response
[26,13,54] use multiple feedback modalities (with varying costs)
and predict the success of each modality
→ but they do not incorporate iterative improvement
8. Related Work (2/3)
Better object detection
weakly supervised data [42,23,52,8,24,15]
active learning [32,56]
mining the web for object names and exemplars [8,11,15]
→ these aim to minimize human annotation
9. Related Work (3/3)
Cheaper manual annotation
some developments in crowdsourcing techniques:
・annotation games [57,12,30]
・tricks to reduce the annotation search space [13,4]
・effective user interface design [50,58]
・making use of existing annotations [5]
・making use of weak human supervision [26,7]
・accurately computing the number of required workers [46]
[10,46,28,62] use iterative improvement to perform a task with
high accuracy per unit of human cost
10. Authors (Stanford Vision Lab team)
Olga Russakovsky
• postdoctoral fellow at Carnegie Mellon Univ.
(PhD student when this paper was published)
• large-scale recognition, ML, HCI
Li-Jia Li
• Snapchat (for this paper only)
• PhD degree from Stanford Univ.
Li Fei-Fei
• Associate Professor, Stanford Univ.
• a giant of computer vision
(Crowdsourcing) + (large-scale object recognition)
+ (reducing annotation cost) = this paper?
17. Requests from the system to a human
Each task (an MDP action), its question template, and its measured TP rate, TN rate, and cost:
Verify-box: "Is box B tight around an instance of class C?" (TP 0.87, TN 0.98, cost 5.34 s)
Verify-image: "Does the image contain an object of class C?" (TP 0.77, TN 0.93, cost 5.89 s)
Verify-cover: "Are there more instances of class C not covered by the set of boxes ℬ?" (TP 0.75, TN 0.74, cost 7.57 s)
Draw-box: "Draw a new instance of class C not already in the set of boxes ℬ." (TP 0.72, TN 0.84, cost 10.21 s)
Name-image: "Name an object class in the image besides the known object classes 𝒞." (TP 0.71, TN 0.96, cost 5.71 s)
Verify-object: "Is box B tight around some object?" (TP 0.75, TN 0.92, cost 9.67 s)
Name-box: "If box B is tight around an object other than the objects in 𝒞_B, name the object." (TP 0.98, TN 0.88, cost 9.46 s)
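Using the TP/TN/cost figures in the table, the tasks can be compared by a simple accuracy-per-second heuristic. This ranking is my own illustration, not the paper's MDP policy, and the uniform prior_yes = 0.5 is an assumption:

```python
# (task, TP, TN, cost in seconds), taken from the table above
tasks = [
    ("Verify-box",    0.87, 0.98, 5.34),
    ("Verify-image",  0.77, 0.93, 5.89),
    ("Verify-cover",  0.75, 0.74, 7.57),
    ("Draw-box",      0.72, 0.84, 10.21),
    ("Name-image",    0.71, 0.96, 5.71),
    ("Verify-object", 0.75, 0.92, 9.67),
    ("Name-box",      0.98, 0.88, 9.46),
]

def accuracy_per_second(task, prior_yes=0.5):
    """Expected answer accuracy per second of human time, assuming "yes"
    is the correct answer with probability prior_yes (a simplification)."""
    name, tp, tn, cost = task
    accuracy = prior_yes * tp + (1 - prior_yes) * tn
    return accuracy / cost

ranked = sorted(tasks, key=accuracy_per_second, reverse=True)
print([t[0] for t in ranked[:3]])  # ['Verify-box', 'Name-image', 'Verify-image']
```

Under this crude heuristic the cheap verification tasks dominate, which matches the intuition that the system should prefer quick binary checks before expensive drawing tasks.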
18. CV model for each task
Verify-box: "Is box B tight around an instance of class C?" → P(det(B, C) | I)
Verify-image: "Does the image contain an object of class C?" → P(cls(C) | I)
Verify-cover: "Are there more instances of class C not covered by the set of boxes ℬ?" → P(more(ℬ, C) | I)
Draw-box: "Draw a new instance of class C not already in the set of boxes ℬ." → P(morecls(C) | I)
Name-image: "Name an object class in the image besides the known object classes 𝒞." → P(morecls(C) | I)
Verify-object: "Is box B tight around some object?" → P(box B is a tight bbox around some object)
Name-box: "If box B is tight around an object other than the objects in 𝒞_B, name the object." → P(new(B, C))
19.
P(new(B, C)) = P(obj(B)) · ∏_{C′∈𝒞} (1 − P(det(B, C′)))

P(more(ℬ, C) | I) = P(cls(C) | I)   if n = 0
                  = P(more | n)     otherwise

n = round_to_nearest_int(𝔼[n_c(ℬ, C)])
𝔼[n_c(ℬ, C)] = Σ_{B∈ℬ} P(det(B, C) | I)

n_c(ℬ, C): the number of bboxes in the set ℬ that correctly cover an instance of class C
20. Verify-box (Task 1/7)
focus on an object
(existence known)
(bbox exists)
(bbox quality unknown)
Q: Is the bbox tight around the object? (yes/no)
Request answer
In this case, "yes" is the correct answer.
21. Verify-image (Task 2/7)
focus on an object
(existence unknown)
Q: Does the object exist in the image? (yes/no)
Request answer
In this case, "no" is the correct answer.
22. Verify-cover (Task 3/7)
focus on multiple objects
at least one object exists
(existence known)
(bbox exists)
(bbox quality known)
however, whether more objects exist is unknown
Request answer
Q: Are all of the objects completely annotated? (yes/no)
In this case, "no" is the correct answer.
23. Draw-box (Task 4/7)
focus on multiple objects
at least one object exists
(existence known)
(bbox exists)
(bbox quality known)
however, whether more objects exist is unknown
Request answer
Q: Are all of the objects completely annotated?
(no → draw a box / yes)
In this case, "no" is the correct answer.
24. Name-image (Task 5/7)
some objects
(existence known)
unannotated objects
(existence unknown)
Request answer
Q: Are there any unannotated objects in the image?
(yes → input the name of the object / no)
In this case, "umbrella" is an example answer.
25. Verify-object (Task 6/7)
focus on a bbox
(bbox exists)
(bbox quality unknown)
Request answer
Q: Is this bbox good? (yes/no)
In this case, "yes" is the correct answer.
difference from Verify-box:
no specific object class is targeted
26. Name-box (Task 7/7)
Request answer
focus on a bbox
(bbox exists)
(bbox quality may be good)
(object name unknown)
Q: Is this bbox good?
(yes → input the object name / no)
In this case, "no" is the correct answer.
27. Experiment Setup
dataset: ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) 2014 detection dataset
400K training images, 20K validation images
validation is split into val1 and val2 (val2 is used as the test set)
val2 contains 2,216 images, each with at least 4 annotated objects
CV model:
object detector → pretrained R-CNN [Girshick et al. 2014]
trained on the ILSVRC2013 detection training set
detections and classifications with probability < 0.1 are discarded
non-maximum suppression is applied to the detector outputs, for two reasons:
1) to avoid detecting the same object multiple times
2) to reduce computation
the target for a successful annotation is IoU = 0.7
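The NMS step described above (greedily keep the highest-scoring box and suppress overlapping lower-scoring ones) can be sketched as follows; the 0.3 overlap threshold and all names are my own assumptions, not from the slide:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(detections, iou_thresh=0.3):
    """Greedy NMS: keep boxes in descending score order, dropping any box
    that overlaps an already-kept box above the threshold."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
print(nms(dets))  # keeps the 0.9 and 0.7 boxes; the overlapping 0.8 box is suppressed
```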
28. Intersection over Union (IoU)
the higher the IoU, the better the bbox
however, "a large fraction of the object lies inside the bbox" does not
always imply a high IoU; for some objects (e.g., a corkscrew) it is hard
for current CV techniques to achieve a high IoU ⇒ human help is needed
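The point above can be made concrete: a loose box that fully contains the object still scores far below the IoU = 0.7 target (a sketch; the box coordinates are illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

gt = (10, 10, 20, 20)   # ground-truth object, area 100
loose = (0, 0, 40, 40)  # fully contains the object, area 1600
print(iou(gt, loose))   # 100 / 1600 = 0.0625, far below the 0.7 target
```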
29. Experimental Results
setting:
2K images of the ILSVRC2014 detection
validation set (that have at least 4 objects)
• Computer Vision model + Human outperforms the other methods
• "CV only" corresponds to Budget = 0 (no human involvement)
• for budgets below 120 s, CV+H outperforms the others
• the MDP is effective
• ILSVRC-DET [43] also uses human-in-the-loop annotation,
but preparing annotators to draw bboxes takes a long time:
446.9 s/image
• only binary questions
30. Constraints (𝑼∗, 𝑷∗, 𝑩∗)
Utility of returned labeling 𝑼
• Req. prec → requested precision 𝑷∗
• higher precision ⇒ lower utility
• this can be interpreted as the system becoming more cautious
Fraction of feasible images
• Req. util → requested utility 𝑼∗
• the fraction of images whose obtained utility 𝑼 satisfies 𝑼 ≥ 𝑼∗
Precision of returned labeling 𝑷
• the expected precision of the labeling
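The "fraction of feasible images" metric is a simple threshold count; a minimal sketch (names are mine):

```python
def fraction_feasible(utilities, u_star):
    """Fraction of images whose obtained utility U meets the requested U*."""
    return sum(1 for u in utilities if u >= u_star) / len(utilities)

# 4 images with obtained utilities; requested utility U* = 0.8
print(fraction_feasible([0.9, 0.75, 0.6, 0.85], u_star=0.8))  # 2/4 = 0.5
```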
33. Good/Bad Annotations
(instructions for crowdsourcing)
Good
• tight bbox
• each bbox covers most of
an object
Bad
• redundant bboxes
• a bbox covers only a part
of an object
• a bbox covers multiple objects