1. Best of both worlds:
human-machine
collaboration for object
annotation (CVPR2015)
Olga Russakovsky@1, Li-Jia Li@2, Li Fei-Fei@1
@1: Stanford University, @2: Snapchat(Yahoo! Labs)
Presenter: Seitaro Shinagawa (NAIST)
paper reading
※ All images are quoted from the paper.
2. Notes
• "bbox" is short for "bounding box"
• "TP" is the True Positive rate:
  TP = num(answered "yes" when "yes" is correct) / num(questions whose correct answer is "yes")
• "TN" is the True Negative rate:
  TN = num(answered "no" when "no" is correct) / num(questions whose correct answer is "no")
• reference numbers are the same as in the paper
[paper] http://ai.stanford.edu/~olga/papers/RussakovskyCVPR15.pdf
[supplements] http://ai.stanford.edu/~olga/papers/RussakovskyCVPR15_supp.pdf
[CVPR poster] http://ai.stanford.edu/~olga/posters/cvpr15-poster.pdf
[slides made by first author] http://ai.stanford.edu/~olga/slides/best_of_both_worlds_slides.pdf
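Given a set of worker responses, the TP/TN rates defined above can be computed directly; a minimal sketch (the function name and data layout are my own, not from the paper):

```python
def tp_tn_rates(records):
    """Compute TP/TN rates from (correct_answer, worker_answer) pairs.

    TP = fraction of questions whose correct answer is "yes"
         that the worker actually answered "yes";
    TN = the analogous fraction for "no".
    """
    yes_total = sum(1 for correct, _ in records if correct == "yes")
    no_total = sum(1 for correct, _ in records if correct == "no")
    yes_hits = sum(1 for c, a in records if c == "yes" and a == "yes")
    no_hits = sum(1 for c, a in records if c == "no" and a == "no")
    tp = yes_hits / yes_total if yes_total else 0.0
    tn = no_hits / no_total if no_total else 0.0
    return tp, tn

# 3 questions where "yes" is correct (2 answered correctly),
# 2 where "no" is correct (both answered correctly)
records = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
           ("no", "no"), ("no", "no")]
print(tp_tn_rates(records))  # (0.666..., 1.0)
```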
7. Related Work (1/3)
Recognition with humans in the loop
image classification [6,59,12]
image segmentation [26]
attribute-based classification [32,40,3]
image clustering [34]
image annotation [54,55,47]
human interaction [31]
object annotation in video [58]
[6,59,12,60] discuss the trade-off between human time and
annotation accuracy in human-machine collaboration
→ but they use only a single type of human response
[26,13,54] use multiple feedback modalities (with varying costs)
and predict the success of each modality
→ but they do not incorporate iterative improvement
8. Related Work (2/3)
Better object detection
weakly supervised data [42,23,52,8,24,15]
active learning [32,56]
mining the web for object names and exemplars [8,11,15]
→ these aim to minimize human annotation
9. Related Work (3/3)
Cheaper manual annotation
some developments in crowdsourcing techniques:
・annotation games [57,12,30]
・tricks to reduce the annotation search space [13,4]
・effective user interface design [50,58]
・making use of existing annotations [5]
・making use of weak human supervision [26,7]
・accurately computing the number of required workers [46]
[10,46,28,62] use iterative improvement to perform a task with
high accuracy per unit of human cost
10. Authors (Stanford Vision Lab team)
Olga Russakovsky
• postdoctoral fellow at Carnegie Mellon Univ.
(PhD student when this paper was published)
• large-scale recognition, ML, HCI
Li-Jia Li
• Snapchat (for this paper only)
• PhD degree from Stanford Univ.
Li Fei-Fei
• Associate Professor, Stanford Univ.
• a giant of computer vision
(Crowdsourcing) + (large-scale object recognition)
+ (reducing annotation cost) = this paper?
17. Requests from the system to a human
Each task (an MDP action), its question template, and its measured TP rate, TN rate, and cost:
Verify-box: "Is box B tight around an instance of class C?" (TP 0.87, TN 0.98, cost 5.34 s)
Verify-image: "Does the image contain an object of class C?" (TP 0.77, TN 0.93, cost 5.89 s)
Verify-cover: "Are there more instances of class C not covered by the set of boxes ℬ?" (TP 0.75, TN 0.74, cost 7.57 s)
Draw-box: "Draw a new instance of class C not already in the set of boxes ℬ." (TP 0.72, TN 0.84, cost 10.21 s)
Name-image: "Name an object class in the image besides the known object classes 𝒞." (TP 0.71, TN 0.96, cost 5.71 s)
Verify-object: "Is box B tight around some object?" (TP 0.75, TN 0.92, cost 9.67 s)
Name-box: "If box B is tight around an object other than the objects in 𝒞_B, name the object." (TP 0.98, TN 0.88, cost 9.46 s)
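Using the TP/TN/cost figures in the table, the tasks can be compared by a simple accuracy-per-second heuristic. This ranking is my own illustration, not the paper's MDP policy, and the uniform prior_yes = 0.5 is an assumption:

```python
# (task, TP, TN, cost in seconds), taken from the table above
tasks = [
    ("Verify-box",    0.87, 0.98, 5.34),
    ("Verify-image",  0.77, 0.93, 5.89),
    ("Verify-cover",  0.75, 0.74, 7.57),
    ("Draw-box",      0.72, 0.84, 10.21),
    ("Name-image",    0.71, 0.96, 5.71),
    ("Verify-object", 0.75, 0.92, 9.67),
    ("Name-box",      0.98, 0.88, 9.46),
]

def accuracy_per_second(task, prior_yes=0.5):
    """Expected answer accuracy per second of human time, assuming "yes"
    is the correct answer with probability prior_yes (a simplification)."""
    name, tp, tn, cost = task
    accuracy = prior_yes * tp + (1 - prior_yes) * tn
    return accuracy / cost

ranked = sorted(tasks, key=accuracy_per_second, reverse=True)
print([t[0] for t in ranked[:3]])  # ['Verify-box', 'Name-image', 'Verify-image']
```

Under this crude heuristic the cheap verification tasks dominate, which matches the intuition that the system should prefer quick binary checks before expensive drawing tasks.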
18. CV model for each task
Verify-box: "Is box B tight around an instance of class C?" → P(det(B, C) | I)
Verify-image: "Does the image contain an object of class C?" → P(cls(C) | I)
Verify-cover: "Are there more instances of class C not covered by the set of boxes ℬ?" → P(more(ℬ, C) | I)
Draw-box: "Draw a new instance of class C not already in the set of boxes ℬ." → P(morecls(C) | I)
Name-image: "Name an object class in the image besides the known object classes 𝒞." → P(morecls(C) | I)
Verify-object: "Is box B tight around some object?" → P(box B is a tight bbox around some object)
Name-box: "If box B is tight around an object other than the objects in 𝒞_B, name the object." → P(new(B, C))
19.
P(new(B, C)) = P(obj(B)) · ∏_{C′∈𝒞} (1 − P(det(B, C′)))

P(more(ℬ, C) | I) = P(cls(C) | I)   if n = 0
                  = P(more | n)     otherwise

n = round_to_nearest_int(𝔼[n_c(ℬ, C)])
𝔼[n_c(ℬ, C)] = Σ_{B∈ℬ} P(det(B, C) | I)

n_c(ℬ, C): the number of bboxes in the set ℬ that correctly cover an instance of class C
20. Verify-box (Task 1/7)
focus on an object
(existence known)
(bbox exists)
(bbox quality unknown)
Q: Is the bbox tight around the object? (yes/no)
Request answer
In this case, "yes" is the correct answer.
21. Verify-image (Task 2/7)
focus on an object
(existence unknown)
Q: Does the object exist in the image? (yes/no)
Request answer
In this case, "no" is the correct answer.
22. Verify-cover (Task 3/7)
focus on multiple objects
at least one object exists
(existence known)
(bbox exists)
(bbox quality known)
however, whether more objects exist is unknown
Request answer
Q: Are all of the objects completely annotated? (yes/no)
In this case, "no" is the correct answer.
23. Draw-box (Task 4/7)
focus on multiple objects
at least one object exists
(existence known)
(bbox exists)
(bbox quality known)
however, whether more objects exist is unknown
Request answer
Q: Are all of the objects completely annotated?
(no → draw a box / yes)
In this case, "no" is the correct answer.
24. Name-image (Task 5/7)
some objects
(existence known)
unannotated objects
(existence unknown)
Request answer
Q: Are there any unannotated objects in the image?
(yes → input the name of the object / no)
In this case, "umbrella" is an example answer.
25. Verify-object (Task 6/7)
focus on a bbox
(bbox exists)
(bbox quality unknown)
Request answer
Q: Is this bbox good? (yes/no)
In this case, "yes" is the correct answer.
difference from Verify-box:
no specific object class is targeted
26. Name-box (Task 7/7)
Request answer
focus on a bbox
(bbox exists)
(bbox quality may be good)
(object name unknown)
Q: Is this bbox good?
(yes → input the object name / no)
In this case, "no" is the correct answer.
27. Experiment Setup
dataset: ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) 2014 detection dataset
400K training images, 20K validation images
validation is split into val1 and val2 (val2 is used as the test set)
val2 contains 2,216 images, each with at least 4 annotated objects
CV model:
object detector → pretrained R-CNN [Girshick et al. 2014]
trained on the ILSVRC2013 detection training set
detections and classifications with probability < 0.1 are discarded
non-maximum suppression is applied to the detector outputs, for two reasons:
1) to avoid detecting the same object multiple times
2) to reduce computation
the target for a successful annotation is IoU = 0.7
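The NMS step described above (greedily keep the highest-scoring box and suppress overlapping lower-scoring ones) can be sketched as follows; the 0.3 overlap threshold and all names are my own assumptions, not from the slide:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(detections, iou_thresh=0.3):
    """Greedy NMS: keep boxes in descending score order, dropping any box
    that overlaps an already-kept box above the threshold."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
print(nms(dets))  # keeps the 0.9 and 0.7 boxes; the overlapping 0.8 box is suppressed
```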
28. Intersection over Union (IoU)
the higher the IoU, the better the bbox
however, "a large fraction of the object lies inside the bbox" does not
always imply a high IoU; for some objects (e.g., a corkscrew) it is hard
for current CV techniques to achieve a high IoU ⇒ human help is needed
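The point above can be made concrete: a loose box that fully contains the object still scores far below the IoU = 0.7 target (a sketch; the box coordinates are illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

gt = (10, 10, 20, 20)   # ground-truth object, area 100
loose = (0, 0, 40, 40)  # fully contains the object, area 1600
print(iou(gt, loose))   # 100 / 1600 = 0.0625, far below the 0.7 target
```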
29. Experimental Results
setting:
2K images of the ILSVRC2014 detection
validation set (that have at least 4 objects)
• Computer Vision model + Human outperforms the other methods
• "CV only" corresponds to Budget = 0 (no human involvement)
• for budgets below 120 s, CV+H outperforms the others
• the MDP is effective
• ILSVRC-DET [43] also uses human-in-the-loop annotation,
but preparing annotators to draw bboxes takes a long time:
446.9 s/image
• only binary questions
30. Constraints (𝑼∗, 𝑷∗, 𝑩∗)
Utility of returned labeling 𝑼
• Req. prec → requested precision 𝑷∗
• higher precision ⇒ lower utility
• this can be interpreted as the system becoming more cautious
Fraction of feasible images
• Req. util → requested utility 𝑼∗
• the fraction of images whose obtained utility 𝑼 satisfies 𝑼 ≥ 𝑼∗
Precision of returned labeling 𝑷
• the expected precision of the labeling
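The "fraction of feasible images" metric is a simple threshold count; a minimal sketch (names are mine):

```python
def fraction_feasible(utilities, u_star):
    """Fraction of images whose obtained utility U meets the requested U*."""
    return sum(1 for u in utilities if u >= u_star) / len(utilities)

# 4 images with obtained utilities; requested utility U* = 0.8
print(fraction_feasible([0.9, 0.75, 0.6, 0.85], u_star=0.8))  # 2/4 = 0.5
```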
33. Good/Bad Annotations
(instructions for crowdsourcing)
Good
• tight bbox
• each bbox covers most of
an object
Bad
• redundant bboxes
• a bbox covers only a part
of an object
• a bbox covers multiple objects