Yolo v1

Paper reading group: YOLO_v1
Date: 2019/03/06
Author: 毛利拓也
https://arxiv.org/abs/1506.02640

1
Introduction
• YOLO keeps evolving and now comes in three models. The latest is v3; this talk covers the v1 paper.
• https://pjreddie.com/darknet/yolov1/
• https://pjreddie.com/darknet/yolov2/
• https://pjreddie.com/darknet/yolo/
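• YOLO is a model capable of real-time object detection. YOLO_v2 combines accuracy (mAP) with speed (FPS) and outperforms SSD. "Old YOLO" in the chart is YOLO_v1.
• (Chart excerpted from the v2 page linked above)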

2
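Introduction
• Goals
1. Understand the characteristics of YOLO_v1.
2. Understand how YOLO_v1 inference works.
3. Understand how YOLO_v1 training works, especially the loss function.
➜ The explanation focuses on Section 1 (characteristics) and Section 2 (mechanism).
• Scope
• Abstract
• 1.Introduction
• 2.Unified Detection
• 3.Comparison to Other Detection Systems
• 4.Experiments
• 5.Real-Time Detection In The Wild
• 6.Conclusion
Covered today: Abstract, 1.Introduction, 2.Unified Detection
The Conclusion is omitted because it repeats the Abstract.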

3
Abstract
• Prior work on object detection repurposes classifiers to perform
detection. Instead, we frame object detection as a regression
problem to spatially separated bounding boxes and associated class
probabilities.
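➜ Prior models repurpose classifiers for object detection. The authors instead frame object detection as a regression problem to spatially separated bounding boxes and their class probabilities.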
• Our unified architecture is extremely fast.～while still achieving
double the mAP of other real-time detectors.
➜ YOLO is fast (45 FPS), yet its mAP is double that of other real-time detectors (Table 1).
• Compared to state-of-the-art detection systems, YOLO makes more
localization errors but is less likely to predict false positives on
background.
➜ Compared with R-CNN, YOLO makes more localization errors but is less likely to mistake background for an object.

4
1.Introduction
• YOLO is refreshingly simple: see Figure 1. A single convolutional
network simultaneously predicts multiple bounding boxes and class
probabilities for those boxes.
➜ As Figure 1 shows, YOLO is refreshingly simple. A single convolutional network simultaneously predicts multiple bounding boxes and their class probabilities.
• This unified model has several benefits over traditional methods of
object detection. First, YOLO is extremely fast.
➜ This unified model has several benefits over traditional models. First, it is extremely fast.

5
1.Introduction
• Second, YOLO reasons globally about the image when making
predictions. Unlike sliding window and region proposal-based
techniques, YOLO sees the entire image
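➜ Second, YOLO reasons over the whole image when predicting. Unlike sliding-window and region-proposal techniques, YOLO sees the entire image.
(Fast R-CNN's region proposals cannot see the whole image, so it often mistakes background for objects.)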
• Third, YOLO learns generalizable representations of objects.
➜ Third, YOLO learns generalizable representations of objects.
(Even when trained on natural images and tested on artwork, YOLO outperforms other methods.)
• While it can quickly identify objects in images it struggles to
precisely localize some objects, especially small ones.
➜ YOLO identifies objects quickly but struggles to localize them precisely, especially when several small objects appear together.

6
2.Unified Detection
• Our system divides the input image into an S × S grid. If the center
of an object falls into a grid cell, that grid cell is responsible for
detecting that object.
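➜ YOLO divides the input image into an S×S grid. The grid cell that contains the center of an object is responsible for predicting it. (The network is trained so that only the cell containing the object's center predicts its bounding box; explained in 2.2.)
Figure: only the 3 cells that contain an object center predict bounding boxes.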
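As a minimal sketch of the assignment described above, the responsible cell can be found by scaling the object's center into grid units. The function name `responsible_cell` and the pixel-coordinate convention are assumptions made for illustration; they do not come from the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object center.

    cx, cy: object center in pixels; img_w, img_h: image size in pixels.
    """
    # Clamp so a center lying exactly on the right/bottom edge stays in the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# A center at (224, 100) in a 448x448 image falls in row 1, column 3 of a 7x7 grid.
cell = responsible_cell(224, 100, 448, 448)
```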

7
2.Unified Detection
• Each grid cell predicts B bounding boxes and confidence scores for
those boxes. These confidence scores reflect how confident the
model is that the box contains an object and also how accurate it
thinks the box is that it predicts.
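➜ Each grid cell predicts B bounding boxes and a confidence score for each. The confidence reflects how likely the box is to contain an object and how accurate the predicted box is.
Figure: the blue cell predicting 2 bounding boxes (B=2).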

8
2.Unified Detection
• Formally we define confidence as $\Pr(\text{Object}) \ast \text{IOU}^{\text{truth}}_{\text{pred}}$. If no
object exists in that cell, the confidence scores should be zero.
Otherwise we want the confidence score to equal the intersection
over union (IOU) between the predicted box and the ground truth.
① No object in the cell: Pr(Object)=0, so the training target for confidence is 0.
② Object in the cell: Pr(Object)=1, so the training target for confidence is the IOU.
③ However, a cell that overlaps the object but does not contain its center has Pr(Object)=1 yet creates no bounding box during training, so its confidence target is 0.

9
2.Unified Detection
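• The overlap between the prediction (bounding box) and the ground truth (training label) is evaluated with IOU.
• IOU (Intersection over Union) scores the overlap of two regions in the range 0≦IOU≦1: IOU=1 means the predicted and ground-truth regions are identical, and IOU=0 means they do not overlap at all.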
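The IOU definition above can be sketched in a few lines. The corner format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width/height are clipped at 0 when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


same = iou((0, 0, 2, 2), (0, 0, 2, 2))      # identical boxes
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
partial = iou((0, 0, 2, 2), (1, 0, 3, 2))   # half of each box overlaps
```

Identical boxes give 1, disjoint boxes give 0, and the half-overlapping pair gives 2/6 = 1/3.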

10
2.Unified Detection
• Each bounding box consists of 5 predictions: x, y, w, h, and
confidence. The (x, y) coordinates represent the center of the box
relative to the bounds of the grid cell. The width and height are
predicted relative to the whole image. Finally the confidence
prediction represents the IOU between the predicted box and any
ground truth box.
➜ A bounding box consists of (x, y, w, h, F). The coordinates (x, y) give the position of the box center relative to its grid cell, and (w, h) give the box size relative to the whole image. The confidence F is the localization confidence, expressed as the IOU between the ground-truth and predicted boxes.
Figure: bounding-box confidence F = IOU
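To make the two coordinate conventions concrete, here is a small sketch that converts one cell-relative prediction into pixel corner coordinates. The function name `decode_box`, the argument order, and the assumption that x, y, w, h are all normalized to [0, 1] are illustrative choices, not details taken from the slides.

```python
def decode_box(x, y, w, h, row, col, img_w, img_h, S=7):
    """Convert a cell-relative prediction to (x_min, y_min, x_max, y_max) in pixels.

    x, y: center offset inside grid cell (row, col), each in [0, 1].
    w, h: box size as a fraction of the whole image, each in [0, 1].
    """
    cx = (col + x) / S * img_w  # center: cell index plus offset, scaled to pixels
    cy = (row + y) / S * img_h
    bw = w * img_w              # size: fraction of the full image
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# Center of the middle cell (3, 3), box covering half the image in each direction.
box = decode_box(0.5, 0.5, 0.5, 0.5, row=3, col=3, img_w=448, img_h=448)
```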

11
2.Unified Detection
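• YOLO decides whether to display a bounding box by thresholding its confidence (IOU), for example IOU≧0.5.
• Incidentally, lowering the threshold to IOU≧0 also displays the low-confidence boxes, so all 7×7 (grid cells) × 2 (boxes per cell) = 98 bounding boxes are shown.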
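The effect of the threshold can be sketched with placeholder confidences. The values below are random numbers standing in for network output, and `count_displayed` is a hypothetical helper name.

```python
import random

S, B = 7, 2  # grid size and boxes per cell, as in the slides

# Placeholder confidences in [0, 1), one per box; a real model would supply these.
random.seed(0)
confidence = [[[random.random() for _ in range(B)] for _ in range(S)] for _ in range(S)]


def count_displayed(confidence, threshold):
    """Count the boxes whose confidence is at or above the threshold."""
    return sum(c >= threshold for row in confidence for cell in row for c in cell)


all_boxes = count_displayed(confidence, 0.0)   # threshold 0 keeps every box
some_boxes = count_displayed(confidence, 0.5)  # a higher threshold keeps fewer
```

With the threshold at 0, `all_boxes` equals 7 × 7 × 2 = 98.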

12
2.Unified Detection
• Each grid cell ～. We only predict one set of class probabilities per
grid cell, regardless of the number of boxes B. At test time we
multiply the conditional class probabilities and the individual box
confidence predictions,
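$\Pr(\text{Class}_i \mid \text{Object}) \ast \Pr(\text{Object}) \ast \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \ast \text{IOU}^{\text{truth}}_{\text{pred}}$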
These scores encode both the probability of that class appearing in
the box and how well the predicted box fits the object.
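➜ Even though each grid cell produces B bounding boxes, only one set of class probabilities is predicted per cell. (The B boxes in a cell share the same class.)
➜ At test time, the conditional class probability Pr(Class_i|Object) is multiplied by the confidence (Pr(Object) ∗ IOU) of the box in the cell containing the object center, giving the class confidence.
➜ These class confidences encode both the classification accuracy and how well the box fits the object.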
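The test-time product can be sketched for a single grid cell as follows. The name `class_scores` and the plain-list representation are assumptions for illustration; the numbers are made up.

```python
def class_scores(class_probs, box_confidences):
    """Class confidence for every (box, class) pair in one grid cell.

    class_probs: C conditional probabilities Pr(Class_i | Object), shared by the cell.
    box_confidences: B box confidences Pr(Object) * IOU, one per box.
    Returns B lists of C scores.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# One cell, 3 classes, 2 boxes: both boxes reuse the same class probabilities.
scores = class_scores([0.1, 0.7, 0.2], [0.9, 0.3])
best_class_box0 = max(range(3), key=lambda i: scores[0][i])
```

Because the class probabilities are shared, both boxes rank the classes in the same order and differ only by their confidence factor.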

13
2.Unified Detection
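• Every cell of the 7×7 grid predicts Pr(Object): 1 if an object is present, 0 otherwise.
• If a cell contains an object center, that center cell predicts the confidence (IOU), because localization is trained only on center cells. In practice, inference can still produce several boxes from several cells, so Non-Maximum Suppression removes the redundant boxes (explained in 2.3).
• If a cell contains part of an object, it predicts the class probability Pr(Class_i) (classification is also trained on non-center cells).
• A bounding box is created in the cell containing the object center, and its class confidence = confidence (IOU) × class probability Pr(Class_i).
Figure labels:
All 7×7 cells: object presence, Pr(Object)
The 3 cells containing an object center: confidence, IOU
All cells containing an object: class probability, Pr(Class_i)
The 3 center cells: confidence (IOU) × class probability Pr(Class_i)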

14
2.1.Network Design
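• The output layer produces a 3-D tensor of shape 7×7 (grid cells) × 30.
• The third dimension is (coordinates + confidence) 5 × 2 boxes + 20 classes = 30.
• Each cell has 2 boxes but only 1 class.
• Training the network means training the feature map shown in the figure.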
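A short sketch of how one cell's 30 values break down. The channel order used here (all box values first, then the class probabilities) is an assumption for illustration; the slides give only the counts, and real implementations may order the channels differently.

```python
S, B, C = 7, 2, 20   # grid size, boxes per cell, number of classes
DEPTH = B * 5 + C    # (x, y, w, h, confidence) per box, plus class probabilities


def split_cell(vec):
    """Split one cell's output vector into B boxes and C class probabilities.

    Assumed layout: [x, y, w, h, conf] repeated B times, followed by C class values.
    """
    assert len(vec) == DEPTH
    boxes = [tuple(vec[b * 5:(b + 1) * 5]) for b in range(B)]
    class_probs = list(vec[B * 5:])
    return boxes, class_probs


# Use the indices 0..29 as dummy values to show which slots go where.
boxes, class_probs = split_cell(list(range(DEPTH)))
```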

15
2.1.Network Design
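• The 7×7 (grid cells) × 30 output tensor is illustrated as a 49×30 2-D tensor.
• The cells containing an object center make the predictions (cells 1 and 48).
• Within a center cell, the box with the higher IOU is output: box 1 for cell 1 and box 2 for cell 48. (The network is trained to raise the confidence of the high-IOU box and lower that of the low-IOU box; explained in 2.2.)
• The class probabilities satisfy c1+c2+...+c20=1. If c2 is the highest, the class confidence of the box in cell 1 is F1×c2 and that of the box in cell 48 is F2×c2.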

16
2.2.Training
• YOLO predicts multiple bounding boxes per grid cell. At training
time we only want one bounding box predictor to be responsible for
each object. We assign one predictor to be “responsible” for
predicting an object based on which prediction has the highest
current IOU with the ground truth. This leads to specialization
between the bounding box predictors.
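➜ YOLO predicts 2 bounding boxes per grid cell, but during training only one predictor should be responsible for each object. The predictor whose box currently has the highest IOU with the ground truth is the one used for training. This narrows the multiple bounding boxes down to one and lets each predictor specialize.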
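Selecting the "responsible" predictor amounts to an argmax over IOU, sketched below. The helper names and the corner box format are assumptions for this example.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_index(pred_boxes, truth_box):
    """Return (index, IOU) of the predicted box that best overlaps the ground truth."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


# Two predictions from one cell; the second matches the ground truth exactly.
best, best_iou = responsible_index([(0, 0, 1, 1), (0, 0, 2, 2)], (0, 0, 2, 2))
```

Only the selected box receives the coordinate loss for that object; the other box in the cell is treated as not responsible.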

17
2.2.Training
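• The loss is the sum of five terms: box center coordinates + box size + confidence + no-object confidence + classification.
• The sum runs over every combination of grid cell i and bounding box j within it, regressing the prediction x against the ground truth x^. Indicator functions select which bounding boxes are trained.
Figure labels on the loss, in term order:
Box center coordinates
Box size
Confidence
No-object confidence
Classification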
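For reference, the loss in the YOLO_v1 paper has the form below, with the five terms in the order listed above. The symbol $\mathcal{L}$ is added here only as a label. In the paper, $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, and the hat marks one side of each prediction/ground-truth pair.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```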

18
2.2.Training
• Note that the loss function only penalizes classification error if an
object is present in that grid cell.
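➜ Term 5 (classification) is computed when an object is present in the grid cell, i.e. Pr(Object)=1.
The indicator function is 1 when an object is present in cell i and 0 otherwise.
(Classification applies to every cell in which an object is present.)
Figure label: the loss is 0 when the predicted class probabilities match the ground-truth class.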

19
2.2.Training
• It also only penalizes bounding box coordinate error if that predictor
is “responsible” for the ground truth box. (i.e. has the highest IOU
of any predictor in that grid cell).
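➜ Terms 1, 2, and 3 are computed for the max-IOU box in a cell that contains an object center.
The indicator function is 1 for the combination of cell i containing an object center and box j with the highest IOU, and 0 otherwise. Term 3 is trained with target confidence C^=1.
Figure labels:
Term 1: the loss is 0 when the predicted and ground-truth centers coincide.
Term 2: the loss is 0 when the predicted and ground-truth sizes match.
Term 3: the loss is 0 when the center cell's predicted confidence is 1.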

20
2.2.Training
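• Term 4 is computed for the boxes in a center cell that do not have the highest IOU. In the paper's loss, the same term also covers the boxes of cells that contain no object, which pushes their confidence toward 0.
• The indicator function noobj is 1 for the combination of center cell i and a non-max-IOU box j, and 0 for the max-IOU combination.
• Term 4 is trained with target confidence C^=0.
Figure labels:
The box does not have the highest IOU, so a confidence of 0 is ideal.
The loss is 0 when the predicted confidence is 0.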
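The five terms can be put together in a per-cell sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `cell_loss`, the tuple layout, and the `conf_target` parameter are hypothetical. `conf_target` defaults to 1 to follow these slides; the paper's definition of confidence would put the IOU there instead. The weights 5 and 0.5 are the paper's λ_coord and λ_noobj.

```python
import math

LAMBDA_COORD = 5.0  # weight on coordinate terms (paper value)
LAMBDA_NOOBJ = 0.5  # weight on no-object confidence term (paper value)


def cell_loss(pred_boxes, pred_probs, truth_box=None, truth_probs=None,
              responsible=None, conf_target=1.0):
    """Sum-squared loss for one grid cell.

    pred_boxes: list of (x, y, w, h, C) predictions, with w, h >= 0.
    truth_box: (x, y, w, h) if the cell holds an object, else None.
    responsible: index of the max-IOU predicted box (used only with truth_box).
    """
    loss = 0.0
    for j, (x, y, w, h, c) in enumerate(pred_boxes):
        if truth_box is not None and j == responsible:
            tx, ty, tw, th = truth_box
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)       # term 1
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)  # term 2
            loss += (c - conf_target) ** 2                               # term 3
        else:
            loss += LAMBDA_NOOBJ * c ** 2                                # term 4, target 0
    if truth_box is not None:
        loss += sum((p - t) ** 2 for p, t in zip(pred_probs, truth_probs))  # term 5
    return loss


# A perfect prediction in a center cell, and an empty cell with one overconfident box.
perfect = cell_loss([(0.5, 0.5, 0.2, 0.2, 1.0), (0.1, 0.1, 0.1, 0.1, 0.0)],
                    [0.0, 1.0], truth_box=(0.5, 0.5, 0.2, 0.2),
                    truth_probs=[0.0, 1.0], responsible=0)
empty = cell_loss([(0.0, 0.0, 0.0, 0.0, 0.4), (0.0, 0.0, 0.0, 0.0, 0.0)], [0.5, 0.5])
```

The perfect prediction costs nothing, while the empty cell pays only the down-weighted confidence penalty 0.5 × 0.4² = 0.08.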

21
2.3.Inference
• The grid design enforces spatial diversity in the bounding box
predictions. Often it is clear which grid cell an object falls in to and
the network only predicts one box for each object. However, some
large objects or objects near the border of multiple cells can be well
localized by multiple cells. Non-maximal suppression can be used to
fix these multiple detections.
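➜ The grid design enforces spatial diversity in the predicted bounding boxes. Usually it is clear which grid cell an object's center falls into, and the network predicts one bounding box per object. However, a large object spanning several cells, or an object near the border between cells, can be predicted by multiple cells, giving several bounding boxes for one object. Non-maximal suppression (NMS) is used to prune these boxes.
※ When the IOU between bounding boxes of the same class is at or above a threshold (e.g. 0.5), i.e. they overlap, NMS deletes all of them except the box with the highest class confidence.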
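The NMS rule described above can be sketched as a greedy loop. The detection format `(score, class_id, box)`, the function names, and the default threshold of 0.5 are assumptions for this example; production code would normally rely on a library routine.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS over detections given as (score, class_id, box)."""
    kept = []
    # Visit detections from highest to lowest class confidence.
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        _, cls, box = det
        # Keep the box unless it overlaps an already-kept box of the same class.
        if all(k[1] != cls or iou(k[2], box) < iou_threshold for k in kept):
            kept.append(det)
    return kept


dets = [
    (0.9, 0, (0, 0, 10, 10)),
    (0.8, 0, (1, 1, 11, 11)),    # same class, IOU 81/119 with the first: suppressed
    (0.7, 1, (1, 1, 11, 11)),    # different class: kept despite the overlap
    (0.6, 0, (50, 50, 60, 60)),  # same class, no overlap: kept
]
kept = nms(dets)
```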

22
2.4.Limitation of YOLO
• YOLO imposes strong spatial constraints on bounding box
predictions since each grid cell only predicts two boxes and can only
have one class. This spatial constraint limits the number of nearby
objects that our model can predict. Our model struggles with small
objects that appear in groups, such as flocks of birds.
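➜ YOLO imposes strong spatial constraints on bounding box predictions, because each grid cell can predict only two bounding boxes and one class. This constraint limits the number of nearby objects that YOLO can predict. YOLO struggles with small objects that appear in groups, such as flocks of birds.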

23
2.4.Limitation of YOLO
• Finally, while we train on a loss function that approximates
detection performance, our loss function treats errors the same in
small bounding boxes versus large bounding boxes. A small error in
a large box is generally benign but a small error in a small box has a
much greater effect on IOU. Our main source of error is incorrect
localizations.
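➜ Although the network is trained with a loss function that approximates detection performance, the loss treats errors in large and small boxes the same way. An error that is harmless in a large box lowers the IOU far more in a small box. YOLO's main source of error is incorrect localization.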
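A quick numeric check makes the point: the same 5-pixel shift barely changes the IOU of a 100-pixel box but cuts the IOU of a 10-pixel box to one third. The box sizes and the shift are arbitrary values chosen for illustration.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IOU between a size x size square and the same square shifted right by `shift`."""
    truth = (0, 0, size, size)
    pred = (shift, 0, size + shift, size)
    return iou(truth, pred)


large = shifted_iou(100, 5)  # 9500 / 10500, about 0.905
small = shifted_iou(10, 5)   # 50 / 150, about 0.333
```

The paper partially addresses this by regressing the square roots of width and height rather than the raw values, which makes a given size error count for more in small boxes.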

24
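Summary
• Goals
1. Understand the characteristics of YOLO_v1.
   1. YOLO is extremely fast at 45 fps (see Table 1).
   2. It reasons over the whole image, so it makes few false detections on background.
   3. It learns generalizable representations, so it transfers well to other domains and suits devices.
   4. Each grid cell is limited to at most 2 boxes and 1 class, so YOLO is weak when small objects cluster in one cell or when objects of several classes share a cell.
2. Understand how YOLO_v1 inference works.
   1. The grid cell containing the object's center handles the prediction (number of objects = number of predicting cells).
   2. Each cell predicts 2 boxes, but the higher-confidence box is the one used.
   3. Class confidence = confidence (IOU) × class probability
   ➜ It is the product of localization accuracy and classification accuracy.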

25
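Summary
• Goals
3. Understand how YOLO_v1 training works, especially the loss function.
   1. If a grid cell contains an object, the classification loss is computed.
   2. In a cell containing an object center, the max-IOU box contributes the confidence, box center, and box size losses.
   3. In a cell containing an object center, the boxes that do not have the highest IOU contribute the no-object confidence loss.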

26
References
• https://www.slideshare.net/ssuser07aa33/introduction-to-yolo-detection-model
