Image Segmentation and Deep Learning
PwcHeng
Spring 2016

Classification: "This image is about a bird."
Detection/Localization: "There is a bird inside this box."

Image segmentation (Semantic Segmentation): "The blue region belongs to the bird."

Does the center of this image belong to the rock, or to the bird?

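Max-pooling: each cell of the upper table takes the maximum over a neighborhood $\mathcal{N}(i,j)$ of the lower table:
$T^{(1)}_{i,j} = \max_{(p,q) \in \mathcal{N}(i,j)} T^{(0)}_{p,q}$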
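The max-pooling step can be tried out directly. Below is a minimal plain-Python sketch, assuming a single-channel table stored as a list of rows and a 2 × 2 window with stride 2. The name `max_pool2d` and these sizes are illustrative choices and do not come from the slides.

```python
def max_pool2d(table, k=2, stride=2):
    """Max-pooling on a 2-D list of numbers.

    Output cell (i, j) is the maximum over the k x k window whose top-left
    corner is (i * stride, j * stride) in the lower table (no padding).
    """
    h, w = len(table), len(table[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    return [
        [
            max(table[i * stride + p][j * stride + q]
                for p in range(k) for q in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


lower = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 3, 2],
]
print(max_pool2d(lower))  # [[6, 2], [2, 7]]
```

Each output cell now covers a 2 × 2 region of the lower table, but only the strongest value survives.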

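Convolution: each cell of the upper table is a weighted sum over the channels $c$ and the neighborhood $\mathcal{N}(i,j)$ of the lower table:
$T^{(1)}_{i,j} = \sum_{c} \sum_{(p,q) \in \mathcal{N}(i,j)} \left( W_{c,\,p-i,\,q-j} \cdot T^{(0)}_{c,p,q} \right)$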
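The weighted sum can be written out as a direct loop. This is a minimal sketch, assuming a multi-channel table stored as `[C][H][W]` nested lists, a single output channel, no padding, and no bias. The name `conv2d` and the window layout (the neighborhood starts at `(i * stride, j * stride)`) are illustrative assumptions.

```python
def conv2d(table, weight, stride=1):
    """Single-output-channel convolution (cross-correlation, as in CNNs).

    table:  [C][H][W] lower-layer table
    weight: [C][k][k] kernel
    Computes out[i][j] = sum_c sum_{p,q} weight[c][p][q]
                         * table[c][i*stride + p][j*stride + q].
    """
    c_in = len(table)
    h, w = len(table[0]), len(table[0][0])
    k = len(weight[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    return [
        [
            sum(weight[c][p][q] * table[c][i * stride + p][j * stride + q]
                for c in range(c_in) for p in range(k) for q in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


table = [[[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]]          # one channel, 3 x 3
weight = [[[1, 1],
           [1, 1]]]            # one channel, 2 x 2 kernel of ones
print(conv2d(table, weight))   # [[12, 16], [24, 28]]
```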

Each cell of the top-layer table aggregates information from a large region of the image.

Size of the region each cell aggregates:
Bottom-layer pixel: 1 × 1
Top-layer table: 10 × 10
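How quickly this region grows can be computed layer by layer. Each layer adds (k - 1) × (current spacing between cells) to the side length of the region, and each stride multiplies that spacing. The sketch below assumes a hypothetical stack of layers, not the network shown on the slide, which happens to reach 10 × 10. The name `receptive_field` is illustrative.

```python
def receptive_field(layers):
    """Side length of the input region seen by one cell of the final table.

    layers: list of (kernel_size, stride) pairs, bottom to top.
    r is the current receptive field; jump is the distance, in input pixels,
    between neighbouring cells of the current table.
    """
    r, jump = 1, 1          # a bottom-layer pixel sees only itself: 1 x 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


# Hypothetical stack: conv 3x3, max-pool 2x2/2, conv 3x3, max-pool 2x2/2
stack = [(3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(stack))  # 10
```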

Supervised Learning
Pixel labels: "belongs to dog" / "belongs to background"
CNN
[. . . ]-[Softmax]-[Cross Entropy]

Target: predict "belongs to dog"

Target: predict "belongs to background"
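Per-pixel supervision comes down to applying a softmax and a cross-entropy at every pixel, then averaging over the image. This is a minimal sketch, assuming the CNN has already produced one score per class for each pixel. The class indices (0 = background, 1 = dog) and all function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(score_map, labels):
    """score_map[i][j]: list of class scores for pixel (i, j).
    labels[i][j]: index of that pixel's true class.
    Returns the mean cross-entropy over all pixels."""
    losses = [
        -math.log(softmax(scores)[y])
        for row_scores, row_labels in zip(score_map, labels)
        for scores, y in zip(row_scores, row_labels)
    ]
    return sum(losses) / len(losses)


BACKGROUND, DOG = 0, 1                    # hypothetical class indices
score_map = [[[2.0, 0.0], [0.0, 2.0]]]    # one row, two pixels, two class scores each
labels = [[BACKGROUND, DOG]]
print(round(pixel_cross_entropy(score_map, labels), 4))  # 0.1269
```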

Each image contains height × width training examples (one per pixel).
Hundreds of thousands, or even millions, of CNN passes?
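The pass count follows from the naive per-pixel training set. This sketch assumes a zero-padded 3 × 3 patch around each pixel. The patch size and the name `pixel_patches` are illustrative.

```python
def pixel_patches(image, labels, half=1):
    """Naive per-pixel training set for segmentation.

    For every pixel (i, j), crop the (2*half+1) x (2*half+1) patch centred on it
    (zero-padded at the image border) and pair it with that pixel's label.
    Each pair would be fed through the CNN separately.
    """
    h, w = len(image), len(image[0])

    def at(i, j):
        return image[i][j] if 0 <= i < h and 0 <= j < w else 0

    for i in range(h):
        for j in range(w):
            patch = [[at(i + di, j + dj) for dj in range(-half, half + 1)]
                     for di in range(-half, half + 1)]
            yield patch, labels[i][j]


image = [[1, 2, 3],
         [4, 5, 6]]
labels = [[0, 1, 1],
          [0, 0, 1]]
print(sum(1 for _ in pixel_patches(image, labels)))  # 6 = 2 x 3 examples
print(480 * 360)                                     # 172800 passes for one 480 x 360 image
```

Neighboring patches overlap almost completely, so most of the work in those separate passes is repeated.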

Convolution: a turning point?

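Bottom-layer table (pixels)
Sees the trees but not the forest: each pixel holds information from only a 1 × 1 region
Top-layer table
Low resolution: it has passed through many stride > 1 transformations
Sees the forest but not the trees: it has passed through many max-pooling layers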

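Transpose of Conv.: spreads information out, upsampling the table
Kernel: 4 × 4, Stride: 2, Pad: 1
Output from the top-layer table $T^{\text{top}}$: $Y = \left[\operatorname{conv\,transpose}\left(T^{\text{top}}\right)\right]$
[1], [2]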
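Below is a plain-Python sketch of a single-channel transposed convolution with the slide's settings (kernel 4 × 4, stride 2, pad 1). For readability it assumes no bias and a kernel of ones. In a real network the kernel is learned, and the name `conv_transpose2d` is illustrative.

```python
def conv_transpose2d(table, weight, stride=2, pad=1):
    """Single-channel transposed convolution (no bias).

    Every input cell spreads its value, scaled by the k x k kernel, onto a
    k x k patch of the output, with patches placed `stride` cells apart;
    overlapping contributions are summed. Finally `pad` rows/columns are
    cropped from each border, giving (in - 1) * stride - 2 * pad + k cells.
    """
    h, w = len(table), len(table[0])
    k = len(weight)
    full_h, full_w = (h - 1) * stride + k, (w - 1) * stride + k
    full = [[0] * full_w for _ in range(full_h)]
    for i in range(h):
        for j in range(w):
            for p in range(k):
                for q in range(k):
                    full[i * stride + p][j * stride + q] += table[i][j] * weight[p][q]
    return [row[pad:full_w - pad] for row in full[pad:full_h - pad]]


kernel = [[1] * 4 for _ in range(4)]   # 4 x 4 kernel of ones (learned in practice)
out = conv_transpose2d([[1, 2], [3, 4]], kernel)
print(len(out), len(out[0]))           # 4 4: the 2 x 2 table is upsampled 2x
```

With kernel 4, stride 2, and pad 1, the output size is (n - 1) · 2 - 2 + 4 = 2n, so the table is exactly doubled. In PyTorch the corresponding layer should be `torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1)`.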

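Fusing the lower-layer table into the upper-layer computation
$Y = \left[\operatorname{conv\,transpose}\left(T^{\text{top}}\right)\right] + \left[\operatorname{conv}_{1\times1}\left(T^{\text{low}}\right)\right]$, where $T^{\text{top}}$ is the top-layer table and $T^{\text{low}}$ is a lower-layer table
[1], [2]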

The lower-layer results must be kept until the upper layer uses them, so their memory cannot be freed early.
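The fused prediction can be sketched as follows. For brevity, the learned transposed convolution is replaced by plain nearest-neighbour 2× upsampling. This is a simplification, not what the slide's method uses. All names and sizes are hypothetical.

```python
def conv1x1(table, weights):
    """1 x 1 convolution to a single output channel.

    table: [C][H][W]; weights: [C]. At every position the C channel values are
    mixed by the same weights, so the spatial size is unchanged.
    """
    channels, h, w = len(table), len(table[0]), len(table[0][0])
    return [[sum(weights[c] * table[c][i][j] for c in range(channels))
             for j in range(w)]
            for i in range(h)]


def upsample2x_nearest(table):
    """Stand-in for the learned 4 x 4 / stride 2 / pad 1 transposed convolution:
    each cell is simply copied into a 2 x 2 block."""
    return [[table[i // 2][j // 2] for j in range(2 * len(table[0]))]
            for i in range(2 * len(table))]


def fuse(top, low, low_weights):
    """Upsampled top-layer scores plus a 1 x 1 projection of the lower table.

    `low` must still be in memory here, which is why the lower-layer results
    cannot be freed before the upper layer has used them.
    """
    up = upsample2x_nearest(top)
    skip = conv1x1(low, low_weights)
    return [[a + b for a, b in zip(row_up, row_skip)]
            for row_up, row_skip in zip(up, skip)]


top = [[1]]                        # 1 x 1 top-layer table
low = [[[1, 0], [0, 1]],           # 2-channel, 2 x 2 lower-layer table
       [[2, 2], [2, 2]]]
print(fuse(top, low, [1, 0.5]))    # [[3.0, 2.0], [2.0, 3.0]]
```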

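Symmetric lower/upper structure: the upper (decoder) part of the architecture is deeper, giving it more learning capacity
Enlarge the table according to the max positions: the upper-layer computation incorporates compact lower-layer information
Indices of Max
$Y = \left[\operatorname{unpool}\left(T^{\text{top}},\ \operatorname{indices-of-max}\left(T^{\text{low}}\right)\right)\right]$, where $T^{\text{top}}$ is the upper table being enlarged and $T^{\text{low}}$ is the lower table that was max-pooled
[3], [4]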
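Max-unpooling needs only the positions recorded during pooling, which is why the lower-layer information it carries is so compact. The sketch assumes 2 × 2 pooling with stride 2 and a hypothetical upper-layer table `decoder_table`. The function names are illustrative.

```python
def max_pool_with_indices(table, k=2):
    """k x k max-pooling with stride k that also records where each max was found."""
    pooled, indices = [], []
    for i in range(0, len(table) - k + 1, k):
        vals, locs = [], []
        for j in range(0, len(table[0]) - k + 1, k):
            window = [(table[i + p][j + q], (i + p, j + q))
                      for p in range(k) for q in range(k)]
            value, loc = max(window, key=lambda item: item[0])
            vals.append(value)
            locs.append(loc)
        pooled.append(vals)
        indices.append(locs)
    return pooled, indices


def unpool(table, indices, out_h, out_w):
    """Write each value back at its recorded max position; all other cells are 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for vals, locs in zip(table, indices):
        for value, (r, c) in zip(vals, locs):
            out[r][c] = value
    return out


low = [[1, 3, 2, 0],
       [4, 6, 1, 1],
       [0, 2, 5, 7],
       [1, 1, 3, 2]]
pooled, idx = max_pool_with_indices(low)      # pooled == [[6, 2], [2, 7]]
decoder_table = [[10, 20], [30, 40]]          # a hypothetical upper-layer table
print(unpool(decoder_table, idx, 4, 4))
# [[0, 0, 20, 0], [0, 10, 0, 0], [0, 30, 0, 40], [0, 0, 0, 0]]
```

Only one coordinate pair per pooled cell has to be stored, rather than the whole lower-layer table.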

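Inference on one 480 × 360 image on a Titan: ≈ 50 ms
The model segments small objects poorly
The model cannot distinguish different instances of the same class
The predictions are occasionally noisy and not very smooth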

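First find each object's position and size, then predict from the information in the object's region
Different objects of the same class are processed one at a time
The input the model processes is almost entirely filled by the object
Aggregates information from within the object's extent
[5]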

Each box aggregates information independently, so one image needs many convolution passes
Selective Search, Edge Boxes, . . . cannot be trained together with the CNN
[5]

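Region Proposal Network (RPN)
Uses local information (the CNN feature table) to decide whether a region contains an object
All operations are differentiable: the RPN can be trained jointly with the other components
[Figure: anchor#1, anchor#2, . . .]
[. . . ]-[Softmax]-[Cross Entropy]: is there an object resembling the anchor?
[. . . ]-[Smooth L1]: convert the anchor into an object box (x, y, w, h)
[6], [7]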
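The regression branch, [. . . ]-[Smooth L1], needs a way to express a box relative to an anchor. This is a minimal sketch using the parameterization described in the Faster R-CNN paper [6], assuming (x, y) is the box centre. The names `encode`, `decode`, and `smooth_l1` are illustrative.

```python
import math


def encode(box, anchor):
    """Regression targets for `box` relative to `anchor`; both are (x, y, w, h)
    with (x, y) the box centre."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Inverse of encode: turn predicted offsets back into an (x, y, w, h) box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def smooth_l1(pred, target):
    """Sum of smooth-L1 terms: quadratic for |d| < 1, linear beyond."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total


anchor = (50.0, 50.0, 20.0, 40.0)
box = (60.0, 40.0, 40.0, 40.0)
targets = encode(box, anchor)   # (0.5, -0.25, log 2, 0.0)
print(decode(targets, anchor))  # approximately (60.0, 40.0, 40.0, 40.0)
```

Because the loss is linear for large errors, a badly placed anchor produces a bounded gradient rather than an exploding one.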

Non-maximum Suppression is used to merge the predictions from all regions; in some cases it cannot correctly find every overlapping object (an interesting idea: [7])
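This is a minimal sketch of greedy non-maximum suppression. It assumes boxes given as (x1, y1, x2, y2) corners and an IoU threshold of 0.5, both of which are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box only
    if it overlaps every already-kept box by at most `threshold` IoU."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Suppose boxes 0 and 1 were actually two different people standing close together. The lower-scoring one would still be discarded, which is the overlapping-object failure case mentioned on the slide.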

RoI Warping Layer (RoI Pooling Layer)
Warps CNN feature tables of different heights and widths to a fixed height and width
The [CNN]-[RoI Warping]-[Classifier, Regressor, . . . ] combination can therefore handle inputs of different sizes
[8], [9], [10], [11]

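When an object proposal is very small, the information inside it is condensed into a single cell of the upper-layer feature table; enlarging that one cell just copies the same information over and over (an interesting idea: [11])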
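The sketch below implements the RoI pooling variant, which takes a max over each bin. It is not the differentiable RoI warping used in [8]. The bin-splitting rule and the name `roi_max_pool` are assumptions made for illustration. The last example shows the small-proposal case: a single-cell RoI is simply copied into every output bin.

```python
def roi_max_pool(table, roi, out_size=2):
    """Max-pool the region roi = (top, left, bottom, right) of a 2-D table
    (bottom/right exclusive, in table cells) into an out_size x out_size grid.

    Bin b along an axis of length n covers [floor(b*n/out), ceil((b+1)*n/out)),
    so every bin contains at least one cell even when the RoI is tiny.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left

    def bin_range(b, length):
        start = (b * length) // out_size
        end = -((-(b + 1) * length) // out_size)   # ceiling division
        return range(start, end)

    return [
        [max(table[top + y][left + x]
             for y in bin_range(bi, height) for x in bin_range(bj, width))
         for bj in range(out_size)]
        for bi in range(out_size)
    ]


table = [[1, 3, 2, 0],
         [4, 6, 1, 1],
         [0, 2, 5, 7],
         [1, 1, 3, 2]]
print(roi_max_pool(table, (0, 0, 4, 4)))  # [[6, 2], [2, 7]]
print(roi_max_pool(table, (0, 0, 3, 4)))  # [[6, 2], [6, 7]]
print(roi_max_pool(table, (2, 3, 3, 4)))  # [[7, 7], [7, 7]]: one cell copied into every bin
```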

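Multi-task Network Cascades
Treats image segmentation as three steps: "find object boxes", "segment the foreground inside each box", and "predict the class of the foreground object"
All sub-tasks share the same CNN feature table
CNN
RPN: find the candidate object boxes in each region
NMS: merge the boxes from all regions
Classifier: is each position inside the box foreground?
Classifier: what class is the foreground?
[Box]-[RoI Warping]
Ignores background information
[8]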

Inference on one image (shorter side 600 pixels) on a K40: 360 ms. Image source: [8]

Conditional Random Field
"Car, road" and "sky, airplane" often appear together
A CRF can be implemented as an RNN and trained together with the CNN [12]
[Figure: Conv3, dilation = 1; Conv3, dilation = 2]

Predict based on information from the whole image
Sweep an RNN over the CNN feature tables of all regions [13], [14]

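Dilated Convolution [15]
With only a small increase in the number of parameters, the model can quickly aggregate information over a wide region
Can replace some stride > 1 operations, preserving the resolution of the feature table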
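Dilation spaces a kernel's taps apart, so the covered region grows without adding weights. The sketch uses 1-D for brevity and assumes 3-tap kernels, stride 1, and a hypothetical doubling dilation schedule. The function names are illustrative.

```python
def dilated_conv1d(signal, weights, dilation=1):
    """Valid (unpadded) 1-D convolution whose k taps are `dilation` samples apart."""
    k = len(weights)
    span = (k - 1) * dilation + 1      # input samples covered by one output
    return [sum(weights[t] * signal[i + t * dilation] for t in range(k))
            for i in range(len(signal) - span + 1)]


def stacked_receptive_field(dilations, k=3):
    """Receptive field of stride-1, k-tap layers stacked with the given dilations."""
    r = 1
    for d in dilations:
        r += (k - 1) * d
    return r


print(dilated_conv1d(list(range(7)), [1, 1, 1], dilation=2))  # [6, 9, 12]
# Four 3-tap layers have 12 weights either way, but cover very different ranges:
print(stacked_receptive_field([1, 1, 1, 1]))  # 9
print(stacked_receptive_field([1, 2, 4, 8]))  # 31
```

The output length stays close to the input length because no stride is used. The table keeps its resolution while each cell sees a wider region.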

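Summary
Predict each pixel's class by aggregating the "local information around" that pixel
Max-pooling and stride > 1 operations may be ill-suited to image segmentation
Only by summarizing global information can the model make comprehensive, consistent predictions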

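The amount of data has a clear effect on performance [16]. Image source: [17]
Labeled data is currently hard to obtain, so semi-supervised learning is quite important [17], [18]
Integrating knowledge from outside the image (such as text) should take image segmentation to the next level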

References
[1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3431–3440. doi: 10.1109/CVPR.2015.7298965. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2015.7298965.
[2] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, 2014, pp. 818–833. doi: 10.1007/978-3-319-10590-1_53. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-10590-1_53.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” CoRR, vol. abs/1511.00561, 2015. [Online]. Available: http://arxiv.org/abs/1511.00561.
[4] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” CoRR, vol. abs/1511.02680, 2015. [Online]. Available: http://arxiv.org/abs/1511.02680.
[5] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1520–1528. doi: 10.1109/ICCV.2015.178. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2015.178.
[6] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015. [Online]. Available: http://arxiv.org/abs/1506.01497.
[7] R. Stewart and M. Andriluka, “End-to-end people detection in crowded scenes,” CoRR, vol. abs/1506.04878, 2015. [Online]. Available: http://arxiv.org/abs/1506.04878.
[8] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” CoRR, vol. abs/1512.04412, 2015. [Online]. Available: http://arxiv.org/abs/1512.04412.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, 2014, pp. 346–361. doi: 10.1007/978-3-319-10578-9_23. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-10578-9_23.
[10] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2013. doi: 10.1109/TPAMI.2012.231. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2012.231.
[11] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1529–1537. doi: 10.1109/ICCV.2015.179. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2015.179.
[13] Z. Yan, H. Zhang, Y. Jia, T. Breuel, and Y. Yu, “Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation,” CoRR, vol. abs/1603.04871, 2016. [Online]. Available: http://arxiv.org/abs/1603.04871.
[14] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. C. Courville, and Y. Bengio, “ReNet: A recurrent neural network based alternative to convolutional networks,” CoRR, vol. abs/1505.00393, 2015. [Online]. Available: http://arxiv.org/abs/1505.00393.
[15] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” CoRR, vol. abs/1511.07122, 2015. [Online]. Available: http://arxiv.org/abs/1511.07122.
[16] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla, “SceneNet: Understanding real world indoor scenes with synthetic data,” CoRR, vol. abs/1511.07041, 2015. [Online]. Available: http://arxiv.org/abs/1511.07041.
[17] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation,” CoRR, vol. abs/1604.05144, 2016. [Online]. Available: http://arxiv.org/abs/1604.05144.
[18] P. H. O. Pinheiro and R. Collobert, “From image-level to pixel-level labeling with convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1713–1721. doi: 10.1109/CVPR.2015.7298780. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2015.7298780.