Recurrent Instance Segmentation
with Linguistic Referring Expressions
Alba María Herrera Palacio
ADVISORS:
Xavier Giró-i-Nieto
Carles Ventura
Carina Silberer
MASTER THESIS DISSERTATION, September 2019
INTRODUCTION
INTRODUCTION | MOTIVATION
Natural Language Expressions
PREVIOUS WORK [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
INTRODUCTION | VIDEO OBJECT SEGMENTATION
[Diagram: two video object segmentation pipelines unrolled over time: one-shot RVOS [2], which starts from a first-frame mask, and the alternative studied here, which starts from a referring expression such as "the woman".]
[2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
INTRODUCTION | IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS
REFERRING EXPRESSIONS
- Word or phrase
- Unambiguous
- Any form of linguistic description
Examples: "male reading book with fanny pack on waist", "far right girl white shirt", "woman with phone", "left woman in blue"
IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS
METHODOLOGY | GENERAL RECURRENT ARCHITECTURE
[Architecture diagram: a REFERRING EXPRESSIONS ENCODER and an IMAGE ENCODER feed a Joint Representation, from which a MASK DECODER predicts the instance masks. Example expressions: "male reading book with fanny pack on waist", "far right girl white shirt", "woman with phone", "left woman in blue".]
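To make the data flow concrete, here is a minimal PyTorch sketch of how the three components could be wired together; the module internals are placeholders, and only the overall wiring comes from the diagram above.

    import torch.nn as nn

    # Minimal sketch of the overall data flow; component internals are
    # placeholders, only the wiring is taken from the architecture diagram.
    class ReferringSegmenter(nn.Module):
        def __init__(self, image_encoder, expression_encoder, mask_decoder):
            super().__init__()
            self.image_encoder = image_encoder            # image -> visual features
            self.expression_encoder = expression_encoder  # expression -> embedding
            self.mask_decoder = mask_decoder              # joint features -> masks

        def forward(self, image, expression):
            visual = self.image_encoder(image)
            language = self.expression_encoder(expression)
            # The decoder fuses both streams into the joint representation
            # and predicts the instance mask from it.
            return self.mask_decoder(visual, language)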
METHODOLOGY | PROPOSED ARCHITECTURE
[Architecture diagram with the REFERRING EXPRESSION ENCODER highlighted]
METHODOLOGY | REFERRING EXPRESSION ENCODER
REFERRING EXPRESSION ENCODER
- BERT embedding: pooled output or encoded layers
- Dimensionality reduction: none, linear layer, or PCA
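A minimal sketch of the pooled-output variant with a linear reduction, using the HuggingFace transformers API; the library choice and the 128-dimensional output size are assumptions, the slide only fixes the set of options above.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()

    inputs = tokenizer("left woman in blue", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)

    pooled = outputs.pooler_output      # (1, 768) sentence-level vector
    tokens = outputs.last_hidden_state  # (1, seq_len, 768) encoded layers

    # Optional dimensionality reduction; a learned linear layer is one of
    # the three options (none / linear layer / PCA). 128 is an assumed size.
    reduce = torch.nn.Linear(768, 128)
    embedding = reduce(pooled)          # (1, 128)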
METHODOLOGY | PROPOSED ARCHITECTURE
[Architecture diagram with the IMAGE ENCODER highlighted]
METHODOLOGY | IMAGE ENCODER
IMAGE ENCODER: RVOS [2] encoder, producing multi-resolution visual features
[2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
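As a sketch, multi-resolution features can be read off a ResNet backbone at the end of each stage; RVOS builds on a ResNet encoder, but the exact backbone and feature taps below are assumptions.

    import torch
    import torchvision

    # Sketch of a multi-resolution image encoder over a ResNet-101
    # backbone (load pretrained weights in practice).
    class MultiResEncoder(torch.nn.Module):
        def __init__(self):
            super().__init__()
            r = torchvision.models.resnet101()
            self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
            self.stages = torch.nn.ModuleList(
                [r.layer1, r.layer2, r.layer3, r.layer4])

        def forward(self, x):
            x = self.stem(x)
            feats = []
            for stage in self.stages:  # output strides 4, 8, 16, 32
                x = stage(x)
                feats.append(x)        # one feature map per resolution
            return feats

    feats = MultiResEncoder()(torch.randn(1, 3, 256, 448))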
METHODOLOGY | PROPOSED ARCHITECTURE
[Architecture diagram with the MASK DECODER highlighted]
METHODOLOGY | MASK DECODER
LANGUAGE & VISION FUSION
[Diagram: a 1×M language embedding is combined with the width×height×depth visual feature volume into a 1×N joint representation.]
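One common way to realize this fusion, and a plausible reading of the diagram, is to tile the language vector over the spatial grid and concatenate it with the visual channels; the 1×1 projection at the end and all sizes are assumptions.

    import torch

    def fuse(visual, language):
        """Tile a (B, M) language embedding over a (B, D, H, W) feature map
        and concatenate along the channel axis -> (B, D + M, H, W)."""
        B, _, H, W = visual.shape
        tiled = language[:, :, None, None].expand(-1, -1, H, W)
        return torch.cat([visual, tiled], dim=1)

    visual = torch.randn(2, 512, 16, 28)  # width x height x depth features
    language = torch.randn(2, 128)        # 1 x M expression embedding
    joint = fuse(visual, language)
    # A 1x1 convolution can then project the joint volume to N channels.
    joint = torch.nn.Conv2d(512 + 128, 256, kernel_size=1)(joint)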
METHODOLOGY | MASK DECODER
RVOS [2] SPATIAL RECURRENCE
[Diagram: the decoder recurrence unrolls over space, predicting one instance mask per recurrence step.]
[2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
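A simplified sketch of the idea: a ConvLSTM cell is unrolled over object instances, emitting one mask per step. The cell design and sizes are assumptions; see the RVOS paper [2] for the full decoder.

    import torch
    import torch.nn as nn

    # Standard ConvLSTM cell (simplified relative to the RVOS decoder).
    class ConvLSTMCell(nn.Module):
        def __init__(self, in_ch, hid_ch):
            super().__init__()
            self.hid_ch = hid_ch
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

        def forward(self, x, state):
            h, c = state
            i, f, o, g = torch.chunk(
                self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    def decode_instances(feat, n_instances, cell, to_mask):
        # One recurrence step per instance: the hidden state carries what
        # has already been segmented, so each step moves to a new object.
        B, _, H, W = feat.shape
        h = feat.new_zeros(B, cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        masks = []
        for _ in range(n_instances):
            h, c = cell(feat, (h, c))
            masks.append(torch.sigmoid(to_mask(h)))
        return torch.stack(masks, dim=1)  # (B, n_instances, 1, H, W)

    cell = ConvLSTMCell(in_ch=256, hid_ch=128)
    to_mask = nn.Conv2d(128, 1, kernel_size=1)
    masks = decode_instances(torch.randn(2, 256, 16, 28), 4, cell, to_mask)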
EXPERIMENTS | IMAGE DATASET
RefCOCO by UNC
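For reference, a sketch of iterating over RefCOCO with the refer toolkit released by UNC; the API details here are from memory of that toolkit and should be treated as assumptions.

    # Assumes the `refer` toolkit (https://github.com/lichengyunc/refer)
    # and the RefCOCO annotations downloaded under data/.
    from refer import REFER

    refer = REFER(data_root="data", dataset="refcoco", splitBy="unc")
    for ref_id in refer.getRefIds(split="val"):
        ref = refer.loadRefs(ref_id)[0]
        expressions = [s["sent"] for s in ref["sentences"]]
        mask = refer.getMask(ref)["mask"]  # binary mask of the referent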
EXPERIMENTS | QUANTITATIVE RESULTS
RefCOCO - Embedding configurations
EXPERIMENTS | QUANTITATIVE RESULTS
RefCOCO - Order of referents and batch size
EXPERIMENTS | QUALITATIVE RESULTS
[Predicted masks for "man on the left", "right gal", "left horse" and "right horse". Color legend: true positives, false negatives, false positives.]
EXPERIMENTS | FAILURE CASE
[Failure examples for "sitting guy with cake", "woman with ponytail", "woman wearing red" and "green shirt thanks for playing". Color legend: true positives, false negatives, false positives.]
TOWARDS VIDEO
METHODOLOGY | VIDEO BASELINE ARCHITECTURE
MAttNet [3] + RVOS
[3] Licheng Yu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR 2018
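The baseline chains the two pretrained models: roughly, MAttNet [3] grounds the referring expression in the first frame, and one-shot RVOS [2] propagates the resulting mask through the video. A sketch of the pipeline; mattnet_segment and rvos_one_shot are hypothetical wrappers for this illustration, not real APIs.

    # Hypothetical wrappers around the two pretrained models.
    def mattnet_segment(frame, expression):
        ...  # MAttNet [3]: expression -> mask on a single image

    def rvos_one_shot(frames, first_frame_mask):
        ...  # RVOS [2] in one-shot mode: propagate the mask over time

    def segment_video(frames, expression):
        # Stage 1: ground the referring expression in the first frame.
        mask = mattnet_segment(frames[0], expression)
        # Stage 2: propagate that mask through the rest of the video.
        return rvos_one_shot(frames, mask)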
EXPERIMENTS | VIDEO DATASET
DAVIS 2017
+ EXPRESSIONS BY KHOREVA [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
EXPERIMENTS | VIDEO QUALITATIVE RESULTS
[MAttNet + RVOS masks on frames 0, 2 and 4 for the referring expressions "a brown deer on the left" and "a brown deer on the right with branched horns".]
EXPERIMENTS | ADDITIONAL RESULTS
[Quantitative results, plus a MAttNet failure case showing the first frame, the ground truth and the MAttNet mask for the referring expressions "a white golf car", "a man in a black tshirt" and "two golf sticks".]
FUTURE WORK & CONCLUSIONS
FUTURE WORK | FIRST STEPS
[Diagram: extend the spatial recurrence of the mask decoder with a temporal recurrence across frames.]
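A sketch of that first step, reusing ConvLSTMCell from the mask-decoder sketch: the per-instance ConvLSTM state is carried across frames, so the recurrence runs over time as well as over instances. This is a simplification under stated assumptions, not the full RVOS spatio-temporal decoder.

    def decode_video(frame_feats, n_instances, cell, to_mask):
        # frame_feats: list of (B, D, H, W) encoder outputs, one per frame.
        states, outputs = None, []
        for feat in frame_feats:                   # temporal recurrence
            B, _, H, W = feat.shape
            if states is None:
                states = [(feat.new_zeros(B, cell.hid_ch, H, W),
                           feat.new_zeros(B, cell.hid_ch, H, W))
                          for _ in range(n_instances)]
            masks = []
            for i in range(n_instances):           # spatial recurrence
                h, c = cell(feat, states[i])
                states[i] = (h, c)                 # carried to the next frame
                masks.append(torch.sigmoid(to_mask(h)))
            outputs.append(torch.stack(masks, dim=1))
        return outputs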
CONCLUSIONS | THESIS SCOPE
1. Follows the broader trend of solving multimodal tasks with deep neural networks.
2. Of interest to both the computer vision and the natural language processing communities.
3. Promising results show that the architecture learns to take the language information into account.
4. Experiments compare the referring-expression approach against a single-modality baseline.
5. The image architecture is designed to extend to video by adding temporal recurrence and training on video data.
CONCLUSIONS | VIDEO BASELINE PUBLICATION
Workshop on Multimodal Understanding and Learning for Embodied Applications
Thank you for your attention. Do you have any questions? Feel free to ask!
Special thanks to my advisors and the UPF COLT group members for their support.
APPENDICES
APPENDIX I | COMPARISONS
RefCOCO - With & without referring expressions
APPENDIX II | COMPARISON STATE OF THE ART
RefCOCO - State-of-the-art
