
Recurrent Instance Segmentation with Linguistic Referring Expressions

Sep. 17, 2019



  1. Recurrent Instance Segmentation with Linguistic Referring Expressions. Alba María Herrera Palacio. ADVISORS: Xavier Giró-i-Nieto, Carles Ventura, Carina Silberer. MASTER THESIS DISSERTATION, September 2019
  2. INTRODUCTION
  3. INTRODUCTION | MOTIVATION: Natural Language Expressions. PREVIOUS WORK: [1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
  4. INTRODUCTION | VIDEO OBJECT SEGMENTATION. [Diagram: the one-shot setup of RVOS [2] vs. the same model driven by the referring expression "the woman"] [2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
  5. INTRODUCTION | IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS. REFERRING EXPRESSIONS: a word or phrase, unambiguous, in any form of linguistic description. Examples: "male reading book with fanny pack on waist", "far right girl white shirt", "woman with phone", "left woman in blue"
  6. IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS
  7. METHODOLOGY | GENERAL RECURRENT ARCHITECTURE. [Diagram: a referring expression encoder and an image encoder produce a joint representation, which a mask decoder turns into masks] Example expressions: "male reading book with fanny pack on waist", "far right girl white shirt", "woman with phone", "left woman in blue"
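To make the data flow on this slide concrete, here is a minimal Python sketch of how the three blocks could be wired together. Every name is a placeholder (the component sketches follow on the next slides), not the thesis code.

    def segment(image, expression, expr_encoder, image_encoder, fuse, mask_decoder):
        # Encode each modality separately, fuse them into a joint
        # representation, then decode segmentation masks from it.
        lang_emb = expr_encoder(expression)   # 1 x M language embedding
        visual = image_encoder(image)[-1]     # coarsest multi-resolution feature map
        joint = fuse(visual, lang_emb)        # joint language-vision volume
        return mask_decoder(joint)            # one mask per referent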
  8. METHODOLOGY | PROPOSED ARCHITECTURE: REFERRING EXPRESSION ENCODER
  9. METHODOLOGY | REFERRING EXPRESSION ENCODER. [Diagram: BERT embedding (pooled output or encoded layers) followed by dimensionality reduction (none, linear layer, or PCA)]
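A minimal sketch of these encoder options using the Hugging Face transformers library: the BERT embedding is taken either from the pooled output or by averaging the encoded layers, and an optional linear layer reduces its dimensionality (the slide also lists PCA or no reduction). The output size and other details are illustrative assumptions, not the thesis configuration.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class ReferringExpressionEncoder(nn.Module):
        def __init__(self, out_dim=128, use_pooled=True):
            super().__init__()
            self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.use_pooled = use_pooled
            # One of the dimensionality-reduction options: a learned linear layer.
            self.reduce = nn.Linear(self.bert.config.hidden_size, out_dim)

        def forward(self, expression):
            tokens = self.tokenizer(expression, return_tensors="pt")
            outputs = self.bert(**tokens)
            if self.use_pooled:
                emb = outputs.pooler_output              # (1, 768) pooled [CLS] vector
            else:
                emb = outputs.last_hidden_state.mean(1)  # average of the encoded layer
            return self.reduce(emb)                      # (1, out_dim)

    z = ReferringExpressionEncoder()("left woman in blue")  # one vector per expression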
  10. METHODOLOGY | PROPOSED ARCHITECTURE: IMAGE ENCODER
  11. METHODOLOGY | IMAGE ENCODER: the RVOS [2] encoder, which provides multi-resolution visual features. [2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
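A sketch of an image encoder exposing multi-resolution features, in the spirit of the RVOS encoder; the ResNet-50 backbone and the choice of stages below are assumptions for illustration, not the exact RVOS configuration.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ImageEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            resnet = models.resnet50()
            self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                      resnet.relu, resnet.maxpool)
            self.stages = nn.ModuleList([resnet.layer1, resnet.layer2,
                                         resnet.layer3, resnet.layer4])

        def forward(self, image):
            # Keep one feature map per resolution so the decoder can mix
            # fine spatial detail with coarse semantics.
            x, feats = self.stem(image), []
            for stage in self.stages:          # 1/4, 1/8, 1/16, 1/32 resolution
                x = stage(x)
                feats.append(x)
            return feats

    feats = ImageEncoder()(torch.randn(1, 3, 256, 256))  # four feature maps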
  12. METHODOLOGY | PROPOSED ARCHITECTURE: MASK DECODER
  13. METHODOLOGY | MASK DECODER: LANGUAGE & VISION FUSION. [Diagram: fusion of the 1×M language embedding with the width × height × depth visual features]
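One plausible reading of the fusion diagram, sketched below: tile the 1×M language embedding at every spatial location of the width × height × depth visual features and concatenate along the channel axis. Concatenation is an assumption here; the slide only fixes the tensor shapes involved.

    import torch

    def fuse(visual_feats, lang_emb):
        # visual_feats: (B, depth, H, W); lang_emb: (B, M)
        B, _, H, W = visual_feats.shape
        # Broadcast the language vector over every spatial position.
        tiled = lang_emb[:, :, None, None].expand(B, lang_emb.size(1), H, W)
        return torch.cat([visual_feats, tiled], dim=1)   # (B, depth + M, H, W)

    joint = fuse(torch.randn(2, 256, 32, 32), torch.randn(2, 128))  # (2, 384, 32, 32)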
  14. METHODOLOGY | MASK DECODER: SPATIAL RECURRENCE over space, as in RVOS [2]. [2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
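The spatial recurrence can be pictured with the simplified sketch below: a convolutional GRU-style cell (an illustrative stand-in for the RVOS decoder) is unrolled over object instances rather than time, emitting one mask per step from the same fused features. Channel sizes and the instance limit are assumptions.

    import torch
    import torch.nn as nn

    class SpatialRecurrentDecoder(nn.Module):
        def __init__(self, in_ch, hid_ch, max_instances=10):
            super().__init__()
            self.max_instances = max_instances
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)
            self.to_mask = nn.Conv2d(hid_ch, 1, 1)

        def forward(self, fused):                        # fused: (B, in_ch, H, W)
            B, _, H, W = fused.shape
            h = fused.new_zeros(B, self.to_mask.in_channels, H, W)
            masks = []
            for _ in range(self.max_instances):          # recurrence over instances
                zr = torch.sigmoid(self.gates(torch.cat([fused, h], dim=1)))
                z, r = zr.chunk(2, dim=1)
                h_new = torch.tanh(self.cand(torch.cat([fused, r * h], dim=1)))
                h = (1 - z) * h + z * h_new              # GRU-style state update
                masks.append(torch.sigmoid(self.to_mask(h)))
            return torch.stack(masks, dim=1)             # (B, max_instances, 1, H, W)

    masks = SpatialRecurrentDecoder(384, 64)(torch.randn(2, 384, 32, 32))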
  15. EXPERIMENTS | IMAGE DATASET: RefCOCO (by UNC)
  16. EXPERIMENTS | QUANTITATIVE RESULTS: RefCOCO, embedding configurations
  17. EXPERIMENTS | QUANTITATIVE RESULTS: RefCOCO, order of referents and batch size
  18. EXPERIMENTS | QUALITATIVE RESULTS. [Figure legend: true positives, false negatives, false positives] Expressions: "man on the left", "right gal", "left horse", "right horse"
  19. EXPERIMENTS | FAILURE CASE. [Figure legend: true positives, false negatives, false positives] Expressions: "sitting guy with cake", "woman with ponytail", "woman wearing red", "green shirt thanks for playing"
  20. TOWARDS VIDEO
  21. METHODOLOGY | VIDEO BASELINE ARCHITECTURE: MAttNet [3] + RVOS. [3] Licheng Yu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR 2018
  22. EXPERIMENTS | VIDEO DATASET: DAVIS 2017 + expressions by Khoreva [1]. [1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
  23. EXPERIMENTS | VIDEO QUALITATIVE RESULTS (MAttNet + RVOS). Referring expressions: "a brown deer on the left", "a brown deer on the right with branched horns". [Figure: frames 0, 2, and 4]
  24. EXPERIMENTS | ADDITIONAL QUANTITATIVE RESULTS: FAILURE CASE (MAttNet). [Figure: first frame, ground truth, MAttNet mask] Referring expressions: "a white golf car", "a man in a black tshirt", "two golf sticks"
  25. CONCLUSIONS & FUTURE WORK
  26. FUTURE WORK | FIRST STEPS. [Diagram: combining the spatial recurrence with a temporal recurrence]
  27. CONCLUSIONS | THESIS SCOPE
      1. Follows a global trend of solving multimodal tasks with deep neural networks.
      2. Of interest to both the computer vision and natural language processing communities.
      3. Promising results show that the architecture learns to take language information into account.
      4. Experiments compare the referring-expression approach against single-modality baselines.
      5. The image architecture is designed to extend to video by adding temporal recurrence and training on video data.
  28. CONCLUSIONS | VIDEO BASELINE PUBLICATION: Workshop on Multimodal Understanding and Learning for Embodied Applications
  29. Thank you for your attention. Do you have any questions? Feel free to ask! Special thanks to my advisors and the UPF COLT group members for their support.
  30. APPENDICES
  31. APPENDIX I | COMPARISONS: RefCOCO, with & without referring expressions
  32. APPENDIX II | COMPARISON WITH THE STATE OF THE ART: RefCOCO