Achieving human parity on visual question answering alicemind

ACHIEVING HUMAN PARITY ON VISUAL QUESTION ANSWERING (Alicemind)
Deep Learning Paper Reading Group, NLP Team. 2022.02.13
주정헌 (presenter), 최상우, 김지연, 백지윤, 민지원

Background
1. Visual Question Answering
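• The VQA Challenge is one of the workshops at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) and is run every year through the VQA homepage.
• -> Held annually since CVPR 2016; each year it evaluates the past year's technical progress and awards prizes.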

Background
2. Status in Korea
• 2016: Naver Labs, 2nd place
• Seoul National University, Prof. 장병탁 (Byoung-Tak Zhang)'s team: 2nd place

Background
3. Prior work
1) Typically, a CNN is used to understand the image and an RNN or a similar model to understand the question, after which the answer is derived.
2) A great deal of research has been done on English data, whereas little has been done for Korean.
VQA: Visual Question Answering (ICCV 2015)
Yin and Yang: Balancing and Answering Binary Visual Questions (CVPR 2016)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)

VQA Modeling for Alicemind
1. Reaching Human Parity (the goal is human-level modeling)
- Answer open-ended questions!
2. Comprehensive Feature Representations
- Do feature engineering for cross-modal learning!
3. Pretraining
- Pretrain vision-language representations through cross-modal interaction.
- Propose and test a better pretraining scheme!
4. Knowledge-guided Mixture of Experts (MoE)
- Even after pretraining, there may be things the model has never or only rarely seen, so learn those separately!

1. Human parity
1. Beyond existing approaches
2. To capture the diversity of visual signals
3. Combine visual representations
   1. Region feature: detects objects region by region
   2. Grid feature: detects and captures the background over a grid
   3. Patch feature: captures other image characteristics patch by patch
4. Efficiently bridge the semantic gap between vision and language
   - Single-stream architecture (adopted)
   - Dual-stream architecture (not adopted)
5. MoE paradigm
   - Text Reading Expert
   - Clock Reading Expert

2. Cross-modal learning
1. Visual Feature
Region Feature: better localization of individual objects; captures detailed semantics
Bottom-up attention: identifies salient regions
(Small regions are grouped into relevant parts and attended over to build the region features.)
Grid Feature: captures global information of the image (background information). Freely low-resolution images.
When the grid is flattened into an HW x C feature map, a linear projection layer reduces the channel dimension.
Patch Feature: fixed-size patches are passed through a transformer (ViT).
This 1) simplifies the convolution operations when used together with grid-based features, and 2) lets self-attention learn a structure that can describe the full image.
2. Textual Feature
- BERT embeddings
The question is encoded as the sentence itself; answers are marked with [CLS] so that they are learned separately.
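
To make the patch feature concrete, here is a minimal sketch of how an image can be cut into fixed-size patches before a ViT-style transformer sees them. The function name `patchify`, the toy 4 x 4 image, and the patch size of 2 are illustrative assumptions, not values from the paper; a real model would also apply a learned linear embedding to each flattened patch.

```python
def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping patch x patch tiles.

    Returns the flattened tiles in row-major order, i.e. the token sequence
    that would be fed to the (omitted) linear embedding layer.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


# Toy single-channel image with pixel values 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

Each of the four tokens covers one 2 x 2 corner of the image, so self-attention over the token sequence can relate any part of the full image to any other.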

3. Visual & Language Pretraining
Single-stream Architecture
- Performs aligned attention to learn a joint representation (V: Vision, L: Language)
Score matrix
Sub-matrices
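
The relationship between the score matrix and its sub-matrices can be sketched as follows. In a single-stream model the visual and language tokens are concatenated, so one attention score matrix contains four blocks: V-V, V-L, L-V and L-L. This is only an illustration under simplifying assumptions: the token vectors are made up, and the tokens are used directly as queries and keys, with no learned projections, scaling, or softmax.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_matrix(tokens):
    """Unnormalized attention scores: S[i][j] = <token_i, token_j>."""
    return [[dot(q, k) for k in tokens] for q in tokens]


def split_blocks(scores, num_visual):
    """Slice the joint score matrix into its V-V, V-L, L-V, L-L sub-matrices."""
    n = num_visual
    return {
        "VV": [row[:n] for row in scores[:n]],
        "VL": [row[n:] for row in scores[:n]],
        "LV": [row[:n] for row in scores[n:]],
        "LL": [row[n:] for row in scores[n:]],
    }


visual = [[1.0, 0.0], [0.0, 1.0]]                  # 2 hypothetical visual tokens
language = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]    # 3 hypothetical text tokens
scores = score_matrix(visual + language)           # 5 x 5 joint score matrix
blocks = split_blocks(scores, num_visual=len(visual))
print(blocks["VL"])  # [[1.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
```

The diagonal blocks (V-V, L-L) correspond to attention within one modality and the off-diagonal blocks (V-L, L-V) to attention across modalities, which is what lets a single stream model both at once.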

3. Visual & Language Pretraining
Tasks
1) Masked LM Prediction: same as in BERT
2) Masked Object Prediction: objects are masked at random
3) Image-Text Matching: image-text pairs are randomly matched or mismatched
4) Image Question Answering: a classification problem over image QA data
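
As an illustration of task 3, the following sketch builds image-text matching examples by randomly replacing a caption with one from a different pair. The function name `make_itm_examples`, the file names, the captions, and the 50% mismatch probability are all assumptions made for the example; the slides do not state the ratio that was actually used.

```python
import random


def make_itm_examples(pairs, mismatch_prob=0.5, seed=0):
    """Build (image, text, label) triples; label 1 = matched, 0 = mismatched."""
    rng = random.Random(seed)
    examples = []
    for index, (image, text) in enumerate(pairs):
        if len(pairs) > 1 and rng.random() < mismatch_prob:
            # Swap in the caption that belongs to a different image.
            other = rng.choice([i for i in range(len(pairs)) if i != index])
            examples.append((image, pairs[other][1], 0))
        else:
            examples.append((image, text, 1))
    return examples


# Hypothetical image-caption pairs.
pairs = [("img_cat.jpg", "a cat on a sofa"),
         ("img_dog.jpg", "a dog in a park"),
         ("img_clock.jpg", "a clock on a wall")]
for example in make_itm_examples(pairs):
    print(example)
```

A binary classifier on top of the joint representation is then trained to predict the label, which pushes the model to align image content with text.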

4. Knowledge-guided Mixture of Experts
1) Text Reading Expert
StructuralLM: text extracted by an OCR model is used for VQA.
2) Clock Reading Expert
- Clock detector: binary classification with bounding boxes
- Clock reader: ResNet backbone, channel-wise attention, spatial attention (SE layer)
(Since a clock face has 12 hours, the problem is recast as 12-category classification to compute the loss.)
Putting Together
1) In the MoE, distinguish the main task from the sub-tasks.
2) Pass the input through a kind of gating network.
(Whether to route to multiple experts or to a single expert is decided following the Switch Transformer.)
3) AliceMind then classifies the input into one of three tasks in total and prepares its answer accordingly.
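
The 12-category formulation of the hour can be sketched as a plain softmax cross-entropy. The mapping of 12 o'clock to class 0 and the all-zero logits are assumptions chosen for illustration; the slides do not say how the classes are indexed or what the reader network outputs.

```python
import math


def softmax(logits):
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hour_to_class(hour):
    """Map an hour on the dial (1..12) to a class index 0..11."""
    if not 1 <= hour <= 12:
        raise ValueError("hour must be in 1..12")
    return hour % 12                         # assumption: 12 o'clock -> class 0


def cross_entropy(logits, target_class):
    return -math.log(softmax(logits)[target_class])


logits = [0.0] * 12                          # hypothetical, uninformative output
loss = cross_entropy(logits, hour_to_class(3))
print(round(loss, 4))                        # 2.4849, i.e. ln(12)
```

With uniform logits every hour gets probability 1/12, so the loss equals ln(12); training lowers it by raising the logit of the correct hour.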
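
The routing step can be sketched as top-1 gating, which is how the Switch Transformer sends each input to a single expert. The expert names and the gate logits below are hypothetical, and the real gating network that would produce the logits from the image and question is omitted.

```python
import math

# Hypothetical names for the three tasks mentioned on the slide.
EXPERTS = ["general_vqa", "text_reading", "clock_reading"]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(gate_logits):
    """Return (expert name, gate probability) for the highest-scoring expert."""
    probs = softmax(gate_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EXPERTS[best], probs[best]


# Assumed gate output for a question such as "What does the sign say?".
expert, prob = route_top1([0.2, 2.5, -1.0])
print(expert)  # text_reading
```

Routing to a single expert keeps inference cost close to that of one model, at the price of relying on the gate to classify the question type correctly.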

(Alicemind-MMU Architecture)

5. Pretraining datasets
• MS COCO: Image Captioning
• Visual Genome: Visual Question Answering
• VQA2.0: Visual Question Answering

6. Fine-tuning datasets
• VQA: 10 human-annotated answers per question, several questions per image
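
The 10 human answers per question are what the commonly used VQA accuracy metric is built on: a predicted answer gets full credit if at least 3 annotators gave it. The sketch below is a simplified version of that metric; it skips the answer normalization and the averaging over annotator subsets done by the official evaluation script, and the example answers are made up.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for "How many people are in the picture?".
human_answers = ["2"] * 2 + ["3"] * 7 + ["4"]
print(vqa_accuracy("3", human_answers))  # 1.0
print(vqa_accuracy("2", human_answers))  # 2 of 3 required matches
print(vqa_accuracy("5", human_answers))  # 0.0
```

Because humans themselves disagree, this soft score is also what makes a "human parity" comparison possible: human performance is measured with the same metric.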
