1
DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The Neuro-Symbolic Concept Learner: Interpreting Scenes,
Words, and Sentences From Natural Supervision
Kazuki Fujikawa, DeNA
Summary
• Bibliographic information
– The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences
From Natural Supervision
• ICLR 2019
• Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu
• Overview
– Proposes a framework that, within end-to-end training on Visual QA, learns to recognize
object concepts and logic as separate modules
• Only question-answer pairs are required as supervision
– Experiments demonstrate the following properties of the proposed method:
• It is data-efficient, reaching high accuracy from a small amount of training data
• Rather than merely outputting an answer, it can make the reasoning process behind the answer explicit
2
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
3
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
4
Background
• Recognizing the concepts tied to objects (attributes such as color and shape) is important
– When humans answer complex VQA questions, they separate concept information from
logic (e.g., counting)
– Likewise, if a machine learning model can learn and output concept information and logic
separately, it may improve in both data efficiency and interpretability
5
[Figure 1 of the paper: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions (e.g., "What's the color of the object?", "Is there any cube?"), and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as), e.g., "How many objects are right of the red object?". III. Finally, we can interpret complex questions from visual cues by exploiting the compositional structure, e.g., "How many objects are both right of the green cylinder and have the same material as the small blue ball?".]
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
6
Related Work
• Positioning of this work relative to prior approaches (an execution sketch follows the figure note below)
7
– End-to-End (Hudson+ 2018, Mascharka+ 2018, etc.): module separation ×, interpretability △;
supervision: image, question → answer
– Via programs (Yi+ 2018): module separation ○, interpretability ○;
supervision: image → concepts, question → program, concepts + program → answer
– This work: module separation ○, interpretability ○;
supervision: image, question → answer
[Slide diagrams, one per approach in the comparison above, drawn over fragments of Figure 4 of the paper:
– Via programs (Yi+ 2018): Input 1 (image) → NN → intermediate output 1: concepts, a symbolic table (ID 1: Green Cube, ID 2: Red Sphere); Input 2 (question "Q: What is the shape of the red object?") → NN → intermediate output 2: program (Filter(Red) → Query(Shape)); concepts + program → NN → answer ("A: Box").
– End-to-End: image and question → NN → answer, with no intermediate outputs.
– This work: image → NN → intermediate output 1: vectors (per-object features); question → NN → intermediate output 2: program (Filter(Red) → Query(Shape)); → answer.]
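To make the middle column above concrete, here is a minimal sketch of program-based execution in the spirit of Yi+ 2018: a hand-written program, Filter(Red) → Query(Shape), is run against a fully symbolic concept table. The `Obj` record, the `run_program` helper, and the two-operator DSL are hypothetical simplifications; in Yi+ 2018 both the concept table and the program are predicted by separately supervised networks.

```python
# Hypothetical sketch of program-based VQA execution (in the spirit of Yi+ 2018).
# The concept table and the program would each be predicted by a neural network;
# here both are hard-coded to show only the symbolic execution step.
from dataclasses import dataclass

@dataclass
class Obj:
    color: str
    shape: str

def run_program(program, scene):
    """Execute a list of (operator, argument) steps against a symbolic scene."""
    selected = scene                      # start from all objects in the scene
    for op, arg in program:
        if op == "Filter":                # keep objects matching the concept
            selected = [o for o in selected if arg in (o.color, o.shape)]
        elif op == "Query":               # read an attribute of the referent
            assert len(selected) == 1, "Query expects a unique referent"
            return getattr(selected[0], arg)
    return selected

scene = [Obj("Green", "Cube"), Obj("Red", "Sphere")]   # intermediate output 1
program = [("Filter", "Red"), ("Query", "shape")]      # intermediate output 2
print(run_program(program, scene))                     # -> Sphere
```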
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
8
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 1. Supervised learning of the visual and concept embedding spaces (the program generator is frozen)
9
[Slide diagram: Input 1 (image) → Mask R-CNN → ResNet-34 → Visual Feature Space (Obj:1, Obj:2); Input 2 (question "Q: What is the shape of the red object?") → BiGRU-GRU → program Filter(Red) → Query(Shape); Color Embedding Space (Red) and Shape Embedding Space (Cylinder, Sphere, Box); predicted answer "A: Box" vs. ground truth "Sphere"; the error is backpropagated (BP) into the embedding spaces.]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method
[Dong+, 2016] (see the parser sketch after this slide)
③ Obtain the concept embeddings required by the first line of the program (color embeddings)
④ Execute the Filter operation (restrict to the object with the highest cosine similarity to "Red")
⑤ Obtain the concept embeddings required by the second line of the program (shape embeddings)
⑥ Execute the Query operation (retrieve the shape with the highest cosine similarity to Obj: 2)
and output it as the prediction (see the executor sketch below)
⑦ Backpropagate the error against the ground truth and update the embedding spaces
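As a reference point for step ②, the sketch below shows an encoder-decoder parser in the style of [Dong+, 2016]: a bidirectional GRU encodes the question and a GRU decoder greedily emits program tokens. The vocabulary sizes, the missing attention mechanism, and the fixed program length are illustrative simplifications, not the paper's configuration.

```python
# Minimal sketch of a BiGRU-GRU semantic parser (question -> program tokens),
# assuming toy vocabularies; NS-CL's actual parser follows [Dong+, 2016].
import torch
import torch.nn as nn

class Seq2SeqParser(nn.Module):
    def __init__(self, n_words, n_ops, d=64):
        super().__init__()
        self.embed = nn.Embedding(n_words, d)
        self.encoder = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(d, 2 * d)      # state = both encoder directions
        self.op_embed = nn.Embedding(n_ops, d)   # embeddings of program operators
        self.out = nn.Linear(2 * d, n_ops)

    def forward(self, question_ids, max_len=4):
        _, h = self.encoder(self.embed(question_ids))        # h: (2, B, d)
        state = torch.cat([h[0], h[1]], dim=-1)              # (B, 2d)
        tok = torch.zeros(question_ids.size(0), dtype=torch.long)  # <start> = 0
        program = []
        for _ in range(max_len):                             # greedy decoding
            state = self.decoder(self.op_embed(tok), state)
            tok = self.out(state).argmax(dim=-1)
            program.append(tok)
        return torch.stack(program, dim=1)                   # (B, max_len)

parser = Seq2SeqParser(n_words=100, n_ops=10)
question = torch.randint(0, 100, (1, 8))  # "What is the shape of the red object?"
print(parser(question))                   # token ids, e.g. Filter(Red), Query(Shape)
```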
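Steps ③–⑥ are sketched below with randomly initialized tensors standing in for the learned spaces: Filter keeps the object whose visual feature is most cosine-similar to the "Red" concept embedding, and Query names the shape concept most similar to the selected object. The hard argmax is a simplification for readability; the paper executes these operators with differentiable soft masks so that step ⑦ can backpropagate the answer loss into the embedding spaces.

```python
# Minimal sketch of cosine-similarity Filter/Query over concept embeddings.
# Random tensors stand in for Mask R-CNN + ResNet-34 features and for the
# learned concept embeddings; the real model uses soft, differentiable masks.
import torch
import torch.nn.functional as F

D = 64
torch.manual_seed(0)
obj_feats = torch.randn(2, D)            # visual features for Obj:1, Obj:2 (①)

color_concepts = {"Red": torch.randn(D), "Green": torch.randn(D)}        # ③
shape_concepts = {"Cube": torch.randn(D), "Sphere": torch.randn(D),      # ⑤
                  "Cylinder": torch.randn(D)}

def filter_op(feats, concept):           # ④ keep the best-matching object
    sims = F.cosine_similarity(feats, concept.unsqueeze(0), dim=1)
    return feats[sims.argmax()]

def query_op(feat, concept_space):       # ⑥ name the best-matching concept
    names = list(concept_space)
    sims = torch.stack([F.cosine_similarity(feat, concept_space[n], dim=0)
                        for n in names])
    return names[sims.argmax()]

selected = filter_op(obj_feats, color_concepts["Red"])   # Filter(Red)
print(query_op(selected, shape_concepts))                # Query(Shape)
```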
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 2. Reinforcement learning of the program generator (the concept embeddings are frozen)
18
[Slide diagram: same pipeline as in step 1, but the feedback path is labeled "Reinforce" instead of "BP": the correctness of the predicted answer ("A: Box" vs. ground truth "Sphere") serves as the reward.]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method
[Dong+, 2016]
③ Obtain the concept embeddings required by the first line of the program (color embeddings)
④ Execute the Filter operation (restrict to the object with the highest cosine similarity to "Red")
⑤ Obtain the concept embeddings required by the second line of the program (shape embeddings)
⑥ Execute the Query operation (retrieve the shape with the highest cosine similarity to Obj: 2)
and output it as the prediction
⑦ Use correct/incorrect as the reward and update the program generation policy with
REINFORCE (see the sketch below)
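Step ⑦ of this phase is sketched below with a one-operator "program policy": a token is sampled from the parser's output distribution, a 0/1 reward is computed from the answer, and the log-probability of the sample is reinforced. The reward function and the single categorical step are stand-ins for the full parser and executor, and the usual moving-average baseline is omitted.

```python
# Minimal sketch of the REINFORCE update for the program generator: only the
# parser's parameters (here, `logits`) receive gradients; the executor and
# concept embeddings stay frozen. The reward function is a stand-in.
import torch

torch.manual_seed(0)
logits = torch.randn(1, 10, requires_grad=True)    # policy over 10 program ops

def execute_and_check(op):                         # frozen executor + answer
    return 1.0 if op.item() == 3 else 0.0          # check, as a 0/1 reward

dist = torch.distributions.Categorical(logits=logits)
op = dist.sample()                                 # sample a program (one op)
reward = execute_and_check(op)
loss = -dist.log_prob(op) * reward                 # REINFORCE objective
loss.sum().backward()                              # gradient only in the parser
print(op.item(), reward, logits.grad.abs().sum().item())
```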
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• Training proceeds by alternating between 1 and 2
– Following a curriculum learning scheme, the difficulty of the problems is raised step by step
(see the schedule sketch after the figure note below)
20
[Figure 4 of the paper. A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers; scenes and questions of different complexities are illustrated to the learner in an incremental manner, starting from a DSL and executor (Lesson 1: object-based questions, e.g. "What is the shape of the red object?"; Lesson 2: relational questions, e.g. "How many cubes are behind the sphere?"; Lesson 3: more complex questions; Deploy: complex scenes, complex questions). B. Illustrative execution of NS-CL (Step 1: visual parsing into Obj 1–4; Steps 2, 3: semantic parsing and program execution, e.g. Filter(Green Cube) → Relate(Left) → Filter(Red), Filter(Purple Matte), AEQuery(Shape) → No (0.98)): the perception module parses visual scenes into object-based deep representations, the semantic parser parses sentences into executable programs, and a symbolic execution process bridges the two modules.]
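The overall schedule might be organized as in the skeleton below: lessons of increasing difficulty, each alternating between phase 1 (supervised concept learning with the parser frozen) and phase 2 (REINFORCE for the parser with the embeddings frozen). All class and method names here are placeholders, not the authors' code.

```python
# Hypothetical skeleton of the alternating curriculum schedule of Figure 4.
class DummyNSCL:
    """Stand-in exposing the two alternating update steps as no-ops."""
    def freeze_parser(self): pass
    def freeze_embeddings(self): pass
    def concept_step(self, *batch): pass   # phase 1: backprop answer loss (BP)
    def parser_step(self, *batch): pass    # phase 2: REINFORCE on 0/1 reward

LESSONS = ["object-based", "relational", "more complex"]

def train(model, data_by_lesson, epochs_per_lesson=2):
    for lesson in LESSONS:                 # raise question difficulty gradually
        for _ in range(epochs_per_lesson):
            model.freeze_parser()          # 1. supervised concept learning
            for batch in data_by_lesson[lesson]:
                model.concept_step(*batch)
            model.freeze_embeddings()      # 2. policy learning for the parser
            for batch in data_by_lesson[lesson]:
                model.parser_step(*batch)

train(DummyNSCL(), {l: [("image", "question", "answer")] for l in LESSONS})
```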
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
21
Experiments: Quantitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– A question-answering dataset over scenes of multiple placed objects such as spheres and cubes
– Train: 70K, Valid: 15K, Test: 15K
– Experiments were run using both the full training set (70K) and a subset (5K)
– Sufficient performance is obtained even with 10% of the data, showing the method's data efficiency
22

QA accuracy on CLEVR (%):
            IEP    MAC    TbD    NS-CL
Overall     96.9   98.9   99.1   99.6
10% data    48.3   85.5   54.7   98.9

Figures from:
http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL-poster.pdf
http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL.pptx
Experiments: Qualitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– One advantage of the proposed method is that it can make the decision process leading to an answer explicit
• When it answers incorrectly, we can see why it went wrong
23
[Figure 11 of the paper: Visualization of the execution trace generated by the Neuro-Symbolic Concept Learner on the CLEVR dataset.
– Example A (success): "Do the cyan cylinder that is behind the gray cylinder and the gray cylinder have the same material?" → Filter(Gray Cylinder) → Relate(Behind) → Filter(Cyan Cylinder), Filter(Gray Cylinder) → AEQuery(Material) → Yes (0.92).
– Example B (success): "There is a small blue object that is to the right of the small red matte object; what shape is it?" → Filter(Small Red Matte Object) → Relate(Right) → Filter(Small Blue Object) → Query(Shape) → Cube (0.85).
– Example C (failure): "What is the color of the big box left of the blue metal cylinder?" → execution aborts at the first operator ("No such object found!"); reading out the visual representation of the nearest object reveals why: Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small ✓.
– Example D (ambiguous program): "What is the color of the big metal object?" → Filter(Big Metal Object) → Query(Color) → execution aborts ("Ambiguous Referral!").]
Experiments: Qualitative Evaluation
• Experiment: VQS dataset [Gan+, 2017]
– The method is also applicable to real-world images
• Because CLEVR is generated programmatically, program annotations can also be produced, whereas
annotating real-world images with programs is costly
• Since the proposed method needs no program annotations, it can also be applied to real-world images
24
[Execution traces on VQS:
– Example A: "How many zebras are there?" → Filter(Zebra) → Count → 3 ✓
– Example B: "What is the sharp object on the table?" → Filter(Table) → Relate(On) → Filter → Query(What) → Knife (0.85) ✓
– Example C: "What kind of desert is plated?" → Filter(Desert, Plated) → Query(Kind) → Cake (0.68) ✓
– Example D: "What are the kids doing?" → Filter(Kids) → Query(What) → Playing_Frisbee (0.70) ✕ (ground truth: Playing_Baseball)]
Conclusion
• Proposed a framework that, within end-to-end training on Visual QA, learns to recognize
object concepts and logic as separate modules
– Only question-answer pairs are required as supervision
• Experiments demonstrated the following properties of the proposed method:
– It is data-efficient, reaching high accuracy from a small amount of training data
– Rather than merely outputting an answer, it can make the reasoning process behind the answer explicit
25
References
• Mao, Jiayuan, et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and
Sentences from Natural Supervision." in Proc. of ICLR, 2019.
• Hudson, Drew A., et al. "Compositional Attention Networks for Machine Reasoning." in Proc. of ICLR,
2018.
• Mascharka, David, et al. "Transparency by Design: Closing the Gap Between Performance and
Interpretability in Visual Reasoning." in Proc. of CVPR, 2018.
• Yi, Kexin, et al. "Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language
Understanding." in Proc. of NeurIPS, 2018.
• Johnson, Justin, et al. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary
Visual Reasoning." in Proc. of CVPR, 2017.
• Gan, Chuang, et al. "VQS: Linking Segmentations to Questions and Answers for Supervised
Attention in VQA and Question-Focused Semantic Segmentation." in Proc. of ICCV, 2017.
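• Dong, Li, and Mirella Lapata. "Language to Logical Form with Neural Attention." in Proc. of ACL,
2016. (the encoder-decoder parser cited above as [Dong+, 2016])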
26

More Related Content

What's hot

[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション
Deep Learning JP
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
Deep Learning JP
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
Deep Learning JP
 
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Deep Learning JP
 
強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心
Shota Imai
 
[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations
Deep Learning JP
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
Deep Learning JP
 
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
Deep Learning JP
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learning
Yusuke Uchida
 
【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?
Masanao Ochi
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Deep Learning JP
 
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
Deep Learning JP
 
[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019
Deep Learning JP
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNA
Kento Doi
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
Deep Learning JP
 
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
Deep Learning JP
 
[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms
Deep Learning JP
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
Taiji Suzuki
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
Deep Learning JP
 
[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks
Deep Learning JP
 

What's hot (20)

[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
 
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心
 
[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
 
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learning
 
【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
 
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
 
[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNA
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
 
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
 
[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
 
[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks
 

Similar to [DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
cvpaper. challenge
 
深層学習入門
深層学習入門深層学習入門
深層学習入門
Danushka Bollegala
 
Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )
cvpaper. challenge
 
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
Deep Learning JP
 
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
Deep Learning JP
 
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
Deep Learning JP
 
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation 「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Takumi Ohkuma
 
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Deep Learning JP
 

Similar to [DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision (8)

Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
 
深層学習入門
深層学習入門深層学習入門
深層学習入門
 
Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )
 
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
 
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
 
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
 
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation 「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 

More from Deep Learning JP

【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
Deep Learning JP
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
Deep Learning JP
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
Deep Learning JP
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
Deep Learning JP
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
Deep Learning JP
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
Deep Learning JP
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
Deep Learning JP
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
Deep Learning JP
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
Deep Learning JP
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
Deep Learning JP
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
Deep Learning JP
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
Deep Learning JP
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
Deep Learning JP
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
Deep Learning JP
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
Deep Learning JP
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
Deep Learning JP
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
Deep Learning JP
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
Deep Learning JP
 
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
Deep Learning JP
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
Deep Learning JP
 

More from Deep Learning JP (20)

【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
 
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
 

Recently uploaded

【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow
Sony - Neural Network Libraries
 
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdfFIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance
 
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
Matsushita Laboratory
 
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
harmonylab
 
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
Fukuoka Institute of Technology
 
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdfFIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance
 
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
atsushi061452
 
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
atsushi061452
 
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdfFIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance
 
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
iPride Co., Ltd.
 
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアルLoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
CRI Japan, Inc.
 
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
yassun7010
 
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdfFIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdfFIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance
 
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
NTT DATA Technology & Innovation
 

Recently uploaded (15)

【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow
 
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdfFIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdf
 
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
 
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
 
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
 
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdfFIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
 
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
 
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
 
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdfFIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
 
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
 
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアルLoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
 
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
 
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdfFIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
 
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdfFIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
 
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
 

[DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

  • 1. 1 DEEP LEARNING JP [DL Papers] http://deeplearning.jp/ The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision Kazuki Fujikawa, DeNA
  • 2. サマリ • 書誌情報 – The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision • ICLR2019 • Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu • 概要 – Visual QAの問題に対するEnd-to-End学習の中で、物体のコンセプトやロジックの認識を 分離して学習する枠組みを提案 • 教師データは質問と回答のペアのみ必要とする – 実験で提案手法の以下の特性を示した • データ効率が良いアルゴリズムであり、少量データで高精度に到達することを実験で示した • 単に回答を出力するのではなく、回答に至るプロセスを明示できることを示した 2
  • 3. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 3
  • 4. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 4
  • 5. 背景 • 物体に紐づくコンセプト(色・形などの属性)を認識することは重要 – 人間がVQAの複雑な質問に答える場合、コンセプト情報とロジック(カウント作業など) を分離して考える – 機械学習モデルも同様で、コンセプト情報とロジックを分離して学習・出力できると、 データ効率・解釈性の面で改善できる可能性がある 5 Published asaconference paper at ICLR 2019 Q: What’s the color of the object? A: Red. Q: Is there any cube? A: Yes. Q: What’s the color of the object? A: Green. Q: Is there any cube? A: Yes. Q: How many objects are right of the red object? A: 2. Q: How many objects have the same material as the cube? A: 2 Q: How many objects are both right of the green cylinder and have the same material as the small blue ball? A: 3 I. Learning basic, object-based concepts. II. Learning relational concepts based on referential expressions. III. Interpret complex questions from visual cues. Figure 1: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions, and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as). III Finally, we can interpret complex questions from visual cues by exploiting thecompositional structure.
  • 6. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 6
  • 7. 関連研究 • 関連研究と本研究の位置付け 7 End-to-End Programを介するアプローチ 本研究 Hudson+ 2018, Mascharka+ 2018, etc. Yi+ 2018 モジュール分離 × ○ ○ 解釈性 △ ○ ○ 教師データ 画像, 質問文 → 回答 画像 → コンセプト、質問文 → プログラム コンセプト, プログラム → 回答 画像, 質問文 → 回答 Obj:2 Published as aconference paper at ICLR 2019 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions. Lesson2: Relational questions. Lesson3: More complex questions. Deploy: complex scenes, complex questions Q: Does the red object left of the green cube have the same shape as the purple matte thing? 1 3 Obj 1 Obj 2 Obj 3 Obj 4 Step1: Visual Parsing Step2, 3: Semantic Parsing and Program Filter Green Cub Program Representations Relate Object 2 Left Filter Red Filter Purple Ma AEQuery Object 1 Object 3 Shape Concept A. Curriculum concept learning B. Illustrative execution of NS- Figure4: A. Demonstration of thecurriculum learning of visual concepts, words, and sema of sentences by watching images and reading paired questions and answers. Scenes and q 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN 中間出力1: コンセプト ID Color Shape 1 Green Cube 2 Red Sphere NN 中間出力2: プログラム NN Filter(Red) ↓ Query(Shape) Published as aconference paper at ICLR 2019 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions. Lesson2: Relational questions. Lesson3: More complex questions. Deploy: complex scenes, complex questions Q: Does the red object left of the green cube have the same shape as the purple matte thing? 1 2 3 4 Obj 1 Obj 2 Obj 3 Obj 4 Step1: Visual Parsing Step2, 3: Semantic Parsing and Program Execution Filter Green Cube Program Representations Outputs Relate Object 2 Left Filter Red Filter Purple Matte AEQuery Object 1 Object 3 Shape No (0.98) Concepts A. Curriculum concept learning B. Illustrative execution of NS-CL Figure4: A. Demonstration of thecurriculum learning of visual concepts, words, and semantic parsing 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN Published as aconference paper at ICLR 201 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions Lesson2: Relational questions. Lesson3: More complex questions Deploy: complex scenes, complex A. Curriculum concept learning Figure4: A. Demonstration of thecurriculum of sentences by watching images and reading 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN 中間出力1: ベクトル NN 中間出力2: プログラム NN Filter(Red) ↓ Query(Shape) Obj:1 Green Red
  • 8. Outline
    • Background
    • Related Work
    • Proposed Method
    • Experiments & Results
  • 9. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Stage 1: supervised learning of the visual and concept embedding spaces (program generator fixed)
    [Slide diagram: Input 1 (image) → Mask R-CNN + ResNet-34 → visual feature space (Obj 1, Obj 2); Input 2 (question "What is the shape of the red object?") → BiGRU-GRU → program Filter(Red) → Query(Shape); color embedding space (Red) and shape embedding space (Cylinder, Sphere, Box); prediction A: Box vs. ground truth Sphere → backprop (BP)]
    ① Mask R-CNN detects object regions in the image; ResNet-34 extracts a visual feature for each object
    ② An encoder-decoder (BiGRU-GRU) semantic parser [Dong+, 2016] generates a program from the question
    ③ Retrieve the concept embedding required by line 1 of the program (the color embedding)
    ④ Execute Filter (restrict to the object with the highest cosine similarity to "Red")
    ⑤ Retrieve the concept embedding required by line 2 of the program (the shape embedding)
    ⑥ Execute Query (pick the shape with the highest cosine similarity to Obj 2) and output it as the prediction (steps ③-⑥ are sketched in code below)
    ⑦ Backpropagate the error against the ground-truth answer and update the embedding spaces
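  A minimal PyTorch sketch of how the Filter/Query execution in steps ③-⑥ could work, assuming cosine similarity against learned concept embeddings. All names (filter_op, query_op, obj_feats, ...) are illustrative rather than the authors' code, and soft masks stand in for the hard argmax so that the backprop in step ⑦ stays possible (the paper likewise uses differentiable scores).

```python
import torch
import torch.nn.functional as F

def filter_op(obj_feats, obj_mask, concept_emb, tau=0.25):
    """Keep objects similar to a concept (e.g., "Red") in its embedding space.
    obj_feats:   (N, D) per-object features projected into the concept space
    obj_mask:    (N,)   soft mask over objects from the previous program line
    concept_emb: (D,)   learned embedding of the queried concept
    """
    sim = F.cosine_similarity(obj_feats, concept_emb.unsqueeze(0), dim=-1)
    score = torch.sigmoid(sim / tau)   # sharpen into per-object probabilities
    return obj_mask * score            # intersect with the incoming mask

def query_op(obj_feats, obj_mask, answer_embs):
    """Read out an attribute (e.g., Shape) of the currently selected object.
    answer_embs: (K, D) embeddings of the K candidate answers
                 (e.g., Cylinder / Sphere / Box)
    """
    attn = F.softmax(obj_mask, dim=0)        # softly select one object
    selected = attn @ obj_feats              # (D,) feature of that object
    sim = F.cosine_similarity(answer_embs, selected.unsqueeze(0), dim=-1)
    return F.softmax(sim, dim=0)             # distribution over answers

# Program "Filter(Red) -> Query(Shape)" on a toy 2-object scene:
N, D = 2, 64
color_feats, shape_feats = torch.randn(N, D), torch.randn(N, D)
red_emb = torch.randn(D, requires_grad=True)        # step 3: color embedding
shape_embs = torch.randn(3, D, requires_grad=True)  # step 5: shape embeddings
mask = filter_op(color_feats, torch.ones(N), red_emb)       # step 4
answer_probs = query_op(shape_feats, mask, shape_embs)      # step 6
loss = F.nll_loss(answer_probs.log().unsqueeze(0), torch.tensor([1]))
loss.backward()  # step 7: gradients flow into the embedding spaces
```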
  • 18. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Stage 2: reinforcement learning of the program generator (concept embeddings fixed)
    [Slide diagram: same pipeline as Stage 1, but the learning signal flows back into the BiGRU-GRU parser via REINFORCE rather than backprop]
    ①-⑥ Same forward pass as in Stage 1
    ⑦ Using correct / incorrect as the reward, update the program-generation policy with REINFORCE (see the sketch below)
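  A hedged sketch of the Stage-2 REINFORCE update in step ⑦. Here parser.sample and executor are assumed interfaces standing in for the seq2seq parser and the (frozen) quasi-symbolic executor, and the constant baseline is a simplification of the variance reduction used in practice.

```python
import torch

def reinforce_step(parser, executor, optimizer, question, scene, gold_answer,
                   baseline=0.5):
    # Sample a program from the seq2seq parser; log_prob is the summed
    # log-probability of the sampled tokens (assumed parser API).
    program, log_prob = parser.sample(question)

    # Execute against the frozen concept embeddings; no gradients flow
    # through the executor in this stage.
    with torch.no_grad():
        predicted = executor(program, scene)

    reward = 1.0 if predicted == gold_answer else 0.0

    # REINFORCE: descend on -(reward - baseline) * log pi(program | question).
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```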
  • 20. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Training proceeds by alternating Stage 1 and Stage 2
    – A curriculum-learning scheme gradually raises question difficulty: Lesson 1 object-based questions ("What is the shape of the red object?") → Lesson 2 relational questions ("How many cubes are behind the sphere?") → Lesson 3 more complex questions → deployment on complex scenes and questions (see the training-loop sketch below)
    [Paper Figure 4: A. curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers, with scenes and questions of increasing complexity; B. illustrative execution of NS-CL: Step 1 visual parsing into objects, Steps 2-3 semantic parsing and program execution (Filter → Relate → Filter → AEQuery → "No (0.98)")]
  • 21. Outline
    • Background
    • Related Work
    • Proposed Method
    • Experiments & Results
  • 22. Experiments: Quantitative Evaluation
    • Dataset: CLEVR [Johnson+, 2017]
    – Question answering over rendered scenes containing multiple objects such as spheres and cubes
    – Train: 70K, Valid: 15K, Test: 15K
    – Experiments were run with the full training set (70K) and with a subset (5K)
    – The method performs well even with 10% of the data, demonstrating its data efficiency
    QA accuracy on CLEVR (overall):   IEP 96.9 / MAC 98.9 / TbD 99.1 / NS-CL 99.6
    QA accuracy with 10% of the data: IEP 48.3 / MAC 85.5 / TbD 54.7 / NS-CL 98.9
    Figures from: http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL-poster.pdf and http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL.pptx
  • 23. Experiments: Qualitative Evaluation
    • Dataset: CLEVR [Johnson+, 2017]
    – One merit of the proposed method is that it can expose the decision process leading to the answer
    • When the model answers incorrectly, we can see why it went wrong (a sketch of the abort behaviour follows)
    [Paper Figure 11: execution traces of NS-CL on CLEVR. Examples A and B succeed (A: Filter(Gray Cylinder) → Relate(Behind) → Filter(Cyan Cylinder) → Filter(Gray Cylinder) → AEQuery(Material) → Yes (0.92); B: Filter(Small Red Matte Object) → Relate(Right) → Filter(Small Blue Object) → Query(Shape) → Cube (0.85)). Example C aborts at the first operator with "No such object found!"; reading out the nearest object's concepts (Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small ✓) shows that the material prediction caused the failure. Example D aborts with "Ambiguous Referral!" because the referred object is not unique.]
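  A hedged sketch of the "Execution Abort" behaviour seen in Figure 11, assuming a similarity threshold on the Filter operator; the threshold value, exception type, and message format are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class ExecutionAbort(Exception):
    """Raised when a program step cannot be grounded in the scene."""

def checked_filter(obj_feats, concept_emb, concept_name, threshold=0.0):
    # Compare every object against the concept in its embedding space.
    sim = F.cosine_similarity(obj_feats, concept_emb.unsqueeze(0), dim=-1)
    best = int(sim.argmax())
    if sim[best] < threshold:
        # Surface an interpretable trace instead of a silently wrong answer.
        raise ExecutionAbort(
            f"No such object found for '{concept_name}' "
            f"(best similarity {sim[best].item():.2f} on object {best})")
    return best
```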
  • 24. Experiments: Qualitative Evaluation
    • Dataset: VQS [Gan+, 2017]
    – The method can also be applied to real-world images
    • Because CLEVR is generated programmatically, program annotations can be produced automatically, but annotating programs for real images is expensive
    • Since the proposed method needs no program annotations, it is applicable to real-image data as well
    [Paper figure: VQS execution traces. A: "How many zebras are there?" → Filter(Zebra) → Count → 3 ✓; B: "What is the sharp object on the table?" → Knife (0.85) ✓; C: "What kind of dessert is plated?" → Cake (0.68) ✓; D: "What are the kids doing?" → Playing_Frisbee (0.70) ✕, ground truth Playing_Baseball]
  • 25. Conclusion
    • Proposed a framework that, within end-to-end training for Visual QA, learns to recognize object concepts and logic separately
    – Only question-answer pairs are required as supervision
    • Experiments demonstrated the following properties of the method:
    – It is data-efficient, reaching high accuracy with a small amount of training data
    – Rather than merely outputting an answer, it can expose the process by which the answer was reached
  • 26. References
    • Mao, Jiayuan, et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision." in Proc. of ICLR, 2019.
    • Hudson, Drew A., et al. "Compositional attention networks for machine reasoning." in Proc. of ICLR, 2018.
    • Mascharka, David, et al. "Transparency by design: Closing the gap between performance and interpretability in visual reasoning." in Proc. of CVPR, 2018.
    • Yi, Kexin, et al. "Neural-symbolic VQA: Disentangling reasoning from vision and language understanding." in Proc. of NeurIPS, 2018.
    • Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." in Proc. of CVPR, 2017.
    • Gan, Chuang, et al. "VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation." in Proc. of ICCV, 2017.
    • Dong, Li, et al. "Language to logical form with neural attention." in Proc. of ACL, 2016.