1
DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The Neuro-Symbolic Concept Learner: Interpreting Scenes,
Words, and Sentences From Natural Supervision
Kazuki Fujikawa, DeNA
Summary
• Bibliographic information
– The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences
From Natural Supervision
• ICLR 2019
• Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu
• Overview
– Proposes a framework that, within end-to-end training on Visual QA, learns to recognize
object concepts and logic as separate modules
• Only question-answer pairs are required as supervision
– Experiments demonstrate the following properties of the proposed method:
• It is data-efficient, reaching high accuracy from a small amount of training data
• Rather than merely outputting an answer, it can make the reasoning process behind the answer explicit
2
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
3
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
4
Background
• Recognizing the concepts tied to objects (attributes such as color and shape) is important
– When humans answer complex VQA questions, they separate concept information from
logic (e.g., counting)
– Likewise, if a machine learning model can learn and output concept information and logic
separately, it may improve in both data efficiency and interpretability
5
[Figure 1 of the paper: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions (e.g., "What's the color of the object?", "Is there any cube?"), and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as), e.g., "How many objects are right of the red object?". III. Finally, we can interpret complex questions from visual cues by exploiting the compositional structure, e.g., "How many objects are both right of the green cylinder and have the same material as the small blue ball?".]
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
6
Related Work
• Positioning of this work relative to prior approaches (an execution sketch follows the figure note below)
7
– End-to-End (Hudson+ 2018, Mascharka+ 2018, etc.): module separation ×, interpretability △;
supervision: image, question → answer
– Via programs (Yi+ 2018): module separation ○, interpretability ○;
supervision: image → concepts, question → program, concepts + program → answer
– This work: module separation ○, interpretability ○;
supervision: image, question → answer
[Slide diagrams, one per approach in the comparison above, drawn over fragments of Figure 4 of the paper:
– Via programs (Yi+ 2018): Input 1 (image) → NN → intermediate output 1: concepts, a symbolic table (ID 1: Green Cube, ID 2: Red Sphere); Input 2 (question "Q: What is the shape of the red object?") → NN → intermediate output 2: program (Filter(Red) → Query(Shape)); concepts + program → NN → answer ("A: Box").
– End-to-End: image and question → NN → answer, with no intermediate outputs.
– This work: image → NN → intermediate output 1: vectors (per-object features); question → NN → intermediate output 2: program (Filter(Red) → Query(Shape)); → answer.]
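To make the middle column above concrete, here is a minimal sketch of program-based execution in the spirit of Yi+ 2018: a hand-written program, Filter(Red) → Query(Shape), is run against a fully symbolic concept table. The `Obj` record, the `run_program` helper, and the two-operator DSL are hypothetical simplifications; in Yi+ 2018 both the concept table and the program are predicted by separately supervised networks.

```python
# Hypothetical sketch of program-based VQA execution (in the spirit of Yi+ 2018).
# The concept table and the program would each be predicted by a neural network;
# here both are hard-coded to show only the symbolic execution step.
from dataclasses import dataclass

@dataclass
class Obj:
    color: str
    shape: str

def run_program(program, scene):
    """Execute a list of (operator, argument) steps against a symbolic scene."""
    selected = scene                      # start from all objects in the scene
    for op, arg in program:
        if op == "Filter":                # keep objects matching the concept
            selected = [o for o in selected if arg in (o.color, o.shape)]
        elif op == "Query":               # read an attribute of the referent
            assert len(selected) == 1, "Query expects a unique referent"
            return getattr(selected[0], arg)
    return selected

scene = [Obj("Green", "Cube"), Obj("Red", "Sphere")]   # intermediate output 1
program = [("Filter", "Red"), ("Query", "shape")]      # intermediate output 2
print(run_program(program, scene))                     # -> Sphere
```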
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
8
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 1. Supervised learning of the visual and concept embedding spaces (the program generator is frozen)
9
[Slide diagram: Input 1 (image) → Mask R-CNN → ResNet-34 → Visual Feature Space (Obj:1, Obj:2); Input 2 (question "Q: What is the shape of the red object?") → BiGRU-GRU → program Filter(Red) → Query(Shape); Color Embedding Space (Red) and Shape Embedding Space (Cylinder, Sphere, Box); predicted answer "A: Box" vs. ground truth "Sphere"; the error is backpropagated (BP) into the embedding spaces.]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method
[Dong+, 2016] (see the parser sketch after this slide)
③ Obtain the concept embeddings required by the first line of the program (color embeddings)
④ Execute the Filter operation (restrict to the object with the highest cosine similarity to "Red")
⑤ Obtain the concept embeddings required by the second line of the program (shape embeddings)
⑥ Execute the Query operation (retrieve the shape with the highest cosine similarity to Obj: 2)
and output it as the prediction (see the executor sketch below)
⑦ Backpropagate the error against the ground truth and update the embedding spaces
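As a reference point for step ②, the sketch below shows an encoder-decoder parser in the style of [Dong+, 2016]: a bidirectional GRU encodes the question and a GRU decoder greedily emits program tokens. The vocabulary sizes, the missing attention mechanism, and the fixed program length are illustrative simplifications, not the paper's configuration.

```python
# Minimal sketch of a BiGRU-GRU semantic parser (question -> program tokens),
# assuming toy vocabularies; NS-CL's actual parser follows [Dong+, 2016].
import torch
import torch.nn as nn

class Seq2SeqParser(nn.Module):
    def __init__(self, n_words, n_ops, d=64):
        super().__init__()
        self.embed = nn.Embedding(n_words, d)
        self.encoder = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(d, 2 * d)      # state = both encoder directions
        self.op_embed = nn.Embedding(n_ops, d)   # embeddings of program operators
        self.out = nn.Linear(2 * d, n_ops)

    def forward(self, question_ids, max_len=4):
        _, h = self.encoder(self.embed(question_ids))        # h: (2, B, d)
        state = torch.cat([h[0], h[1]], dim=-1)              # (B, 2d)
        tok = torch.zeros(question_ids.size(0), dtype=torch.long)  # <start> = 0
        program = []
        for _ in range(max_len):                             # greedy decoding
            state = self.decoder(self.op_embed(tok), state)
            tok = self.out(state).argmax(dim=-1)
            program.append(tok)
        return torch.stack(program, dim=1)                   # (B, max_len)

parser = Seq2SeqParser(n_words=100, n_ops=10)
question = torch.randint(0, 100, (1, 8))  # "What is the shape of the red object?"
print(parser(question))                   # token ids, e.g. Filter(Red), Query(Shape)
```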
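Steps ③–⑥ are sketched below with randomly initialized tensors standing in for the learned spaces: Filter keeps the object whose visual feature is most cosine-similar to the "Red" concept embedding, and Query names the shape concept most similar to the selected object. The hard argmax is a simplification for readability; the paper executes these operators with differentiable soft masks so that step ⑦ can backpropagate the answer loss into the embedding spaces.

```python
# Minimal sketch of cosine-similarity Filter/Query over concept embeddings.
# Random tensors stand in for Mask R-CNN + ResNet-34 features and for the
# learned concept embeddings; the real model uses soft, differentiable masks.
import torch
import torch.nn.functional as F

D = 64
torch.manual_seed(0)
obj_feats = torch.randn(2, D)            # visual features for Obj:1, Obj:2 (①)

color_concepts = {"Red": torch.randn(D), "Green": torch.randn(D)}        # ③
shape_concepts = {"Cube": torch.randn(D), "Sphere": torch.randn(D),      # ⑤
                  "Cylinder": torch.randn(D)}

def filter_op(feats, concept):           # ④ keep the best-matching object
    sims = F.cosine_similarity(feats, concept.unsqueeze(0), dim=1)
    return feats[sims.argmax()]

def query_op(feat, concept_space):       # ⑥ name the best-matching concept
    names = list(concept_space)
    sims = torch.stack([F.cosine_similarity(feat, concept_space[n], dim=0)
                        for n in names])
    return names[sims.argmax()]

selected = filter_op(obj_feats, color_concepts["Red"])   # Filter(Red)
print(query_op(selected, shape_concepts))                # Query(Shape)
```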
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 2. Reinforcement learning of the program generator (the concept embeddings are frozen)
18
[Slide diagram: same pipeline as in step 1, but the feedback path is labeled "Reinforce" instead of "BP": the correctness of the predicted answer ("A: Box" vs. ground truth "Sphere") serves as the reward.]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method
[Dong+, 2016]
③ Obtain the concept embeddings required by the first line of the program (color embeddings)
④ Execute the Filter operation (restrict to the object with the highest cosine similarity to "Red")
⑤ Obtain the concept embeddings required by the second line of the program (shape embeddings)
⑥ Execute the Query operation (retrieve the shape with the highest cosine similarity to Obj: 2)
and output it as the prediction
⑦ Use correct/incorrect as the reward and update the program generation policy with
REINFORCE (see the sketch below)
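Step ⑦ of this phase is sketched below with a one-operator "program policy": a token is sampled from the parser's output distribution, a 0/1 reward is computed from the answer, and the log-probability of the sample is reinforced. The reward function and the single categorical step are stand-ins for the full parser and executor, and the usual moving-average baseline is omitted.

```python
# Minimal sketch of the REINFORCE update for the program generator: only the
# parser's parameters (here, `logits`) receive gradients; the executor and
# concept embeddings stay frozen. The reward function is a stand-in.
import torch

torch.manual_seed(0)
logits = torch.randn(1, 10, requires_grad=True)    # policy over 10 program ops

def execute_and_check(op):                         # frozen executor + answer
    return 1.0 if op.item() == 3 else 0.0          # check, as a 0/1 reward

dist = torch.distributions.Categorical(logits=logits)
op = dist.sample()                                 # sample a program (one op)
reward = execute_and_check(op)
loss = -dist.log_prob(op) * reward                 # REINFORCE objective
loss.sum().backward()                              # gradient only in the parser
print(op.item(), reward, logits.grad.abs().sum().item())
```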
Proposed Method: Joint Learning of Concepts and Semantic Parsing
• Training proceeds by alternating between 1 and 2
– Following a curriculum learning scheme, the difficulty of the problems is raised step by step
(see the schedule sketch after the figure note below)
20
[Figure 4 of the paper. A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers; scenes and questions of different complexities are illustrated to the learner in an incremental manner, starting from a DSL and executor (Lesson 1: object-based questions, e.g. "What is the shape of the red object?"; Lesson 2: relational questions, e.g. "How many cubes are behind the sphere?"; Lesson 3: more complex questions; Deploy: complex scenes, complex questions). B. Illustrative execution of NS-CL (Step 1: visual parsing into Obj 1–4; Steps 2, 3: semantic parsing and program execution, e.g. Filter(Green Cube) → Relate(Left) → Filter(Red), Filter(Purple Matte), AEQuery(Shape) → No (0.98)): the perception module parses visual scenes into object-based deep representations, the semantic parser parses sentences into executable programs, and a symbolic execution process bridges the two modules.]
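The overall schedule might be organized as in the skeleton below: lessons of increasing difficulty, each alternating between phase 1 (supervised concept learning with the parser frozen) and phase 2 (REINFORCE for the parser with the embeddings frozen). All class and method names here are placeholders, not the authors' code.

```python
# Hypothetical skeleton of the alternating curriculum schedule of Figure 4.
class DummyNSCL:
    """Stand-in exposing the two alternating update steps as no-ops."""
    def freeze_parser(self): pass
    def freeze_embeddings(self): pass
    def concept_step(self, *batch): pass   # phase 1: backprop answer loss (BP)
    def parser_step(self, *batch): pass    # phase 2: REINFORCE on 0/1 reward

LESSONS = ["object-based", "relational", "more complex"]

def train(model, data_by_lesson, epochs_per_lesson=2):
    for lesson in LESSONS:                 # raise question difficulty gradually
        for _ in range(epochs_per_lesson):
            model.freeze_parser()          # 1. supervised concept learning
            for batch in data_by_lesson[lesson]:
                model.concept_step(*batch)
            model.freeze_embeddings()      # 2. policy learning for the parser
            for batch in data_by_lesson[lesson]:
                model.parser_step(*batch)

train(DummyNSCL(), {l: [("image", "question", "answer")] for l in LESSONS})
```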
Outline
• Background
• Related Work
• Proposed Method
• Experiments & Results
21
Experiments: Quantitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– A question-answering dataset over scenes of multiple placed objects such as spheres and cubes
– Train: 70K, Valid: 15K, Test: 15K
– Experiments were run using both the full training set (70K) and a subset (5K)
– Sufficient performance is obtained even with 10% of the data, showing the method's data efficiency
22

QA accuracy on CLEVR (%):
            IEP    MAC    TbD    NS-CL
Overall     96.9   98.9   99.1   99.6
10% data    48.3   85.5   54.7   98.9

Figures from:
http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL-poster.pdf
http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL.pptx
Experiments: Qualitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– One advantage of the proposed method is that it can make the decision process leading to an answer explicit
• When it answers incorrectly, we can see why it went wrong
23
[Figure 11 of the paper: Visualization of the execution trace generated by the Neuro-Symbolic Concept Learner on the CLEVR dataset.
– Example A (success): "Do the cyan cylinder that is behind the gray cylinder and the gray cylinder have the same material?" → Filter(Gray Cylinder) → Relate(Behind) → Filter(Cyan Cylinder), Filter(Gray Cylinder) → AEQuery(Material) → Yes (0.92).
– Example B (success): "There is a small blue object that is to the right of the small red matte object; what shape is it?" → Filter(Small Red Matte Object) → Relate(Right) → Filter(Small Blue Object) → Query(Shape) → Cube (0.85).
– Example C (failure): "What is the color of the big box left of the blue metal cylinder?" → execution aborts at the first operator ("No such object found!"); reading out the visual representation of the nearest object reveals why: Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small ✓.
– Example D (ambiguous program): "What is the color of the big metal object?" → Filter(Big Metal Object) → Query(Color) → execution aborts ("Ambiguous Referral!").]
Experiments: Qualitative Evaluation
• Experiment: VQS dataset [Gan+, 2017]
– The method is also applicable to real-world images
• Because CLEVR is generated programmatically, program annotations can also be produced, whereas
annotating real-world images with programs is costly
• Since the proposed method needs no program annotations, it can also be applied to real-world images
24
[Execution traces on VQS:
– Example A: "How many zebras are there?" → Filter(Zebra) → Count → 3 ✓
– Example B: "What is the sharp object on the table?" → Filter(Table) → Relate(On) → Filter → Query(What) → Knife (0.85) ✓
– Example C: "What kind of desert is plated?" → Filter(Desert, Plated) → Query(Kind) → Cake (0.68) ✓
– Example D: "What are the kids doing?" → Filter(Kids) → Query(What) → Playing_Frisbee (0.70) ✕ (ground truth: Playing_Baseball)]
Conclusion
• Proposed a framework that, within end-to-end training on Visual QA, learns to recognize
object concepts and logic as separate modules
– Only question-answer pairs are required as supervision
• Experiments demonstrated the following properties of the proposed method:
– It is data-efficient, reaching high accuracy from a small amount of training data
– Rather than merely outputting an answer, it can make the reasoning process behind the answer explicit
25
References
• Mao, Jiayuan, et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and
Sentences from Natural Supervision." in Proc. of ICLR, 2019.
• Hudson, Drew A., et al. "Compositional Attention Networks for Machine Reasoning." in Proc. of ICLR,
2018.
• Mascharka, David, et al. "Transparency by Design: Closing the Gap Between Performance and
Interpretability in Visual Reasoning." in Proc. of CVPR, 2018.
• Yi, Kexin, et al. "Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language
Understanding." in Proc. of NeurIPS, 2018.
• Johnson, Justin, et al. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary
Visual Reasoning." in Proc. of CVPR, 2017.
• Gan, Chuang, et al. "VQS: Linking Segmentations to Questions and Answers for Supervised
Attention in VQA and Question-Focused Semantic Segmentation." in Proc. of ICCV, 2017.
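• Dong, Li, and Mirella Lapata. "Language to Logical Form with Neural Attention." in Proc. of ACL,
2016. (the encoder-decoder parser cited above as [Dong+, 2016])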
26

More Related Content

What's hot

[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション
Deep Learning JP
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
Deep Learning JP
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
Deep Learning JP
 
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Deep Learning JP
 
強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心
Shota Imai
 
[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations
Deep Learning JP
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
Deep Learning JP
 
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
Deep Learning JP
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learning
Yusuke Uchida
 
【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?
Masanao Ochi
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Deep Learning JP
 
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
Deep Learning JP
 
[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019
Deep Learning JP
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNA
Kento Doi
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
Deep Learning JP
 
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
Deep Learning JP
 
[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms
Deep Learning JP
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
Taiji Suzuki
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
Deep Learning JP
 
[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks
Deep Learning JP
 

What's hot (20)

[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション[DL輪読会]医用画像解析におけるセグメンテーション
[DL輪読会]医用画像解析におけるセグメンテーション
 
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...[DL輪読会]data2vec: A General Framework for  Self-supervised Learning in Speech,...
[DL輪読会]data2vec: A General Framework for Self-supervised Learning in Speech,...
 
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning   画像×言語の大規模基盤モ...
【DL輪読会】Flamingo: a Visual Language Model for Few-Shot Learning 画像×言語の大規模基盤モ...
 
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[DL輪読会]NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心
 
[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations[DL輪読会]Neural Ordinary Differential Equations
[DL輪読会]Neural Ordinary Differential Equations
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
 
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs 【DL輪読会】Perceiver io  a general architecture for structured inputs & outputs
【DL輪読会】Perceiver io a general architecture for structured inputs & outputs
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learning
 
【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?【論文紹介】How Powerful are Graph Neural Networks?
【論文紹介】How Powerful are Graph Neural Networks?
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
 
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
[DL輪読会]Transframer: Arbitrary Frame Prediction with Generative Models
 
[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019[DL輪読会]Temporal Abstraction in NeurIPS2019
[DL輪読会]Temporal Abstraction in NeurIPS2019
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNA
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
 
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
【DL輪読会】GAN-Supervised Dense Visual Alignment (CVPR 2022)
 
[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms[DL輪読会]representation learning via invariant causal mechanisms
[DL輪読会]representation learning via invariant causal mechanisms
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
 
[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks[DL輪読会]Weight Agnostic Neural Networks
[DL輪読会]Weight Agnostic Neural Networks
 

Similar to [DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
cvpaper. challenge
 
深層学習入門
深層学習入門深層学習入門
深層学習入門
Danushka Bollegala
 
Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )
cvpaper. challenge
 
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
Deep Learning JP
 
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
Deep Learning JP
 
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
Deep Learning JP
 
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation 「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Takumi Ohkuma
 
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Deep Learning JP
 

Similar to [DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision (8)

Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
Visual Question Answering (VQA) - CVPR2018動向分析 (CVPR 2018 完全読破チャレンジ報告会)
 
深層学習入門
深層学習入門深層学習入門
深層学習入門
 
Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )Vision and Language(メタサーベイ )
Vision and Language(メタサーベイ )
 
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
 
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
[DL輪読会]YOLOv4: Optimal Speed and Accuracy of Object Detection
 
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
 
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation 「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
「解説資料」ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
【DL輪読会】ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 

More from Deep Learning JP

【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
Deep Learning JP
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
Deep Learning JP
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
Deep Learning JP
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
Deep Learning JP
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
Deep Learning JP
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
Deep Learning JP
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
Deep Learning JP
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
Deep Learning JP
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
Deep Learning JP
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
Deep Learning JP
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
Deep Learning JP
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
Deep Learning JP
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
Deep Learning JP
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
Deep Learning JP
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
Deep Learning JP
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
Deep Learning JP
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
Deep Learning JP
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
Deep Learning JP
 
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
Deep Learning JP
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
Deep Learning JP
 

More from Deep Learning JP (20)

【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて【DL輪読会】事前学習用データセットについて
【DL輪読会】事前学習用データセットについて
 
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
【DL輪読会】 "Learning to render novel views from wide-baseline stereo pairs." CVP...
 
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
【DL輪読会】Zero-Shot Dual-Lens Super-Resolution
 
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
【DL輪読会】BloombergGPT: A Large Language Model for Finance arxiv
 
【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM【DL輪読会】マルチモーダル LLM
【DL輪読会】マルチモーダル LLM
 
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo... 【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
【 DL輪読会】ToolLLM: Facilitating Large Language Models to Master 16000+ Real-wo...
 
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
【DL輪読会】AnyLoc: Towards Universal Visual Place Recognition
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
 
【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について【DL輪読会】Hopfield network 関連研究について
【DL輪読会】Hopfield network 関連研究について
 
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
【DL輪読会】SimPer: Simple self-supervised learning of periodic targets( ICLR 2023 )
 
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
【DL輪読会】RLCD: Reinforcement Learning from Contrast Distillation for Language M...
 
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
【DL輪読会】"Secrets of RLHF in Large Language Models Part I: PPO"
 
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "【DL輪読会】"Language Instructed Reinforcement Learning  for Human-AI Coordination "
【DL輪読会】"Language Instructed Reinforcement Learning for Human-AI Coordination "
 
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
【DL輪読会】"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"
 
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
【DL輪読会】Parameter is Not All You Need:Starting from Non-Parametric Networks fo...
 
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
【DL輪読会】Drag Your GAN: Interactive Point-based Manipulation on the Generative ...
 
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
【DL輪読会】Self-Supervised Learning from Images with a Joint-Embedding Predictive...
 
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
【DL輪読会】Towards Understanding Ensemble, Knowledge Distillation and Self-Distil...
 
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
【DL輪読会】VIP: Towards Universal Visual Reward and Representation via Value-Impl...
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
 

Recently uploaded

【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow
Sony - Neural Network Libraries
 
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdfFIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance
 
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
Matsushita Laboratory
 
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
harmonylab
 
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
Fukuoka Institute of Technology
 
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdfFIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance
 
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
atsushi061452
 
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
atsushi061452
 
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdfFIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance
 
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
iPride Co., Ltd.
 
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアルLoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
CRI Japan, Inc.
 
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
yassun7010
 
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdfFIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdfFIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance
 
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
NTT DATA Technology & Innovation
 

Recently uploaded (15)

【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow【AI論文解説】Consistency ModelとRectified Flow
【AI論文解説】Consistency ModelとRectified Flow
 
FIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdfFIDO Alliance Osaka Seminar: CloudGate.pdf
FIDO Alliance Osaka Seminar: CloudGate.pdf
 
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
TaketoFujikawa_物語のコンセプトに基づく情報アクセス手法の基礎検討_JSAI2024
 
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
【DLゼミ】XFeat: Accelerated Features for Lightweight Image Matching
 
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
単腕マニピュレータによる 複数物体の同時組み立ての 基礎的考察 / Basic Approach to Robotic Assembly of Multi...
 
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdfFIDO Alliance Osaka Seminar: Welcome Slides.pdf
FIDO Alliance Osaka Seminar: Welcome Slides.pdf
 
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
論文紹介: Exploiting semantic segmentation to boost reinforcement learning in vid...
 
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
論文紹介: Offline Q-Learning on diverse Multi-Task data both scales and generalizes
 
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdfFIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
FIDO Alliance Osaka Seminar: PlayStation Passkey Deployment Case Study.pdf
 
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
MPAなWebフレームワーク、Astroの紹介 (その2) 2024/05/24の勉強会で発表されたものです。
 
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアルLoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
LoRaWAN 4チャンネル電流センサー・コンバーター CS01-LB 日本語マニュアル
 
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
2024年度_サイバーエージェント_新卒研修「データベースの歴史」.pptx
 
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdfFIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
FIDO Alliance Osaka Seminar: LY-DOCOMO-KDDI-Mercari Panel.pdf
 
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdfFIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
FIDO Alliance Osaka Seminar: NEC & Yubico Panel.pdf
 
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
YugabyteDB適用に向けた取り組みと隠れた魅力 (DSS Asia 2024 発表資料)
 

[DL輪読会]The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

  • 1. 1 DEEP LEARNING JP [DL Papers] http://deeplearning.jp/ The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision Kazuki Fujikawa, DeNA
  • 2. サマリ • 書誌情報 – The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision • ICLR2019 • Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu • 概要 – Visual QAの問題に対するEnd-to-End学習の中で、物体のコンセプトやロジックの認識を 分離して学習する枠組みを提案 • 教師データは質問と回答のペアのみ必要とする – 実験で提案手法の以下の特性を示した • データ効率が良いアルゴリズムであり、少量データで高精度に到達することを実験で示した • 単に回答を出力するのではなく、回答に至るプロセスを明示できることを示した 2
  • 3. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 3
  • 4. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 4
  • 5. 背景 • 物体に紐づくコンセプト(色・形などの属性)を認識することは重要 – 人間がVQAの複雑な質問に答える場合、コンセプト情報とロジック(カウント作業など) を分離して考える – 機械学習モデルも同様で、コンセプト情報とロジックを分離して学習・出力できると、 データ効率・解釈性の面で改善できる可能性がある 5 Published asaconference paper at ICLR 2019 Q: What’s the color of the object? A: Red. Q: Is there any cube? A: Yes. Q: What’s the color of the object? A: Green. Q: Is there any cube? A: Yes. Q: How many objects are right of the red object? A: 2. Q: How many objects have the same material as the cube? A: 2 Q: How many objects are both right of the green cylinder and have the same material as the small blue ball? A: 3 I. Learning basic, object-based concepts. II. Learning relational concepts based on referential expressions. III. Interpret complex questions from visual cues. Figure 1: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions, and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as). III Finally, we can interpret complex questions from visual cues by exploiting thecompositional structure.
  • 6. アウトライン • 背景 • 関連研究 • 提案手法 • 実験・結果 6
  • 7. 関連研究 • 関連研究と本研究の位置付け 7 End-to-End Programを介するアプローチ 本研究 Hudson+ 2018, Mascharka+ 2018, etc. Yi+ 2018 モジュール分離 × ○ ○ 解釈性 △ ○ ○ 教師データ 画像, 質問文 → 回答 画像 → コンセプト、質問文 → プログラム コンセプト, プログラム → 回答 画像, 質問文 → 回答 Obj:2 Published as aconference paper at ICLR 2019 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions. Lesson2: Relational questions. Lesson3: More complex questions. Deploy: complex scenes, complex questions Q: Does the red object left of the green cube have the same shape as the purple matte thing? 1 3 Obj 1 Obj 2 Obj 3 Obj 4 Step1: Visual Parsing Step2, 3: Semantic Parsing and Program Filter Green Cub Program Representations Relate Object 2 Left Filter Red Filter Purple Ma AEQuery Object 1 Object 3 Shape Concept A. Curriculum concept learning B. Illustrative execution of NS- Figure4: A. Demonstration of thecurriculum learning of visual concepts, words, and sema of sentences by watching images and reading paired questions and answers. Scenes and q 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN 中間出力1: コンセプト ID Color Shape 1 Green Cube 2 Red Sphere NN 中間出力2: プログラム NN Filter(Red) ↓ Query(Shape) Published as aconference paper at ICLR 2019 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions. Lesson2: Relational questions. Lesson3: More complex questions. Deploy: complex scenes, complex questions Q: Does the red object left of the green cube have the same shape as the purple matte thing? 1 2 3 4 Obj 1 Obj 2 Obj 3 Obj 4 Step1: Visual Parsing Step2, 3: Semantic Parsing and Program Execution Filter Green Cube Program Representations Outputs Relate Object 2 Left Filter Red Filter Purple Matte AEQuery Object 1 Object 3 Shape No (0.98) Concepts A. Curriculum concept learning B. Illustrative execution of NS-CL Figure4: A. Demonstration of thecurriculum learning of visual concepts, words, and semantic parsing 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN Published as aconference paper at ICLR 201 Q: What is the shape of the red object? A: Cube. Q: How many cubes are behind the sphere? A: 3 Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No. Initialized with DSL and executor. Lesson1: Object-based questions Lesson2: Relational questions. Lesson3: More complex questions Deploy: complex scenes, complex A. Curriculum concept learning Figure4: A. Demonstration of thecurriculum of sentences by watching images and reading 入力1: 画像データ 入力2: 質問文 Q: What is the shape of the red object ? 出力: 回答 A: Box NN 中間出力1: ベクトル NN 中間出力2: プログラム NN Filter(Red) ↓ Query(Shape) Obj:1 Green Red
  • 8. Outline
    • Background
    • Related Work
    • Proposed Method
    • Experiments & Results
  • 9. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Stage 1: supervised learning of the visual and concept embedding spaces (program generator fixed)
    [Slide diagram: Input 1 (image) → Mask R-CNN + ResNet-34 → visual feature space (Obj 1, Obj 2); Input 2 (question "What is the shape of the red object?") → BiGRU-GRU → program Filter(Red) → Query(Shape); color embedding space (Red) and shape embedding space (Cylinder, Sphere, Box); prediction A: Box vs. ground truth Sphere → backprop (BP)]
    ① Mask R-CNN detects object regions in the image; ResNet-34 extracts a visual feature for each object
    ② An encoder-decoder (BiGRU-GRU) semantic parser [Dong+, 2016] generates a program from the question
    ③ Retrieve the concept embedding required by line 1 of the program (the color embedding)
    ④ Execute Filter (restrict to the object with the highest cosine similarity to "Red")
    ⑤ Retrieve the concept embedding required by line 2 of the program (the shape embedding)
    ⑥ Execute Query (pick the shape with the highest cosine similarity to Obj 2) and output it as the prediction (steps ③-⑥ are sketched in code below)
    ⑦ Backpropagate the error against the ground-truth answer and update the embedding spaces
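  A minimal PyTorch sketch of how the Filter/Query execution in steps ③-⑥ could work, assuming cosine similarity against learned concept embeddings. All names (filter_op, query_op, obj_feats, ...) are illustrative rather than the authors' code, and soft masks stand in for the hard argmax so that the backprop in step ⑦ stays possible (the paper likewise uses differentiable scores).

```python
import torch
import torch.nn.functional as F

def filter_op(obj_feats, obj_mask, concept_emb, tau=0.25):
    """Keep objects similar to a concept (e.g., "Red") in its embedding space.
    obj_feats:   (N, D) per-object features projected into the concept space
    obj_mask:    (N,)   soft mask over objects from the previous program line
    concept_emb: (D,)   learned embedding of the queried concept
    """
    sim = F.cosine_similarity(obj_feats, concept_emb.unsqueeze(0), dim=-1)
    score = torch.sigmoid(sim / tau)   # sharpen into per-object probabilities
    return obj_mask * score            # intersect with the incoming mask

def query_op(obj_feats, obj_mask, answer_embs):
    """Read out an attribute (e.g., Shape) of the currently selected object.
    answer_embs: (K, D) embeddings of the K candidate answers
                 (e.g., Cylinder / Sphere / Box)
    """
    attn = F.softmax(obj_mask, dim=0)        # softly select one object
    selected = attn @ obj_feats              # (D,) feature of that object
    sim = F.cosine_similarity(answer_embs, selected.unsqueeze(0), dim=-1)
    return F.softmax(sim, dim=0)             # distribution over answers

# Program "Filter(Red) -> Query(Shape)" on a toy 2-object scene:
N, D = 2, 64
color_feats, shape_feats = torch.randn(N, D), torch.randn(N, D)
red_emb = torch.randn(D, requires_grad=True)        # step 3: color embedding
shape_embs = torch.randn(3, D, requires_grad=True)  # step 5: shape embeddings
mask = filter_op(color_feats, torch.ones(N), red_emb)       # step 4
answer_probs = query_op(shape_feats, mask, shape_embs)      # step 6
loss = F.nll_loss(answer_probs.log().unsqueeze(0), torch.tensor([1]))
loss.backward()  # step 7: gradients flow into the embedding spaces
```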
  • 18. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Stage 2: reinforcement learning of the program generator (concept embeddings fixed)
    [Slide diagram: same pipeline as Stage 1, but the learning signal flows back into the BiGRU-GRU parser via REINFORCE rather than backprop]
    ①-⑥ Same forward pass as in Stage 1
    ⑦ Using correct / incorrect as the reward, update the program-generation policy with REINFORCE (see the sketch below)
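  A hedged sketch of the Stage-2 REINFORCE update in step ⑦. Here parser.sample and executor are assumed interfaces standing in for the seq2seq parser and the (frozen) quasi-symbolic executor, and the constant baseline is a simplification of the variance reduction used in practice.

```python
import torch

def reinforce_step(parser, executor, optimizer, question, scene, gold_answer,
                   baseline=0.5):
    # Sample a program from the seq2seq parser; log_prob is the summed
    # log-probability of the sampled tokens (assumed parser API).
    program, log_prob = parser.sample(question)

    # Execute against the frozen concept embeddings; no gradients flow
    # through the executor in this stage.
    with torch.no_grad():
        predicted = executor(program, scene)

    reward = 1.0 if predicted == gold_answer else 0.0

    # REINFORCE: descend on -(reward - baseline) * log pi(program | question).
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```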
  • 20. Proposed Method: Joint Learning of Concepts and Semantic Parsing
    • Training proceeds by alternating Stage 1 and Stage 2
    – A curriculum-learning scheme gradually raises question difficulty: Lesson 1 object-based questions ("What is the shape of the red object?") → Lesson 2 relational questions ("How many cubes are behind the sphere?") → Lesson 3 more complex questions → deployment on complex scenes and questions (see the training-loop sketch below)
    [Paper Figure 4: A. curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers, with scenes and questions of increasing complexity; B. illustrative execution of NS-CL: Step 1 visual parsing into objects, Steps 2-3 semantic parsing and program execution (Filter → Relate → Filter → AEQuery → "No (0.98)")]
  • 21. Outline
    • Background
    • Related Work
    • Proposed Method
    • Experiments & Results
  • 22. Experiments: Quantitative Evaluation
    • Dataset: CLEVR [Johnson+, 2017]
    – Question answering over rendered scenes containing multiple objects such as spheres and cubes
    – Train: 70K, Valid: 15K, Test: 15K
    – Experiments were run with the full training set (70K) and with a subset (5K)
    – The method performs well even with 10% of the data, demonstrating its data efficiency
    QA accuracy on CLEVR (overall):   IEP 96.9 / MAC 98.9 / TbD 99.1 / NS-CL 99.6
    QA accuracy with 10% of the data: IEP 48.3 / MAC 85.5 / TbD 54.7 / NS-CL 98.9
    Figures from: http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL-poster.pdf and http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL.pptx
  • 23. Experiments: Qualitative Evaluation
    • Dataset: CLEVR [Johnson+, 2017]
    – One merit of the proposed method is that it can expose the decision process leading to the answer
    • When the model answers incorrectly, we can see why it went wrong (a sketch of the abort behaviour follows)
    [Paper Figure 11: execution traces of NS-CL on CLEVR. Examples A and B succeed (A: Filter(Gray Cylinder) → Relate(Behind) → Filter(Cyan Cylinder) → Filter(Gray Cylinder) → AEQuery(Material) → Yes (0.92); B: Filter(Small Red Matte Object) → Relate(Right) → Filter(Small Blue Object) → Query(Shape) → Cube (0.85)). Example C aborts at the first operator with "No such object found!"; reading out the nearest object's concepts (Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small ✓) shows that the material prediction caused the failure. Example D aborts with "Ambiguous Referral!" because the referred object is not unique.]
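  A hedged sketch of the "Execution Abort" behaviour seen in Figure 11, assuming a similarity threshold on the Filter operator; the threshold value, exception type, and message format are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class ExecutionAbort(Exception):
    """Raised when a program step cannot be grounded in the scene."""

def checked_filter(obj_feats, concept_emb, concept_name, threshold=0.0):
    # Compare every object against the concept in its embedding space.
    sim = F.cosine_similarity(obj_feats, concept_emb.unsqueeze(0), dim=-1)
    best = int(sim.argmax())
    if sim[best] < threshold:
        # Surface an interpretable trace instead of a silently wrong answer.
        raise ExecutionAbort(
            f"No such object found for '{concept_name}' "
            f"(best similarity {sim[best].item():.2f} on object {best})")
    return best
```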
  • 24. Experiments: Qualitative Evaluation
    • Dataset: VQS [Gan+, 2017]
    – The method can also be applied to real-world images
    • Because CLEVR is generated programmatically, program annotations can be produced automatically, but annotating programs for real images is expensive
    • Since the proposed method needs no program annotations, it is applicable to real-image data as well
    [Paper figure: VQS execution traces. A: "How many zebras are there?" → Filter(Zebra) → Count → 3 ✓; B: "What is the sharp object on the table?" → Knife (0.85) ✓; C: "What kind of dessert is plated?" → Cake (0.68) ✓; D: "What are the kids doing?" → Playing_Frisbee (0.70) ✕, ground truth Playing_Baseball]
  • 25. Conclusion
    • Proposed a framework that, within end-to-end training for Visual QA, learns to recognize object concepts and logic separately
    – Only question-answer pairs are required as supervision
    • Experiments demonstrated the following properties of the method:
    – It is data-efficient, reaching high accuracy with a small amount of training data
    – Rather than merely outputting an answer, it can expose the process by which the answer was reached
  • 26. References
    • Mao, Jiayuan, et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision." in Proc. of ICLR, 2019.
    • Hudson, Drew A., et al. "Compositional attention networks for machine reasoning." in Proc. of ICLR, 2018.
    • Mascharka, David, et al. "Transparency by design: Closing the gap between performance and interpretability in visual reasoning." in Proc. of CVPR, 2018.
    • Yi, Kexin, et al. "Neural-symbolic VQA: Disentangling reasoning from vision and language understanding." in Proc. of NeurIPS, 2018.
    • Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." in Proc. of CVPR, 2017.
    • Gan, Chuang, et al. "VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation." in Proc. of ICCV, 2017.
    • Dong, Li, et al. "Language to logical form with neural attention." in Proc. of ACL, 2016.