視覚と対話の融合研究

視覚と対話の融合研究
オムロンサイニックエックス株式会社
牛久祥孝
losnuevetoros

自己紹介
2014.3 博士(情報理工学)、東京大学
2014.4～2016.3 NTT CS研研究員
2016.4～2018.9 東京大学講師 (原田・牛久研究室)
2016.9～産業技術総合研究所協力研究員
2016.12～2018.9 国立国語研究所共同研究員
2018.10～オムロンサイニックエックス株式会社
Principal Investigator
[Ushiku+, ACMMM 2012]
[Ushiku+, ICCV 2015]
画像キャプション生成主観的な感性表現を持つ
画像キャプション生成
動画の特定区間と
キャプションの相互検索
[Yamaguchi+, ICCV 2017]
A guy is skiing with no shirt on
and yellow snow pants.
A zebra standing in a field with
a tree in the dirty background.
[Shin+, BMVC 2016]
A yellow train on the tracks near
a train station.

2011
2012
2014
電話音声認識のエラー率が
30%程度→20%以下に
[Seide+, InterSpeech 2011]
大規模画像分類のエラー率が
25%程度→15%程度に
[Krizhevsky+, NIPS 2012]
LSTMで英仏翻訳の精度が
複雑なシステムと同等に
[Sutskever+, NIPS 2014]

2012年：一般物体認識における激震
2012年の画像
認識タスクで
ディープ勢が
2位以下に圧勝!
2012年の画像
認識タスクで
ディープ勢が
2012年の画像
認識タスクで
ディープ勢が

公式サイトにアクセスしてみると…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]

公式サイトにアクセスしてみると…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]
It’s me!!

入力
出力
Deep Learning の影響
• 機械翻訳でも深層学習が登場 [Sutskever+, NIPS 2014]
– RNNで問題になっていた勾配の消失をLSTM
[Hochreiter+Schmidhuber, 1997] で解決
→文中の離れた単語間での関係を扱えるように
– LSTMを4層つなぎ、end-to-endで機械学習
→state-of-the-art並み（英仏翻訳）
CNN/RNNなどの共通技術が台頭
画像認識や機械翻訳の参入障壁が低下

ユーザー生成コンテンツの爆発的増加
特にコンテンツ投稿・共有サービスでは…
• Facebookに画像が2500億枚 (2013年9月時点)
• YouTubeにアップロードされる動画
1分間で計400時間分 (2015年7月時点)
Pōhutukawa blooms this
time of the year in New
Zealand. As the flowers
fall, the ground
underneath the trees look
spectacular.
画像/動画と
関連する文章の対
→大量に収集可能

ユーザー生成コンテンツの爆発的増加
特にコンテンツ投稿・共有サービスでは…
• Facebookに画像が2500億枚 (2013年9月時点)
• YouTubeにアップロードされる動画
1分間で計400時間分 (2015年7月時点)
Pōhutukawa blooms this
time of the year in New
Zealand. As the flowers
fall, the ground
underneath the trees look
spectacular.
画像/動画と
関連する文章の対
→大量に収集可能
これらの背景から…
つぎのような様々な取り組みが！

画像キャプション生成
Group of people sitting
at a table with a dinner.
Tourists are standing on
the middle of a flat desert.
[Ushiku+, ICCV 2015]

動画キャプション生成
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]

ビジュアル質問応答
[Fukui+, EMNLP 2016]

本講演の目的
視覚×対話の融合研究へ至る道の俯瞰
1. 画像・動画キャプション生成
2. 画像に関する質問への応答
3. Vision-aware Dialog

視覚と対話の融合研究1
画像・動画キャプション生成

マルチキーフレーズ推定アプローチ
当時の問題＝使用候補であるフレーズの精度が悪い
キーフレーズを独立なラベルとして扱うと…
マルチキーフレーズの推定＝一般画像認識
文生成は[Ushiku+, ACM MM 2011]と同じ
[Ushiku+, ACM MM 2012]

Google NIC [Vinyals+, CVPR 2015]
Googleで開発された
• GoogLeNet [Szegedy+, CVPR 2015]
• LSTM [Sutskever+, NIPS 2014]
を直列させて文生成する。
画像𝐼への文（単語列）𝑆0 … 𝑆 𝑁は
𝑆0: スタートを意味する単語
𝑆1 = LSTM CNN 𝐼
𝑆𝑡 = LSTM St−1 , 𝑡 = 2 … 𝑁 − 1
𝑆 𝑁: ストップを意味する単語

生成された説明文の例
[https://github.com/tensorflow/models/tree/master/im2txt]

Deep Learning による動画キャプション生成
• LRCN
[Donahue+, CVPR 2015]
– CNN+RNN
• 動作認識
• 画像/動画
キャプション生成
• Video to Text
[Venugopalan+, ICCV 2015]
– CNN+RNN
• RGB画像で物体を
• オプティカルフローで
動作を
認識→キャプション生成

評価方法とデータセット

どれがどれくらい良いキャプションなのか？
CoSMoS [Ushiku et al., ICCV 2015]
Group of people sitting at a table with a dinner.
Corpus-Guided [Yang et al., EMNLP 2011]
Three people are showing the bottle on the street
Midge [Mitchel et al., EACL 2012]
people with a bottle at the table
アンケートによる比較：相対的な良さの評価
• 毎回ほかの手法と比較してもらわなければならない
• 絶対的なキャプションの良さの評価がほしい

定量評価指標
機械翻訳では…
• テスト文に複数の参照訳が付随（通常5文）
• これらの参照訳と近い訳文が「良い」
• 既存の評価指標（BLEUやMETEOR、ROUGEなど）
• キャプション生成の評価指標（CIDErやSPICEなど）
One jet lands at an airport while another takes off next to it.
Two airplanes parked in an airport.
Two jets taxi past each other.
Two parked jet airplanes facing opposite directions.
two passenger planes on a grassy plain
キャプション生成の評価でも同様の流れ
PASCAL Sentenceの画像と参照キャプションの例

データセット
Webからクロールしてきたもの
• SBU Captioned Image [Ordonez+, NIPS 2011]
100万枚のFlickr画像、1キャプション/画像
• YFCC-100M [Thomee+, 2015]
1億枚のFlickr画像＋動画、一部の画像にキャプショ
ン
• Déjà Images [Chen+, ACL 2015]
1つのキャプションに複数の画像が紐づいている

データセット
クラウドソーシングを用いたもの
• PASCAL Sentence, Flickr 8k/30k (すべてUIUCから)
それぞれ1000/8000/30000枚の画像、5キャプション/画像
• Abstract Scene Dataset [Zitnick+Parikh, CVPR 2013]
10000枚のクリップアート、6キャプション/画像
• MS COCO [Lin+, 2014]
10万超の画像、5キャプション/画像
• MSR Dense Visual Annotation Corpus [Yatskar+, *SEM
2014]
500枚の画像に100,000の矩形領域+キャプション
• PASCAL-50S, ABSTRACT-50S [Vedantam+, CVPR 2015]
より人間らしい評価のために作成、50キャプション/画像
• Visual Genome [Krishna+, IJCV 2017]
10万超の画像にキャプションやQAなどが密に付随

データセット
クラウドソーシングを用いたもの
• PASCAL Sentence, Flickr 8k/30k (すべてUIUCから)
それぞれ1000/8000/30000枚の画像、5キャプション/画像
• Abstract Scene Dataset [Zitnick+Parikh, CVPR 2013]
10000枚のクリップアート、6キャプション/画像
• MS COCO [Lin+, 2014]
10万超の画像、5キャプション/画像
• MSR Dense Visual Annotation Corpus [Yatskar+, *SEM
2014]
500枚の画像に100,000の矩形領域+キャプション
• PASCAL-50S, ABSTRACT-50S [Vedantam+, CVPR 2015]
より人間らしい評価のために作成、50キャプション/画像
• Visual Genome [Krishna+, IJCV 2017]
10万超の画像にキャプションやQAなどが密に付随
特にMS COCOとVisual Genomeは…
他のデータセットのベースになる
ことが非常に多い

日本語を英語にGoogle翻訳してみる
この結果を日本語にGoogle翻訳してみる

精度の発展：「行って帰ってくる損失」
• 機械学習
出力キャプション→入力画像を再推定
cf. CycleGAN[Zhu+, ICCV 2017]
変分自己符号化器
[Pu+, NIPS 2016]
出力キャプションで領域検索
[Luo+Shakhnarovich, CVPR 2017]

精度の発展：アテンションモデル
• 2分野が融合して新たに生まれたものの例：
– アテンションモデルの利用 [Xu+, ICML 2015]
– 画像+キャプションデータのみからの学習！
– 動画：時間方向のアテンション[Laokulrat+, COLING 2016]

ここまでの問題点1:最適化したい目的関数
学習に用いるのは Cross-Entropy
評価に用いるのは BLEUなどの評価指標
→ 生成したキャプションの評価指標を
直接最適化するべきでは？
• 評価指標の直接最適化
– 機械翻訳ではディープ以前からある [Och, ACL 2003]
• 深層学習で評価指標を直接最適化…？
– 勾配が求められないから学習できない！！
短い文へのペナルティ
N-gramのPrecision

ここまでの問題点2: Exposure Bias
通常のRNNによる系列生成モデル学習では…
• 学習時：Teacher forcing
– 入力は𝑡 − 1番目までの
教師データ
• テスト時：Free running
– 入力は𝑡 − 1番目までで
自身が推定したデータ
テスト時の生成系列が学習時から外れだすと
エラーが蓄積し続ける

精度の発展：強化学習の利用
• 強化学習を利用したキャプション生成
評価指標を報酬とする強化学習を導入すれば
問題点1と2が同時に解決されるはず！
– 方策勾配：評価指標の勾配が分からなくても、
出力の事後確率の勾配でOK→問題点１
– 評価指標を利用すれば、Teacher forcingしない
学習も可能→問題点２
state
RNNの状態変数
action
単語系列の推定
reward
評価指標
environment
画像特徴と生成中のキャプション
[Ranzato+, ICLR 2016][Rennie+, CVPR 2017]

問題の発展：より細かいキャプション生成
[Lin+, BMVC 2015] [Johnson+, CVPR 2016]

問題の発展：キャプション列生成
アルバムのような系列画像に対して
The family
got
together for
a cookout.
They had a
lot of
delicious
food.
The dog
was happy
to be there.
They had a
great time
on the
beach.
They even
had a swim
in the water.
[Park+Kim, NIPS 2015][Huang+, NAACL 2016]

A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
シーンの切り替わる動画に対して

A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]

問題の発展：主観的な表現
感性語Sentiment Termを重視したキャプション生成
ニュートラルな文
ポジティブな文
（生成した例）
[Mathews+, AAAI 2016][Shin+, BMVC 2016]

画像に関する質問への応答

Visual Question Answering (VQA)
最初はユーザインタフェース分野で注目
• VizWiz [Bigham+, UIST 2010]
AMTで人力解決
• 初の自動化（ディープラーニング不使用）
[Malinowski+Fritz, NIPS 2014]
• 類似用語：Visual Turing Test [Malinowski+Fritz, 2014]

VQA: Visual Question Answering
• ビジュアル質問応答を分野として確立
– ベンチマークデータセットの提供
– ベースとなるパイプラインでの実験
• ポータルサイトも運営
– http://www.visualqa.org/
– 国際コンペティションも開催
[Antol+, ICCV 2015]
What color are her eyes?
What is the mustache made of?

VQA Dataset
AMT で質問と回答を収集
• 10万超の実画像、3万超のアニメ調画像
• 計70万弱の質問＋それぞれ10の模範回答

VQA=多クラス分類問題
表現ベクトル𝑍𝐼+𝑄以降は通常のクラス識別
質問文𝑄
What objects are
found on the bed?
応答𝐴
bed sheets, pillow
画像𝐼
画像特徴量
𝑥𝐼
質問特徴量
𝑥 𝑄
統合された
表現ベクトル
𝑧𝐼+𝑄

その後の展開：統合方法
「統合された表現ベクトル 𝑧𝐼+𝑄」の工夫
• VQA [Antol+, ICCV 2015]：そのまま直列に並べる
• 「和」グループ
例 Attentionで重みづけ和をとった画像特徴と
質問特徴を単純に足す [Xu+Saenko, ECCV 2016]
• 「積」グループ
例双線形積とフーリエ変換を組み合わせる
[Fukui+, EMNLP 2016]
• 「和」と「積」のハイブリッド
例要素毎の積と要素毎の和を直列に並べる
[Saito+, ICME 2017]
𝑧𝐼+𝑄 =
𝑥𝐼
𝑥 𝑄
𝑥𝐼 𝑥 𝑄
𝑥𝐼 𝑥 𝑄𝑧𝐼+𝑄 =
𝑧𝐼+𝑄 =
𝑧𝐼+𝑄 =
𝑥𝐼 𝑥 𝑄
𝑥𝐼 𝑥 𝑄

その後の展開：アテンション
• 2017年SOTA [Anderson+, CVPR 2018]
– これまで：Top-down領域の
画像にアテンション
– Bottom-upとTow-down領域の
両方にアテンション
• 2018年SOTA [Nguyen+Okatani, CVPR 2018]
– これまで：画像に対し
アテンション
– 質問特徴と画像特徴の
両方にアテンション
Bottom-upTop-down

VQA Challenge
コンペティション参加チームの解答例から
Q: What is the woman holding?
GT A: laptop
Machine A: laptop
Q: Is it going to rain soon?
GT A: yes
Machine A: yes

VQA Challenge
コンペティション参加チームの解答例から
Q: Why is there snow on one
side of the stream and clear
grass on the other?
GT A: shade
Machine A: yes
Q: Is the hydrant painted a new
color?
GT A: yes
Machine A: no

VisWiz Grand Challenge
視覚障碍者用のVQAチャレンジ
• 視覚障碍者がスマフォで実際に撮影
• 質問31,000件、一部は回答不可能が正解（下側）
[Gurari+, CVPR 2018]

EmbodiedQA
質問応答のために探索が必要な問題
著者自らがその後、階層的な方策を獲得するA3C
ベースの強化学習を提案 [Das+, CoRL 2018]
[Das+, CVPR 2018]

その他にも…
• VQS
– ビジュアル質問に関連する
領域で答える
• MovieQA [Tapaswi+, CVPR 2016]
– 映画についてのQA

その他にも…
• QuAC [Choi+, EMNLP 2018]
– Question Answering
in Context
– Wikipedia記事で
対話形式のQAを
整備
• Textbook Question Answering
– 教科書をデータセット
として整備
→外部知識は不要
– Machine Comprehension
では一文探せば回答可能
→より高度な問題を用意

Visual Question Generation
• 視覚的質問生成
（VQG）の提案
質問生成は画像
キャプションの
生成か検索を検討
• Qに対する要求
– その質問から会話が始まるような質問
– 画像を見てわかるような質問ではだめ
✗ How many horses
are in the field?
✓ Who won the race?
[Mostafazadeh+, ACL 2016]

未知物体についてのVQG
画像認識器が知らない物体: 人から教わりたい
• 質問なら何でもいいわけじゃない
• 「なにこれ？」のような曖昧な質問だと…
回答も「物体」のように曖昧になりそう
• 学習して自動生成できた質問の例
What is the
woman
holding in
her right
hand?
What type
of shirt is
the man
wearing?
What in
on the
man’s
lap?
?
[Uehara+, ECCV 2018]

その他にも…
• Visual Discriminative Question Generation
[Li+, ICCV 2017]
ペアになっている画像に
– DiscriminativeなQ
– そうでないQ
を付与→生成モデルを学習
• Inverse Visual Question Answering
[Liu+, CVPR 2018]
入力：
画像と解答
出力：
適切な質問

Vision-Aware Dialog

Vision-Aware Dialog
エージェントとユーザー以外に視覚的な情報が存在
研究を大別すると…
• データセットの提供
VisDial [Das+, CVPR 2017]
• それらを利用した対話の研究
共参照解析を利用したVisDialモデル

マルチモーダル対話？
• マルチモーダル対話
– 主にユーザからの入力が複数モーダルの情報
• Vision-Aware Dialog
– 環境などに視覚情報を伴うものをさす
– マルチモダリティではある
本講演では
Vision-Aware Dialog
と呼ぶことにします

データセットによる分類

紹介するデータセット一覧
• GuessWhat?!
[de Vries+, CVPR 2017]
• Visual Dialog (VisDial)
[Das+, CVPR 2017]
• Vision-and-Language Navigation (VNL)
[Anderson+, ICCV 2017]
• MNIST Dialog
[Seo+, NIPS 2017]
• Multimodal Dialog (MMD)
[Saha+, AAAI 2018]
• Twitch-FIFA
[Pasunuru+Bansal, EMNLP 2018]
• Talk the Walk
[de Vries+, arXiv 2018]

GuessWhat?!
連続するYes/No型のVQAデータ
Is it a person? No
Is it an item being worn or held? Yes
Is it a snowboard? Yes
Is it the red one? No
Is it the one being held by the Yes
person in blue?
Is it a cow? Yes
Is it the big cow in the middle? No
Is the cow on the left? No
On the right? Yes
First cow near us? Yes

GuessWhat?! の概要
• Questioner
– Guesserが何を見ているのかを知るために質問
• Guesser
– 自分が見ているものに応じてYes/Noで応答
• MS COCOを利用
– 画像数 64,000
– 対話 135,000
– 質問 673,000 Is it a vase? Yes
Is it partially visible? No
Is it in the left corner? No
Is it the turquoise and Yes
purple one?

Visual Dialog (VisDial)
連続する一般的なVQAデータ
Questioner Answerer
A couple of people
in the snow on skis.
[Das+, CVPR 2017]

Visual Dialog (VisDial)
Questioner Answerer
A couple of people
What are their genders?
Are they both adults?
Do they wear goggles?
Do they have hats on?
Are there any other people?
What color is man’s hat?
Is it snowing now?
What is woman wearing?
Are they smiling?
Do you see trees?
1 man 1 woman
Yes
Looks like sunglasses
Man does
No
Black
No
Blue jacket and black pants
Yes
Yes
[Das+, CVPR 2017]

Visual Dialog (VisDial) の概要
• MS COCOが基本
– 12万枚の画像
– 5キャプション/画像
• 1対話/画像を収集
– Amazon Mechanical Turk
– QA形式で10ラウンド
• 2017年12月現在はv0.9
– 画像約12万枚の対話
– 1画像に付き1対話
[Das+, CVPR 2017]

Vision-and-Language Navigation (VNL)
対話行為が移動とナビゲーション

R2R データセット
実世界3次元データ [Chang+, 3DV 2017] を利用
• 90の建造物で総計10,800点のパノラマRGBD画像を収集
• 各点で18方向のRGBD画像を収集→パノラマ化
• 平均2.25m間隔、人の目線の高さ、カメラポーズも記録
この3次元世界を動けるシミュレータを提供
• 観測：3次元位置およびカメラ角度＋主観画像（RGB）
• 行動：隣接地点への移動またはカメラ角度の更新

R2R データセット
Amazon Mechanical Turk で収集
• 7189経路を抽出
– 5m以上離れた2地点
→平均10m程度
– 最低4～6回移動
• 経路あたり3つずつの
インストラクション
（非対話）を収集
– 平均29単語
(類似課題に比べて長め)
– 約3100語彙
(類似課題に比べて少量)

MNIST Dialog
VisDialのMNIST版
• 4x4のMNIST画像（白黒）に
– 文字色5種、背景色5種、スタイル2種を適用
– 画像数 50,000
– 対話 3 dialogs/image
– 質問 10 Q&As/dialog
How many 9’s are there in the image? four
How many brown digits are there among them? one
What is the background color of the digit at the left of it? white
What is the style of the digit? flat
What is the color of the digit at the left of it? blue
What is the number of the blue digit? 4
Are there other blue digits? two
[Seo+, NIPS 2017]

Multimodal Dialog (MMD)
商品推薦を伴うマルチモーダル対話
[Saha+, AAAI 2018]

Multimodal Dialog (MMD) の概要
[Saha+, AAAI 2018]
• Shopper
– 客であり、要望を言語もしくは画像で伝える
• Agent
– 店員であり、要望に言語もしくは画像で応える
• 20人で100万枚画像を用いながら対話
– 対話 150,000
– 平均発話 40/対話
– 画像を伴う発話総計約1,200,000

Twitch-FIFA
ゲーム実況のウェブ配信+多人数チャット
• 49本のビデオ、86時間分
– 入力：20秒のビデオ＋多人数チャット
– 出力：直後10秒の多人数チャット
– 合計15000の対話
[Pasunuru+Bansal, EMNLP 2018]

Talk the Walk
NYCを歩くTouristと目的地へ導くGuide
• 1万程度の発話
– 行動：約44回
– Guide発話：約9回
– Tourist発話：約8回
[de Vries+, arXiv 2018]

その他にも…
• Image-Grounded Conversation [Mostafazadeh+,
IJCNLP 2017]
– VisDialはVQAベース
No
Any huge pumpkins?
No
Do you see trees?
No
Do you see anyone?
That is possible
Do you think it's for Halloween?
Possibly
Is this at a farm?
No
Is the photo close up?
Yes
Is the photo in color?

その他にも…
• Image-Grounded Conversation [Mostafazadeh+,
IJCNLP 2017]
– VisDialはVQAベース
– 雑談の様な対話を収集
Place near my house is getting
ready for Halloween a little early.
Don't you think Halloween
should be year-round, though?
That'd be fun since it's
my favorite holiday!
It‘s my favorite holiday as well!
I never got around to carving a
pumpkin last year even though I
bought one.
Well, it's a good thing that they are
starting to sell them early this year!

その他にも…
• DialEdit [Ramesh+, 2018]
• Video Scene-Aware Dialog Data [Hori+, 2018]
– Dialog System Technology Challenge (DSTC) 7
– VisDialの動画バージョン

目的の一覧
• Vision-Awareな対話をモデル化したい
→VisDial [Das+, CVPR 2017]など
• CVの既存/新規な問題を対話的に解きたい
– 画像キャプション生成
– ロボットのPick&Place
– 画像内のどこを見ているかを共有したい
→GuessWhat?! [de Vries+, CVPR 2017]など
– どの画像を見ているかを共有したい
– 商品の推薦システムを作りたい
→MMD [Saha+, AAAI 2018]など
– 対話的なナビゲーションシステムを作りたい
→VNL [Anderson+, ICCV 2017], Talk the Walk [de Vries+, 2018]など

問題の発展：個人適合キャプション列生成
複数のキャプションで説明しようとすると
• 個人で注目する場所によってふさわしい
キャプションも変わる
• ユーザへの質問を通じて注目個所を獲得
What is the man riding?
Motorcycle
Input image
The man is riding
Motorcycle. It is
white. The motorcycle is
honda.
[Shin+, CVPR 2018]

問題の発展：個人適合キャプション列生成
複数のキャプションで説明しようとすると
• 個人で注目する場所によってふさわしい
キャプションも変わる
• ユーザへの質問を通じて注目個所を獲得
What is the man riding?
Skateboard
The man is riding
skateboard. The man
is skateboarding. The
color of the jacket is red.
Input image
[Shin+, CVPR 2018]

ロボットのPick&Place
[Hatori+, ICRA 2018]

実はComprehensionモデル+α

精度の発展：「行って帰ってくる損失」
• 機械学習
出力キャプション→入力画像を再推定
cf. CycleGAN[Zhu+, ICCV 2017]
変分自己符号化器
[Pu+, NIPS 2016]
出力キャプションで領域検索
[Luo+Shakhnarovich, CVPR 2017]
再掲

Comprehensionに相当する部分
（何をPickするか）

＋αに相当する部分
（どこにPlaceするか）

どの画像を見ているかを共有したい
10 Round のQA後Questionerが画像を当てる
当たれば2エージェント共に勝利（協調）
10
Rounds
[Das+, ICCV 2017]

なぜ言語か
• 一番トリビアルな解:
「Questionerを無視してAnswererが画像や
画像特徴量を送付」
自然言語（シンボル列）はボトルネックとして
作用→トリビアルな解を回避できる
• 人間に理解可能：学習後、どちらかのエー
ジェントと人間が交替して対話できる
Questioner Answerer大人が写っていますか？
（無視）
[Das+, ICCV 2017]

Cooperative Visual Dialog with RL
• 強化学習の活用
– AlphaGo [Silver+, Nature 2016]からの着想
– 画像あてゲームの成功/失敗を報酬として最適化
• Fine-tuneとして採用
– 事前学習はこれまで同様教師あり学習
– フルスクラッチでの強化学習だと…
• 画像と言語の関係やコミュニケーションプロトコルの
学習が困難
• 学習できたとしても、結果としてエージェントが人間
の言葉を喋る可能性は低い
[Das+, ICCV 2017]

提案手法による対話と学習
• 質問𝑞𝑡と応答𝑎 𝑡の生成・理解:
Hierarchical Recurrent Encoder-Decoder
• 画像(特徴量)𝑦の推定値 𝑦の出力:
単層全結合ネットワークによる回帰
[Das+, ICCV 2017]

定性的な対話結果例
Questioner Answerer
A couple of people

SL-Pretrained（教師あり事前学習のみ）
Questioner Answerer
A couple of people
2
I can’t tell
I can’t tell, they are far away
I can’t tell
Yes
Yes
Yes
Yes
Yes
Yes
How many people?
Male or female?
What are they wearing?
What color are skis?
Are they wearing goggles?
Are they wearing hats?

RL-full-QAf（提案手法）
Questioner Answerer
A couple of people
Are people male or female or both?
What are they wearing?
Is it snowing?
Can you see any trees?
Can you see any buildings in background?
Does it look like they are in park?
I think 1 is male, can’t see other 1 ...
They are all wearing snow pants and jackets
It does not appear to be snowing at moment
Yes, there are lot of trees in background
No, I can not
No, I do not see any buildings
No , I do not see any buildings
It does not appear to be

まとめ
視覚×対話の融合研究へ至る道の俯瞰
1. 画像・動画キャプション生成
2. 画像に関する質問への応答
3. Vision-aware Dialog
現在の深層学習技術のショーケース
• 各モダリティのエンコーダ/デコーダ
• 教師あり学習/強化学習
視覚×対話の新たなステージへ

視覚と対話の融合研究

More Related Content

What's hot

Similar to 視覚と対話の融合研究

More from Yoshitaka Ushiku

Recently uploaded

視覚と対話の融合研究

Editor's Notes