🦩 Flamingo: a Visual Language
Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi,
Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian
Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira,
Oriol Vinyals, Andrew Zisserman, Karén Simonyan, NeurIPS2022
2023/05/11
■ Flamingo
• A Visual Language Model (VLM) from DeepMind
• Builds on a pretrained Large Language Model (LLM)
• Adapts to new vision-language tasks from a handful of prompted examples (few-shot)
• Released at three scales: Flamingo-3B/9B/80B
Input Prompt → Completion (Figure 1 example)
"This is a chinchilla. They are mainly found in Chile. This is a shiba. They are very popular in Japan. This is" → "a flamingo. They are found in the Caribbean and South America."
Visual Question Answering (Figure 1 examples)
• "This is a chinchilla. They are mainly found in Chile. This is a shiba. They are very popular in Japan. This is" → "a flamingo. They are found in the Caribbean and South America."
• Painting QA: "What is the title of this painting? Answer: The Hallucinogenic Toreador. Where is this painting displayed? Answer: Louvres Museum, Paris. What is the name of the city where this was painted? Answer:" → "Arles."
• Text reading: Output: "Underground"; Output: "Congress"; → Output: "Soulomes"
• Arithmetic: 2+1=3; 5+6=11; → 3x6=18
• Captioning: Output: "A propaganda poster depicting a cat dressed as French emperor Napoleon holding a piece of cheese."; Output: "A pink room with a flamingo pool float."; → Output: "A portrait of Salvador Dali with a robot head."
Visual Dialog (Figure 1 examples)
• "What happens to the man after hitting the ball? Answer:" → "he falls down."
• "This is a picture of two teddy bears on the moon." "What are they doing?" "They are having a conversation." "What object are they using?" "It looks like a computer." "Is this surprising?" "Yes, it is surprising." "Why is this picture surprising to you?" "I think it is surprising because teddy bears are not usually found on the moon."
• "What is the common thing about these three images?" "They are all flamingos." "What is the difference between these three images?" "The first one is a cartoon, the second one is a real flamingo, and the third one is a 3D model of a flamingo."
• "This is an apple with a sticker on it." "What does the sticker say?" "The sticker says 'iPod'." "Where is the photo taken?" "It looks like it's taken in a backyard." "Do you think it is printed or handwritten?" "It looks like it's handwritten." "What color is the sticker?" "It's white."
• "This is a cityscape. It looks like Chicago." "What makes you think this is Chicago?" "I think it's Chicago because of the Shedd Aquarium in the background." "What about this one? Which city is this and what famous landmark helped you recognise the city?" "This is Tokyo. I think it's Tokyo because of the Tokyo Tower."
Figure 1: Selected examples of inputs and outputs obtained from Flamingo-80B.
(Figure 3 diagram: interleaved input "<image> This is a very cute dog.<image> This is" flows through frozen ❄ Vision Encoders and trained-from-scratch Perceiver Resamplers into GATED XATTN-DENSE layers (also trained from scratch), interleaved with frozen ❄ LM blocks, producing the output text "a very serious cat.")
Figure 3: Flamingo architecture overview. Flamingo is a family of visual language models (VLMs)
that take as input visual data interleaved with text and produce free-form text as output.
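To make the data flow in Figure 3 concrete, here is a runnable toy sketch of the forward pass. Component names follow the slides, but every size, stub, and helper is our own illustrative assumption (numpy, random weights), not the authors' implementation.

# Toy end-to-end forward pass in the spirit of Figure 3 (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
d, R, n_blocks = 16, 8, 2                    # toy sizes, not the paper's

def attend(q, k, v):                         # plain softmax attention
    a = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(a - a.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ v

def vision_encoder(img):                     # [frozen] stub: image -> [S, d]
    return rng.standard_normal((4, d))

latents = rng.standard_normal((R, d))        # learned latent queries

def perceiver_resampler(feats):              # [trained] compress to R tokens
    kv = np.concatenate([feats, latents])
    return latents + attend(latents, kv, kv)

alpha = 0.0                                  # tanh gate, init at 0

def gated_xattn_dense(y, vis):               # [trained] inject vision into text
    return y + np.tanh(alpha) * attend(y, vis, vis)

def lm_block(y):                             # [frozen] stub LM self-attention
    return y + attend(y, y, y)

def flamingo_forward(images, text_embeddings):
    vis = np.concatenate([perceiver_resampler(vision_encoder(i)) for i in images])
    y = text_embeddings
    for _ in range(n_blocks):                # GATED XATTN-DENSE before each block
        y = gated_xattn_dense(y, vis)
        y = lm_block(y)
    return y                                 # a frozen LM head would map to logits

out = flamingo_forward([None, None], rng.standard_normal((5, d)))
print(out.shape)                             # (5, 16)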
Perceiver Resampler
(Diagram: per-timestep Vision Encoder features X_f for t=0,1,2 receive time embeddings (+), are flattened, and a fixed set of learned latent queries X attends to them; each of num_layers blocks applies Attention with Q=[X], K=V=[X_f, X], then an FFW.)
def perceiver_resampler(
    x_f,              # The [T, S, d] visual features (T=time, S=space)
    time_embeddings,  # The [T, 1, d] time pos embeddings.
    x,                # R learned latents of shape [R, d]
    num_layers,       # Number of layers
):
    """The Perceiver Resampler model."""
    # Add the time position embeddings and flatten.
    x_f = x_f + time_embeddings
    x_f = flatten(x_f)  # [T, S, d] -> [T * S, d]
    # Apply the Perceiver Resampler layers.
    for i in range(num_layers):
        # Attention.
        x = x + attention_i(q=x, kv=concat([x_f, x]))
        # Feed forward.
        x = x + ffw_i(x)
    return x
(Learned latent queries follow the Perceiver [Jaegle+, ICML2021].)
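A quick, runnable check of the pseudocode's shape behaviour (numpy; the attention is real softmax attention, the FFW is a single-matrix stand-in, and all sizes are toy assumptions): the output is always [R, d] visual tokens, whether the input is a single image (T=1) or a long video.

# Minimal runnable instance of the pseudocode above (assumed toy modules).
import numpy as np

rng = np.random.default_rng(0)
d, R, num_layers = 16, 8, 2                       # toy sizes
latents = rng.standard_normal((R, d))
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(num_layers)]

def softmax_attention(q, kv):
    a = q @ kv.T / np.sqrt(d)
    a = np.exp(a - a.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ kv

def perceiver_resampler(x_f, time_embeddings):
    x_f = (x_f + time_embeddings).reshape(-1, d)  # flatten to [T*S, d]
    x = latents
    for i in range(num_layers):
        x = x + softmax_attention(q=x, kv=np.concatenate([x_f, x]))
        x = x + np.tanh(x @ W[i])                 # stand-in FFW
    return x

for T in (1, 3, 16):                              # image or video input
    out = perceiver_resampler(rng.standard_normal((T, 4, d)),
                              rng.standard_normal((T, 1, d)))
    print(T, out.shape)                           # always (8, 16)

This fixed token budget per image/video is what keeps the LM-side cross-attention cost independent of the number of input frames.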
GATED XATTN-DENSE
■ Layers that inject visual input into the frozen LLM
• Inserted in front of each frozen LM layer's self attention + FFW
(Diagram: language input Y queries the vision input X through cross attention (Q=[Y], K=V=[X]) with tanh gating, then a gated FFW (Q=[Y]); the frozen ❄ LM layer's self attention (K=V=Q=[Y]) and FFW follow.)
def gated_xattn_dense(
    y,            # input language features
    x,            # input visual features
    alpha_xattn,  # xattn gating parameter – init at 0.
    alpha_dense,  # ffw gating parameter – init at 0.
):
    """Applies a GATED XATTN-DENSE layer."""
    # 1. Gated Cross Attention
    y = y + tanh(alpha_xattn) * attention(q=y, kv=x)
    # 2. Gated Feed Forward (dense) Layer
    y = y + tanh(alpha_dense) * ffw(y)
    # Regular self-attention + FFW on language
    y = y + frozen_attention(q=y, kv=y)
    y = y + frozen_ffw(y)
    return y  # output visually informed language features
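Because both gates are initialised at 0 and tanh(0) = 0, the two gated branches vanish at initialisation, so the layer initially reduces to the frozen LM computation and Flamingo starts training at exactly the frozen LM's behaviour; Figure 6 later shows the gates opening as training progresses. A minimal numerical check of the two gated branches (numpy stand-ins, ours):

# tanh gating at init: alpha = 0 makes both inserted branches a no-op,
# leaving only what the frozen LM block would compute. Toy check:
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal((5, 16))        # language features
x = rng.standard_normal((8, 16))        # visual features
alpha_xattn = alpha_dense = 0.0         # init value from the paper

def attend(q, kv):                      # stand-in cross attention
    a = q @ kv.T / 4.0
    a = np.exp(a - a.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ kv

out = y + np.tanh(alpha_xattn) * attend(y, x)
out = out + np.tanh(alpha_dense) * np.tanh(out @ rng.standard_normal((16, 16)))
assert np.allclose(out, y)              # gated branches are identity at init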
Frozen pretrained components
■ Vision Encoder
• Normalizer-Free ResNet [Brock, arXiv2021]
• Pretrained contrastively on image-text pairs, then kept frozen (see the sketch below)
• Related: ALIGN [Jia+, ICML2021]
• Two-term contrastive loss from CLIP [Radford+, ICML2021]
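For reference, a minimal sketch of the symmetric two-term image-text contrastive objective in the style of CLIP/ALIGN: matched pairs on the diagonal are pulled together, all other pairings pushed apart. The batch size, embedding size, and temperature here are our own toy assumptions, not the paper's values.

# CLIP/ALIGN-style symmetric contrastive loss over a batch of pairs (toy).
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 16
img = rng.standard_normal((B, d)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.standard_normal((B, d)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07                      # temperature 0.07 (assumed)

def cross_entropy(logits, targets):
    logits = logits - logits.max(-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

labels = np.arange(B)                            # i-th image matches i-th text
loss = 0.5 * (cross_entropy(logits, labels)      # image -> text term
              + cross_entropy(logits.T, labels)) # text -> image term
print(loss)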
■ LLM
• Chinchilla [Hoffmann+, arXiv2022]
• A decoder-only Transformer LLM
• Kept frozen throughout Flamingo training
• Flamingo-3B/9B/80B build on frozen 1.4B/7B/70B Chinchilla models, respectively
Training Flamingo
■ Training data
• ALIGN [Jia+, ICML2021]: 1.8B image-text pairs
• Three datasets collected for this work:
• M3W (built following MassiveWeb [Rae+, arXiv2021]): 43M interleaved image/text HTML webpages
• LTIP (Long Text & Image Pairs): 312M image-text pairs
• VTP (Video & Text Pairs): 27M short videos paired with text
• Objective: minimize the weighted sum of per-dataset negative log-likelihoods

$$\sum_{m=1}^{M} \lambda_m \, \mathbb{E}_{(x, y) \sim \mathcal{D}_m}\!\left[ -\sum_{\ell=1}^{L} \log p\!\left(y_\ell \mid y_{<\ell},\, x_{\leq \ell}\right) \right]$$

where $y_\ell$ is the $\ell$-th text token, $x_{\leq \ell}$ are the images/videos preceding it in the sequence, and $\lambda_m$ is the weight of dataset $\mathcal{D}_m$.
■ Benchmarks
• VQA: TextVQA [Singh+, CVPR2019], NextQA [Xiao+, CVPR2021]
• Visual Dialog: VisDial [Das+, CVPR2017]
• Vision-and-text classification: HatefulMemes [Kiela+, NeurIPS2020]
• 16 benchmarks in total; 5 DEV benchmarks guide design choices, the other 11 are held out
■ Number of shots
• Flamingo 3B/9B/80B, each evaluated zero-shot and few-shot
• Few-shot: task examples are given only in the prompt
• No fine-tuning of any weights in this comparison
• Fine-tuning is evaluated separately (Table 2)
• Vision Encoder
(Slide shows the paper's architecture-details table; surviving caption fragment: "…forward MLP is 4D. L: number of layers, D: transformer hidden size, H: number of heads, Act.: FFW activation, Sq. ReLU: Squared ReLU [104].")
Zero/Few-shot results
■ Comparison to zero/few-shot SoTA (Flamingo rows use no fine-tuning ✗; prior SoTA columns show score (shots) [ref], and the fine-tuned SoTA ✓ column shows score [ref] (training-set size))

| Benchmark | Zero/Few-shot SOTA | Flamingo-3B 0 / 4 / 32 shots | Flamingo-9B 0 / 4 / 32 shots | Flamingo-80B 0 / 4 / 32 shots | Fine-tuned SOTA |
|---|---|---|---|---|---|
| OKVQA (I) | 43.3 (16) [34] | 41.2 / 43.3 / 45.9 | 44.7 / 49.3 / 51.0 | 50.6 / 57.4 / 57.8 | 54.4 [34] (10K) |
| VQAv2 (I) | 38.2 (4) [114] | 49.2 / 53.2 / 57.1 | 51.8 / 56.3 / 60.4 | 56.3 / 63.1 / 67.6 | 80.2 [140] (444K) |
| COCO (I) | 32.2 (0) [124] | 73.0 / 85.0 / 99.0 | 79.4 / 93.1 / 106.3 | 84.3 / 103.2 / 113.8 | 143.3 [124] (500K) |
| MSVDQA (V) | 35.2 (0) [58] | 27.5 / 33.0 / 42.6 | 30.2 / 36.2 / 47.2 | 35.6 / 41.7 / 52.3 | 47.9 [28] (27K) |
| VATEX (V) | - | 40.1 / 50.0 / 59.2 | 39.5 / 51.7 / 57.4 | 46.7 / 56.0 / 65.1 | 76.3 [153] (500K) |
| VizWiz (I) | - | 28.9 / 34.0 / 45.5 | 28.8 / 34.9 / 44.0 | 31.6 / 39.6 / 49.8 | 57.2 [65] (20K) |
| Flick30K (I) | - | 60.6 / 72.0 / 71.2 | 61.5 / 72.6 / 72.8 | 67.2 / 75.1 / 75.4 | 67.4 [150] (30K) |
| MSRVTTQA (V) | 19.2 (0) [58] | 11.0 / 14.9 / 25.6 | 13.7 / 18.2 / 29.4 | 17.4 / 23.9 / 31.0 | 46.8 [51] (130K) |
| iVQA (V) | 12.2 (0) [135] | 32.7 / 35.7 / 37.7 | 35.2 / 37.7 / 40.7 | 40.7 / 44.1 / 45.3 | 35.4 [135] (6K) |
| YouCook2 (V) | - | 55.8 / 64.6 / 76.7 | 55.0 / 70.8 / 77.3 | 60.1 / 74.5 / 86.8 | 138.7 [132] (10K) |
| STAR (V) | 39.4 (0) [143] | 39.6 / 41.3 / 41.6 | 41.8 / 42.8 / 41.2 | 39.7 / 42.4 / 42.2 | 36.7 [128] (46K) |
| VisDial (I) | 11.6 (0) [79] | 46.1 / 47.3 / 47.3 | 48.0 / 50.4 / 50.4 | 52.0 / 55.6 / 55.6 | 75.2 [79] (123K) |
| TextVQA (I) | - | 30.1 / 32.7 / 30.6 | 31.8 / 33.6 / 32.6 | 35.0 / 36.5 / 37.9 | 54.7 [137] (20K) |
| NextQA (I) | - | 21.3 / 22.4 / 26.1 | 23.0 / 24.7 / 28.4 | 26.7 / 30.8 / 33.5 | 25.2 [129] (38K) |
| HatefulMemes (I) | 66.1 (0) [85] | 53.7 / 53.6 / 56.3 | 57.0 / 62.7 / 63.5 | 46.4 / 68.6 / 70.0 | 79.1 [62] (9K) |
| RareAct (V) | 40.7 (0) [85] | 58.4 / - / - | 57.9 / - / - | 60.8 / - / - | - |
Table 1: Comparison to the state of the art. A single Flamingo model reaches the state of the art on a wide array of image (I) and video (V) understanding tasks with few-shot learning, significantly outperforming previous best zero- and few-shot methods with as few as four examples.
Few-shot vs. fine-tuned models
■ Few-shot Flamingo vs. fine-tuned models
• Few-shot Flamingo outperforms fine-tuned SoTA on 6 of the 16 tasks
Figure 2: Flamingo results overview. Left: Our largest model, dubbed Flamingo, outperforms state-of-the-art fine-tuned models on 6 of the 16 tasks we consider, with no fine-tuning.
■ Flamingo
• An 80B-parameter Visual Language Model
• Builds on the frozen LLM Chinchilla
• Strong zero-shot and few-shot performance across image and video benchmarks
(a) Attention tanh gating (b) FFW tanh gating.
Figure 6: Evolution of the absolute value of the tanh gating at different layers of Flamingo-3B.
<BOS> Cute pics of my pets!<EOC><image>My puppy sitting in the grass. <EOC><image>My cat looking very dignified.<EOC>
Masked cross attention
<BOS>Cute pics of my pets!<EOC><image>My puppy sitting in the grass.<EOC><image> My cat looking very dignified.<EOC>
tokenization
Vision
Encoder
Perceiver
Resampler
Vision
Encoder
Perceiver
Resampler
K=V=[X]
Q
Image 1 Image 2
Processed text: <image> tags are inserted and special tokens are added
Cute pics of my pets!
My puppy sitting in the
grass.
My cat looking very
dignified.
Input webpage
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Figure 7: Interleaved visual data and text support. Given text interleaved with images/videos,
e.g. coming from a webpage, we first process the text by inserting <image> tags at the locations of
the visual data in the text as well as special tokens (<BOS> for "beginning of sequence" or <EOC> for
"end of chunk"). Images are processed independently by the Vision Encoder and Perceiver Resampler
to extract visual tokens. At a given text token, the model only cross-attends to the visual tokens
corresponding to the last preceding image/video. The per-token index indicates which image/video a text token can
attend to, or 0 when no image/video precedes it. In practice, this selective cross-attention is achieved
through masking – illustrated here with the dark blue entries (unmasked/visible) and light blue entries (masked).
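A small sketch of how the per-token index row of Figure 7 becomes the cross-attention mask; R, the helper names, and the toy sequence length are ours:

# Build the masked cross-attention pattern of Figure 7: each text token may
# attend only to the visual tokens of the last preceding image (index 0
# means "no preceding image", so that row is fully masked).
import numpy as np

# per-token image indices, as in the figure (0/1/2):
phi = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
num_images, R = 2, 4                      # R visual tokens per image (64 in the paper)

# visual token v belongs to image (v // R) + 1
img_of_visual = np.arange(num_images * R) // R + 1   # [num_images * R]
mask = phi[:, None] == img_of_visual[None, :]        # [T_text, T_visual]
print(mask.astype(int))
# Even though a token directly attends to one image at a time, earlier images
# still influence it indirectly through the LM's self-attention over text.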
Fine-tuning
| Method | VQAv2 (test-dev / test-std) | COCO (test) | VATEX (test) | VizWiz (test-dev / test-std) | MSRVTTQA (test) | VisDial (valid / test-std) | YouCook2 (valid) | TextVQA (valid / test-std) | HatefulMemes (test seen) |
|---|---|---|---|---|---|---|---|---|---|
| Flamingo, 32 shots | 67.6 / - | 113.8 | 65.1 | 49.8 / - | 31.0 | 56.8 / - | 86.8 | 36.0 / - | 70.0 |
| Flamingo, fine-tuned | 82.0 / 82.1 | 138.1 | 84.2 | 65.7 / 65.4 | 47.4 | 61.8 / 59.7 | 118.6 | 57.1 / 54.1 | 86.6 |
| SotA | 81.3† [133] / 81.3† [133] | 149.6† [119] | 81.4† [153] | 57.2† [65] / 60.6† [65] | 46.8 [51] | 75.2 [79] / 75.4† [123] | 138.7 [132] | 54.7 [137] / 73.7 [84] | 84.6† [152] |
Table 2: Comparison to SotA when fine-tuning Flamingo. We fine-tune Flamingo on all nine
tasks where Flamingo does not achieve SotA with few-shot learning. Flamingo sets a new SotA on
five of them, outperforming methods (marked with †) that use tricks such as model ensembling or
domain-specific metric optimisation (e.g., CIDEr optimisation).
Ablation studies on Flamingo-3B:

| Ablated setting | Original value | Changed value | Param. count | Step time | COCO CIDEr↑ | OKVQA top1↑ | VQAv2 top1↑ | MSVDQA top1↑ | VATEX CIDEr↑ | Overall score↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Flamingo-3B model | | | 3.2B | 1.74s | 86.5 | 42.1 | 55.8 | 36.3 | 53.4 | 70.7 |
| (i) Training data | All data | w/o Video-Text pairs | 3.2B | 1.42s | 84.2 | 43.0 | 53.9 | 34.5 | 46.0 | 67.3 |
| (i) Training data | All data | w/o Image-Text pairs | 3.2B | 0.95s | 66.3 | 39.2 | 51.6 | 32.0 | 41.6 | 60.9 |
| (i) Training data | All data | Image-Text pairs → LAION | 3.2B | 1.74s | 79.5 | 41.4 | 53.5 | 33.9 | 47.6 | 66.4 |
| (i) Training data | All data | w/o M3W | 3.2B | 1.02s | 54.1 | 36.5 | 52.7 | 31.4 | 23.5 | 53.4 |
| (ii) Optimisation | Accumulation | Round Robin | 3.2B | 1.68s | 76.1 | 39.8 | 52.1 | 33.2 | 40.8 | 62.9 |
| (iii) Tanh gating | ✓ | ✗ | 3.2B | 1.74s | 78.4 | 40.5 | 52.9 | 35.9 | 47.5 | 66.5 |