Compression of Generative Pre-trained Language Models
via Quantization
Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang,
Xin Jiang, Qun Liu, Ping Luo, Ngai Wong
ACL-2022
HKU & Huawei Noah's Ark Lab
Network Quantization
The quantization process can be defined as
where α is a clipping factor. A good clipping factor is expected to take the majority of the full-precision weights into account when clipping.
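The quantization equation on this slide is an image; as a rough sketch, a symmetric uniform quantizer with a clipping factor α can be written as below. The bit-width handling and rounding details are illustrative assumptions, not necessarily the exact form used in the paper.

```python
import torch

def quantize_symmetric(w: torch.Tensor, alpha: float, bits: int = 2) -> torch.Tensor:
    """Uniformly quantize w to `bits` bits after clipping to [-alpha, alpha]."""
    n_levels = 2 ** (bits - 1) - 1            # e.g. 1 for 2-bit: levels {-alpha, 0, +alpha}
    w_clipped = torch.clamp(w, -alpha, alpha)  # values outside [-alpha, alpha] are saturated
    scale = alpha / n_levels                   # step size between quantization levels
    return torch.round(w_clipped / scale) * scale
```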
Illustration of a quantized
transformer block:
Introduction
Few studies on NLG: most prior compression work targets Natural Language Understanding tasks (TinyBERT, DynaBERT, TernaryBERT, BinaryBERT, etc.), not Natural Language Generation tasks.
Previous quantization methods are not specifically designed for NLP: directly applying them to generative PLMs leads to poor performance.
PACT: Parameterized Clipping Activation for Quantized Neural Networks
Straight-Through Estimator (STE)
Learned Step Size Quantization (LSQ)
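The PACT and LSQ equations are shown as images on these slides. As a hedged illustration of the background only, the sketch below shows a PACT-style activation quantizer with a learnable clipping factor and the usual detach trick for the straight-through estimator; it is not the authors' code, and the level count and initialization are assumptions.

```python
import torch
import torch.nn as nn

class PACTQuant(nn.Module):
    """PACT-style activation quantizer with a learnable clipping factor alpha.

    The straight-through estimator (STE) is implemented with the usual
    `x + (q - x).detach()` trick: the forward pass returns the quantized value,
    while the backward pass treats rounding as the identity.
    """
    def __init__(self, bits: int = 2, init_alpha: float = 1.0):
        super().__init__()
        self.bits = bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_levels = 2 ** self.bits - 1                       # unsigned activation levels
        # PACT clips activations to [0, alpha]; gradient w.r.t. alpha flows where x > alpha.
        x_clipped = torch.minimum(torch.relu(x), self.alpha)
        scale = self.alpha / n_levels
        q = torch.round(x_clipped / scale) * scale
        return x_clipped + (q - x_clipped).detach()         # STE: gradient passes through x_clipped
```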
Method
Difficulty 1: Homogeneous word embeddings
1. The word embeddings of the full-precision model are scattered and distinguishable.
2. PACT, LSQ and LAQ learn homogeneous word embeddings, which are clustered and less distinguishable.
Method
Difficulty 1: Homogeneous word embeddings
The higher the degree of homogeneity in the word embeddings of a quantized model, the fewer
dependencies among different tokens are kept.
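As a rough illustration of what "homogeneous" means here, one can measure the average pairwise cosine similarity of the embedding matrix; a higher average similarity means the token embeddings are more clustered and less distinguishable. This metric is an illustrative proxy, not the analysis used in the paper.

```python
import torch
import torch.nn.functional as F

def embedding_homogeneity(emb: torch.Tensor) -> float:
    """Average pairwise cosine similarity of the rows of an embedding matrix.

    Values close to 1 indicate highly homogeneous (clustered) embeddings.
    `emb` has shape (vocab_size, hidden_dim).
    """
    e = F.normalize(emb, dim=-1)            # unit-norm rows
    sim = e @ e.t()                         # (V, V) cosine similarity matrix
    v = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (v * (v - 1))).item()
```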
Method
Difficulty 1: Homogeneous word embeddings (Sentence Completion example)
Prefix: The Apprentice Boys' parade is an annual celebration by unionists of the relief of the Siege of Derry in 1689, which began when
thirteen young apprentice boys shut the city's gates against the army of King James. At that time the parade was held on 12 August
each year. Participants from across Northern Ireland and Britain marched along the city walls above the <unk>, and were often
openly hostile to the residents. On 30 July 1969 the Derry Citizens Defence Association ( <unk> ) was formed to try to preserve peace
during the period of the parade, and to defend the <unk> and <unk> in the event of an attack. The chairman was Seán Keenan, an
Irish Republican Army ( IRA ) veteran ; the vice @-@ chairman was Paddy Doherty, a popular local man sometimes known as " Paddy
<unk> " and the secretary was Johnnie White, another leading republican and leader of the James Connolly Republican Club. Street
committees were formed under the overall command of the <unk> and barricades were built on the night of 11 August. The parade
took place as planned on 12 August. As it passed through Waterloo Place, on the edge of the <unk>, hostilities began between
supporters and opponents of the parade
LSQ (2-bit): . in, (, in, ( the and (, (, the the, the the, ( (,,, in. (, the in in and ( ( ( (, in. ( in ( in the, in, in the, (, ( in,,,, in, ( the, (, in. ( (,
( (,,, in in, in in the in, the in ( and,,,,, ( in,, ( in,. in the, the the, and and ( ( in and (, and, the, ( (, (,, and ( in in ( the. in the (., the. (,,
in the in,, and the (,,, the in (,, (,,., in in in, in ( in the. in ( in,, the, ( ( ( and. ( ( (,, the, ( and the,,., ( in in, and ( ( in in..,, (, in in in. in in
and and (, ( ( the, (, in (, in (, ( and the (, the in ( in and, ( and (, ( ( and ( ( (
repeated and illogical generation
Method
Difficulty 1: Homogeneous word embeddings (Sentence Completion example)
Prefix: The Apprentice Boys' parade is an annual celebration by unionists of the relief of the Siege of Derry in 1689, which began when
thirteen young apprentice boys shut the city's gates against the army of King James. At that time the parade was held on 12 August
each year. Participants from across Northern Ireland and Britain marched along the city walls above the <unk>, and were often
openly hostile to the residents. On 30 July 1969 the Derry Citizens Defence Association ( <unk> ) was formed to try to preserve peace
during the period of the parade, and to defend the <unk> and <unk> in the event of an attack. The chairman was Seán Keenan, an
Irish Republican Army ( IRA ) veteran ; the vice @-@ chairman was Paddy Doherty, a popular local man sometimes known as " Paddy
<unk> " and the secretary was Johnnie White, another leading republican and leader of the James Connolly Republican Club. Street
committees were formed under the overall command of the <unk> and barricades were built on the night of 11 August. The parade
took place as planned on 12 August. As it passed through Waterloo Place, on the edge of the <unk>, hostilities began between
supporters and opponents of the parade
QuantGPT (ours, 2-bit): , with many called up by the crowd. Seán Keenan was an IRA member and also a war veteran, having
previously been a teacher at the former school. On 17 October 1968, White was fatally shot in the head by a policeman who fired
on White's chest, which left him in a coma for several days. After the war, the parade moved to Holbeche Road, where it
underwent an extensive refurbishment which included installing miniature, " giant " batons, and an anti @-@ tank shell @-@
puddle inside. Four months later, the parade celebrated the death of the Deputy Leader of the Irish Labour Party ( DSPB ), Ga
<unk> MacIntyre, who was fatally shot in front of the parade by members of the DSPB's North Belfast branch, the <unk> Police.
The parades'architecture, architecture and style were among the favourite things the parade, with C. P. McHugh describing the
parade as " totally different from that which took place at the rest of Ireland ". The parade was charged with protecting peace and
public order on the streets from perceived sabotage of police and public houses, and on 12 November 1969, the Official Receiver
of the
Method
Difficulty 1: Homogeneous word embeddings (Summarization example)
repeated and illogical generation
Method
Difficulty 2: Varied Distribution of Weights.
Previous methods:
1. It is hard to find a good initial value for the learnable clipping factor α.
2. They do not consider the effect of weights within [−α, α].
Could we solve these two problems with a new quantizer?
Method
Proposal 1: Token-level Contrastive Distillation
For the i-th token t_i, suppose its hidden states of the last Transformer layer from the quantized and full-precision networks are linearly projected to (h_q^i, h_f^i), and q_f^i is the smoothed representation of h_f^i in the memory bank. The contrastive loss for a length-n sequence is defined as:
where the memory bank updates the representation of each token with a moving average:
Method
Proposal 1: Token-level Contrastive Distillation
We also add a distillation loss L_dist.
The final loss L combines the contrastive loss and the distillation loss:
λ is a trade-off factor, set to 0.1 by default.
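The contrastive loss and memory-bank update are shown as equations on the slides. The sketch below illustrates one plausible reading: an InfoNCE-style loss with temperature `tau`, a momentum memory bank of smoothed teacher token representations, and negatives drawn from other tokens in the sequence. The symbol names, the indexing of the memory bank, the values of `tau` and `m`, and the weighting of the two loss terms are all assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(h_q, q_f, token_ids, memory_bank, tau=0.1, m=0.5):
    """Token-level contrastive loss against a momentum memory bank (sketch).

    h_q:         (n, d) projected hidden states of the quantized (student) network
    q_f:         (n, d) projected hidden states of the full-precision (teacher) network
    token_ids:   (n,) indices of the tokens in this sequence (index into memory_bank)
    memory_bank: (num_entries, d) smoothed teacher representations, updated in place
    tau, m:      assumed temperature and momentum hyperparameters
    """
    h_q = F.normalize(h_q, dim=-1)
    q_f = F.normalize(q_f, dim=-1)

    # Moving-average update of the memory bank entries for the tokens in this sequence.
    with torch.no_grad():
        memory_bank[token_ids] = m * memory_bank[token_ids] + (1.0 - m) * q_f

    keys = F.normalize(memory_bank[token_ids], dim=-1)        # (n, d): positive + in-sequence negatives
    logits = h_q @ keys.t() / tau                             # (n, n) similarities
    targets = torch.arange(h_q.size(0), device=h_q.device)    # the positive is the same token position
    return F.cross_entropy(logits, targets)

# Total training loss (the slide sets lambda = 0.1; which term carries lambda is an assumption):
# loss = dist_loss + 0.1 * cntr_loss
```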
Method
Proposal 2: Module-dependent Dynamic Scaling
Instead of directly learning the clipping factor α as in PACT, we learn a new scaling factor γ.
The gradient of γ:
1. is proportional to the averaged weight magnitude;
2. considers the effect of weights within [−α, α].
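A minimal sketch of one instantiation consistent with the two properties above: learn a per-module scaling factor γ and tie the clipping factor to the module's average weight magnitude, so quantization adapts to each module's weight distribution. The exact parameterization, initialization, and STE formulation in the paper may differ; everything below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class DynamicScaleWeightQuant(nn.Module):
    """Module-dependent dynamic scaling for weight quantization (sketch).

    The clipping factor is alpha = gamma * mean(|W|), so it tracks the weight
    magnitude of each module; only the scale gamma is learned.
    """
    def __init__(self, bits: int = 2, init_gamma: float = 1.0):
        super().__init__()
        self.bits = bits
        self.gamma = nn.Parameter(torch.tensor(init_gamma))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        n_levels = 2 ** (self.bits - 1) - 1
        alpha = self.gamma * w.abs().mean()                          # module-dependent clipping factor
        w_clipped = torch.minimum(torch.maximum(w, -alpha), alpha)   # clip to [-alpha, alpha]
        scale = alpha / n_levels
        q = torch.round(w_clipped / scale) * scale
        return w_clipped + (q - w_clipped).detach()                  # STE for the rounding step
```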
Experiments & Discussions
The Learned Scaling Factor
Number of Negative Samples
Conclusions
1. We study the low-bit quantization of generative PLMs, and find that the difficulty
lies in homogeneous word embeddings and the varied distribution of weights.
2. We propose token-level contrastive learning to learn distinguishable token
embeddings, and a module-dependent dynamic scaling for more accurate
quantization. We name the GPT and BART quantized by our method as QuantGPT
and QuantBART, respectively.
Thanks for listening!
Word Representation Learning in Multimodal
Pre-Trained Transformers: An Intrinsic Evaluation
ACL 2022, Dublin
Sandro Pezzelle, Ece Takmaz, Raquel Fernández
ILLC, University of Amsterdam
s.pezzelle@uva.nl
Overview
How good are the semantic representations by multimodal
pre-trained Transformers such as LXMERT, ViLBERT, etc.?
good = aligned with human semantic intuitions
man, person: similar
dog, airplane: dissimilar
image: ViLBERT (Lu et al. 2019)
Theoretical background
Evidence that meaning of words is multimodal:
• Human concepts are grounded in our senses [Barsalou,
2008; De Vega et al., 2012]
• Sensory-motor experiences play an important role in
determining word meaning [Meteyard et al., 2012]
Theoretical background
Evidence that meaning of words is multimodal:
BANANA
… a banana which is rich in potassium …
… I ate a banana while running …
… the best bananas from Costa Rica …
Previous work
Classic (pre-Transformers) approaches combining
representations from language and vision:
• Advantage of multimodal over text-only representations
[Bruni et al. 2014; Lazaridou et al. 2015; Kiela et al. 2016]
• Advantage typically for concrete — but not abstract —
words [Hill and Korhonen, 2014]
image: Beinborn et al. 2018
image: Beinborn et al. 2018
Multimodal Transformers
• Current state-of-the-art in most language and vision
tasks, e.g., VQA, Visual reasoning, Visual Dialogue, etc.
• Claimed to encode “task agnostic” representations
thanks to their (massive) pre-training
• ‘Extrinsic’ evaluation and probing [Cao et al. 2020;
Parcalabescu et al. 2021; Bugliarello et al. 2021; Thrush et al. 2022]
Research questions
How intrinsically good are the semantic representations by
multimodal Transformers?
Do they align with human semantic intuitions better than
text-only representations?
Method
• Obtain static embeddings from contextualized ones that
are dependent on the actual linguistic context
• Approach used with text-only BERT [Bommasani et al. 2020]
Figure: the target word (e.g., 'donut') is observed in M different contexts; each context (language tokens, plus visual features v1 … v36 for the multimodal models) is encoded, the hidden states of the target word's sub-tokens are taken from a given layer (Layer 1 … Layer L) and mean-pooled, and the resulting vectors are averaged across contexts.
vec(donut) = mean(w_1^donut, w_2^donut, …, w_N^donut)
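A minimal sketch of this aggregation strategy, using Hugging Face `transformers` with text-only BERT as a stand-in for the multimodal encoders (which take <sentence, image> pairs). The contexts, the chosen layer, and the sub-word pooling details are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical contexts for the target word; in the paper these come from COCO and VIST captions.
target = "donut"
contexts = [
    "a man buys a pink donut at the bakery",
    "there is a donut with sprinkles on the plate",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

layer = 8          # which hidden layer to read representations from (a free choice, not prescribed here)
per_context = []
with torch.no_grad():
    for sent in contexts:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer][0]          # (seq_len, dim)
        # Average the sub-word pieces belonging to the target word.
        target_word_idx = sent.split().index(target)
        positions = [i for i, w in enumerate(enc.word_ids()) if w == target_word_idx]
        per_context.append(hidden[positions].mean(dim=0))

# Static embedding: mean over all contextualized occurrences of the word.
static_embedding = torch.stack(per_context).mean(dim=0)        # (dim,)
```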
Experiments
Similarity benchmarks
• RG65, WordSim353, SimLex999, MEN, SimVerb3500
Multimodal models
• LXMERT, UNITER, ViLBERT, VisualBERT + Vokenization
Baselines (language-only models)
• BERT, GloVe
LXMERT (EMNLP 2019)
ViLBERT (NeurIPS 2019)
Vokenization (EMNLP 2020)
Pre-trained models
*initialized with BERT weights
Representations
• Pre-trained models used in inference mode (no fine-tuning)
• Sentences (BERT, Vokenization) or <sentence,image> pairs
(MM models) from COCO [Lin et al. 2014] + VIST [Huang et al. 2016]
‘Intrinsic’ evaluation
Traditional way to evaluate static embeddings:
• Spearman correlation between pairwise similarities
(cosines) and human semantic similarity judgements
• donut, muffin = 0.8
• car, train = 0.5
• dog, airplane = 0.1
• …
Comparison of the
semantic spaces
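The evaluation described above can be sketched as follows; `embeddings` and `benchmark` are assumed data structures, and SciPy's Spearman implementation is used.

```python
import torch
from scipy.stats import spearmanr

def evaluate_similarity(embeddings: dict, benchmark: list) -> float:
    """Spearman correlation between cosine similarities and human judgements.

    embeddings: {word: tensor of shape (dim,)}
    benchmark:  [(word1, word2, human_score), ...], e.g. pairs from MEN or SimLex999
    """
    model_scores, human_scores = [], []
    for w1, w2, gold in benchmark:
        if w1 not in embeddings or w2 not in embeddings:
            continue                                   # skip out-of-vocabulary pairs
        cos = torch.nn.functional.cosine_similarity(embeddings[w1], embeddings[w2], dim=0).item()
        model_scores.append(cos)
        human_scores.append(gold)
    return spearmanr(model_scores, human_scores).correlation
```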
Layer-wise results
Best results: MEN
MEN [Bruni et al. 2014]
y-axis: Spearman's correlation (axis range roughly 0.72 to 0.82)
Best results: SimLex999
SimLex999 [Hill et al. 2015]
y-axis: Spearman's correlation (axis range roughly 0.43 to 0.53)
Results
Concreteness ratings for 40 thousand generally known English word lemmas
Results ~ Concreteness
One counterexample
BERT outperforms both ViLBERT and Vokenization
Results
Discussion
Representations by multimodal pre-trained Transformers:
• Generally highly correlated with human intuitions (outperform
GloVe)
• Outperform BERT only in the more concrete benchmarks
(MEN, RG65) — not in the abstract ones
• Best when visually supervised at the token level (Vokenization)
• Differ between models and across layers: further investigation
Voxel-Informed
Language Grounding
Rudy Corona, Shizhan Zhu, Dan Klein, Trevor Darrell
Language Grounding
“A beige office chair
with 5 wheels”
Object Reference Game
“The beige chair”
“The chair with 5 wheels”
Occlusion
Correspondence
Anchoring Language to 3D
“The beige chair”
“the chair with 5 wheels”
Voxel-informed Language Grounder
“the chair with 5 wheels”
Visio-Linguistic Module
Voxel-Language Module
Scoring Function
0.79
Voxel-informed Language Grounder
“the chair with 5 wheels”
CLIP Language Encoder (Radford et al. 2021): w1 w2 … wn
Voxel Reconstruction Model (Yagubbayli et al. 2021): f1 f2 … fm
Cross-Modal Transformer: input [CLS, w1 … wn, f1 … fm] → output CLS
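A schematic of how the pieces above could be wired together; the module sizes, depths, and the use of the CLS token for scoring are assumptions for illustration, not the released VLG implementation (see the repository linked at the end of this talk). At inference on a reference game, the score is computed for the description paired with each candidate object, and the higher-scoring object is chosen.

```python
import torch
import torch.nn as nn

class VoxelLanguageScorer(nn.Module):
    """Sketch of a voxel-informed grounder: language tokens and voxel features are
    fused by a cross-modal transformer, and a CLS token is scored with an MLP.
    Dimensions and depths here are illustrative placeholders."""

    def __init__(self, d_model: int = 512, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, text_feats: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        """text_feats:  (B, n, d) token features from a language encoder (e.g. CLIP's)
           voxel_feats: (B, m, d) features from a voxel reconstruction model
           returns one compatibility score per (description, object) pair, shape (B,)"""
        b = text_feats.size(0)
        cls = self.cls.expand(b, -1, -1)
        fused = self.fusion(torch.cat([cls, text_feats, voxel_feats], dim=1))
        return self.score(fused[:, 0]).squeeze(-1)       # score read from the CLS position
```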
SNARE
“Beige office chair”
Target Confounder
Thomason et al. 2021
Evaluation Caption Splits
Visual: “Beige chair”, “Office chair”
Blindfolded: “high back with a star-shaped base with wheels”, “can rotate, small arms”
Test Set Performance (Overall Accuracy)
ViLBERT 76.6, MATCH 76.5, LAGOR 77.0, Ours 79.0
Validation Caption Splits (Visual)
ViLBERT 89.5, MATCH 90.6, LAGOR 89.8, Ours 91.2
Validation Caption Splits (Blindfolded)
ViLBERT 76.6, MATCH 75.7, LAGOR 75.3, Ours 78.4
Conclusion
Thanks!
https://github.com/rcorona/voxel_informed_language_grounding
There’s a Time and Place for
Reasoning Beyond the Image
Xingyu Fu, Ben Zhou*, Ishaan Preetam Chandratreya*, Carl Vondrick, Dan Roth
Motivation: Images are Everywhere
Motivation
What can we know from a
single image?
Explicit Information
● People
● Text
● Masks
● Clothing
● Digital Screens
● …
Related Work
● Existing datasets focus on local reasoning within images
Motivation
What can we know from a single image?
Explicit Info
● Digital screens
● Text
● Masks
…
⇒ Implicit Info
● Location
● Event
● Time
● …
Can models know Time and Location from a single image?
Potential Joint Reasoning
Contribution
■ New task: TARA (Time and plAce for Reasoning beyond the imAge), which requires identifying the time and location of an image.
■ Dataset with 15,429 images plus 61,325 additional weak-supervision images.
■ Our baseline models close only a small part of the roughly 70% gap between the SOTA model and human performance, highlighting avenues for future research in vision-language joint open-ended reasoning with world knowledge.
□ Task & Evaluation
□ Data Collection & Analysis
□ Baseline & Experiments
TARA (Time and plAce for Reasoning beyond the imAge)
Task Definition
Image
Question: What is the time/location for this image?
Answer:
• 2021, April
• New York, United States, North America
Evaluation Metrics
● Hierarchical Labels
○ “Philadelphia, Pennsylvania, United States, North America” ⇒
[“Philadelphia, Pennsylvania, United States, North America”, “United States, North America”,
“North America”]
○ “1967-7-14” ⇒
[“1967-7-14”, “1967-7”, “1967”, “1960s”, “20th century”].
● Evaluation Metrics
○ Accuracy
○ Example-F1
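A sketch of the hierarchical label expansion and an example-based F1, following the examples on this slide. The exact hierarchy levels and scoring rules used by the benchmark may differ (for instance, the location example above skips the state level, while the sketch keeps every suffix).

```python
def expand_time(label: str) -> list:
    """'1967-7-14' -> ['1967-7-14', '1967-7', '1967', '1960s', '20th century']"""
    parts = label.split("-")
    year = int(parts[0])
    expanded = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]   # date, month, year
    expanded.append(f"{year // 10 * 10}s")                               # decade
    expanded.append(f"{year // 100 + 1}th century")                      # century (suffix simplified)
    return expanded

def expand_location(label: str) -> list:
    """'Philadelphia, Pennsylvania, United States, North America' -> every suffix of the hierarchy."""
    parts = [p.strip() for p in label.split(",")]
    return [", ".join(parts[i:]) for i in range(len(parts))]

def example_f1(pred: str, gold: str, expand) -> float:
    """F1 between the expanded label sets of one prediction and one gold answer."""
    p, g = set(expand(pred)), set(expand(gold))
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# e.g. example_f1("1967-8-1", "1967-7-14", expand_time) credits the shared year, decade, and century.
```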
□ Task & Evaluation
□ Data Collection & Analysis
□ Baseline & Experiments
TARA (Time and plAce for Reasoning beyond the imAge)
Dataset Collection
• 15,429 images (94% kept after validation)
• 61,325 weak supervision Wikipedia images
Dataset Analysis
□ Task & Evaluation
□ Data Collection & Analysis
□ Baseline & Experiments
TARA (Time and plAce for Reasoning beyond the imAge)
Experiments: Human vs. Model
● Test Set of Interest:
○ Eliminates annotator bias
○ Guaranteed to be out of CLIP (Radford et al., 2021) pretraining data
Experiments: Human vs. SOTA (Location and Time)
CLIP (Radford et al., 2021): SOTA
CLIP+: fine-tuned on TARA
CLIP+WIT: fine-tuned on TARA and WIT
CLIP+Seg: uses object & face segments
Models are evaluated on all data; human performance is reported on the Test Set of Interest.
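As a rough illustration of the zero-shot CLIP baseline, an image can be scored against textual renderings of candidate labels. The prompt template, candidate set, and file name below are invented for illustration; fine-tuning (CLIP+, CLIP+WIT) and the segmentation variant are not shown.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical candidate locations; TARA's label space is hierarchical and much larger.
candidates = ["New York, United States, North America",
              "Paris, France, Europe",
              "Tokyo, Japan, Asia"]
prompts = [f"A photo taken in {c}." for c in candidates]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)       # cosine similarity per candidate

print(candidates[scores.argmax().item()])                # highest-scoring location label
```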
Thank you!
Code: https://github.com/zeyofu/TARA