1. LongT5: Efficient Text-To-Text Transformer for Long Sequences
San Kim
2022.10.25
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung,
Yinfei Yang
Google Research
2. Background
Either (1) increasing the input length or (2) increasing the model size can improve the performance of
Transformer-based neural models.
• A new Transformer architecture, LongT5, that allows scaling both the input length and the model size at the same time.
• A new attention mechanism (TGlobal), which mimics ETC's local/global attention mechanism but is a drop-in replacement for regular attention in existing Transformer architectures like T5 (see the sketch after this list).
• SOTA results on the arXiv, PubMed, BigPatent, MediaSum, and TriviaQA datasets.
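To make the TGlobal idea concrete, below is a minimal NumPy sketch of one attention step in that style: each token attends to a small local window plus "transient global" tokens built on the fly from block summaries of the input. This is an illustration only, not the paper's implementation; the single head, the use of block means instead of summed-and-normalized blocks, the absence of relative position biases, and the values of r and block_size are all simplifying assumptions.

```python
# Minimal sketch of a TGlobal-style attention step in NumPy (illustrative only).
# Assumptions (not from the slides): single head, block means as the
# "transient global" summaries, no relative position biases.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tglobal_attention(q, k, v, r=2, block_size=4):
    """q, k, v: [seq_len, d]. Returns [seq_len, d]."""
    n, d = q.shape
    # Transient global tokens: mean of each block of the keys/values,
    # recomputed on the fly (no extra parameters or side inputs).
    n_blocks = int(np.ceil(n / block_size))
    pad = n_blocks * block_size - n
    k_pad = np.pad(k, ((0, pad), (0, 0)))
    v_pad = np.pad(v, ((0, pad), (0, 0)))
    g_k = k_pad.reshape(n_blocks, block_size, d).mean(axis=1)
    g_v = v_pad.reshape(n_blocks, block_size, d).mean(axis=1)

    out = np.zeros_like(q)
    for i in range(n):
        # Local window of radius r around token i, plus all global tokens.
        lo, hi = max(0, i - r), min(n, i + r + 1)
        keys = np.concatenate([k[lo:hi], g_k], axis=0)
        vals = np.concatenate([v[lo:hi], g_v], axis=0)
        scores = keys @ q[i] / np.sqrt(d)
        out[i] = softmax(scores) @ vals
    return out

# Toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8)).astype(np.float32)
y = tglobal_attention(x, x, x)
print(y.shape)  # (10, 8)
```

Because the global tokens are recomputed from the input itself, an attention layer of this shape needs no extra side inputs, which is what makes it a drop-in replacement for regular attention in a T5-style model, as the bullet above highlights.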
5. ETC (Extended Transformer Construction)
Objectives
1. Masked Language Model (MLM)
2. Contrastive Predictive Coding (CPC)
6. CPC (Contrastive Predictive Coding)
Motivation and Intuition
The main intuition behind our model is to learn the representations that encode the underlying
shared information between different parts of the (high-dimensional) signal.
At the same time, it discards low-level information and noise that is more local.
In time series and high-dimensional modeling, approaches that use next step prediction exploit the
local smoothness of the signal. When predicting further in the future, the amount of shared
information becomes much lower, and the model needs to infer more global structure.
These ‘slow features’ that span many time steps are often more interesting (e.g., phonemes and
intonation in speech, objects in images, or the story line in books).
7. CPC (Contrastive Predictive Coding)
Motivation and Intuition
One of the challenges of predicting high-dimensional data is that unimodal losses such as
mean-squared error and cross-entropy are not very useful, and powerful conditional generative
models which need to reconstruct every detail in the data are usually required. But these models
are computationally intense, and waste capacity modeling the complex relationships in the data x,
often ignoring the context c.
This suggests that modeling p(x | c) directly may not be optimal for the purpose of extracting shared
information between x and c. When predicting future information we instead encode the target x
(future) and context c (present) into compact distributed vector representations (via non-linear
learned mappings) in a way that maximally preserves the mutual information of the original signals
x and c, defined as

I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}.
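As a quick numeric illustration of the definition above, the snippet below evaluates I(x; c) for a small made-up discrete joint distribution; the probability values are arbitrary and only meant to show how the sum over p(x, c) log [p(x | c) / p(x)] is computed.

```python
# Toy check of the mutual-information definition above for a small
# discrete joint distribution p(x, c). Values are made up for illustration.
import numpy as np

p_xc = np.array([[0.30, 0.10],    # rows: x, columns: c
                 [0.05, 0.55]])
p_x = p_xc.sum(axis=1, keepdims=True)   # marginal p(x)
p_c = p_xc.sum(axis=0, keepdims=True)   # marginal p(c)
p_x_given_c = p_xc / p_c                # conditional p(x | c)

# I(x; c) = sum_{x,c} p(x, c) * log( p(x | c) / p(x) )
mi = np.sum(p_xc * np.log(p_x_given_c / p_x))
print(f"I(x; c) = {mi:.4f} nats")
```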
8. CPC (Contrastive Predictive Coding)
f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}

f_k(x_{t+k}, c_t) = \exp\left( z_{t+k}^{\top} W_k c_t \right), with a different W_k for every step k.
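The density-ratio estimator f_k above is what the InfoNCE objective from the CPC paper trains: for each context c_t, the matching future code z_{t+k} should score higher than other samples. The sketch below implements the log-bilinear score z_{t+k}^T W_k c_t and an InfoNCE-style loss in NumPy; treating the other items in the batch as negatives and the specific shapes are illustrative assumptions, not details taken from these slides.

```python
# Sketch of the log-bilinear score f_k and an InfoNCE-style objective.
# Batch-as-negatives setup and shapes are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def info_nce_loss(z_future, c_present, W_k):
    """z_future: [B, d_z] encoded targets z_{t+k};
    c_present: [B, d_c] context vectors c_t;
    W_k: [d_z, d_c] bilinear map for prediction step k."""
    # logits[i, j] = z_i^T W_k c_j = log f_k for candidate z_i and context c_j
    logits = z_future @ W_k @ c_present.T        # [B, B]
    # For each context c_t (column), the matching z_{t+k} (diagonal) is the
    # positive; the other samples in the batch act as negatives.
    log_probs = np.log(softmax(logits, axis=0))
    return -np.mean(np.diag(log_probs))

# Toy usage with random vectors
rng = np.random.default_rng(0)
B, d_z, d_c = 8, 16, 16
z = rng.normal(size=(B, d_z))
c = rng.normal(size=(B, d_c))
W = rng.normal(size=(d_z, d_c)) * 0.1
print(info_nce_loss(z, c, W))
```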