LongT5: Efficient Text-To-Text Transformer for
Long Sequences
San Kim
2022.10.25
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung,
Yinfei Yang
Google Research
Background
Either (1) increasing the input length or (2) increasing the model size can improve the performance of
Transformer-based neural models.
• A new Transformer architecture, LongT5, that allows for scaling both input length and model size at the same time.
• A new attention mechanism (TGlobal), which mimics ETC's local/global mechanism but is a drop-in replacement for regular attention in existing Transformer architectures like T5.
• SOTA results on the arXiv, PubMed, BigPatent, MediaSum, and TriviaQA datasets.
An illustration of mechanisms to scale attention to long inputs
ETC (Extended Transformer Construction)
Objectives
1. Masked Language Model (MLM)
2. Contrastive Predictive Coding (CPC)
CPC (Contrastive Predictive Coding)
Motivation and Intuition
The main intuition behind our model is to learn the representations that encode the underlying shared information between different parts of the (high-dimensional) signal. At the same time, it discards low-level information and noise that is more local.
In time series and high-dimensional modeling, approaches that use next-step prediction exploit the local smoothness of the signal. When predicting further into the future, the amount of shared information becomes much lower, and the model needs to infer more global structure. These "slow features" that span many time steps are often more interesting (e.g., phonemes and intonation in speech, objects in images, or the story line in books).
CPC (Contrastive Predictive Coding)
Motivation and Intuition
One of the challenges of predicting high-dimensional data is that unimodal losses such as mean-squared error and cross-entropy are not very useful, and powerful conditional generative models, which need to reconstruct every detail in the data, are usually required. But these models are computationally intense and waste capacity on modeling the complex relationships in the data $x$, often ignoring the context $c$.
This suggests that modeling $p(x \mid c)$ directly may not be optimal for the purpose of extracting shared information between $x$ and $c$. When predicting future information, we instead encode the target $x$ (future) and context $c$ (present) into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals $x$ and $c$, defined as
$$I(x; c) = \sum_{x,c} p(x, c) \log \frac{p(x \mid c)}{p(x)}.$$
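As a quick numeric illustration of this definition, here is a minimal sketch that evaluates $I(x; c)$ for a small, made-up discrete joint distribution (the 2×2 table is purely illustrative):

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, c): rows index x, columns index c.
p_xc = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xc.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_c = p_xc.sum(axis=0, keepdims=True)  # marginal p(c), shape (1, 2)
p_x_given_c = p_xc / p_c               # conditional p(x | c)

# I(x; c) = sum over (x, c) of p(x, c) * log( p(x | c) / p(x) )
mi = np.sum(p_xc * np.log(p_x_given_c / p_x))
print(mi)  # ~0.19 nats > 0, since x and c share information
```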
CPC (Contrastive Predictive Coding)
$$f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}$$
$$f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^{\top} W_k c_t\right), \quad \text{with a different } W_k \text{ for every step } k$$
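A minimal sketch of this log-bilinear critic (the dimensions, the dict of per-step matrices, and the random initialization are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_c = 64, 128  # dims of the encoded future z_{t+k} and the context c_t

# One learned matrix W_k per prediction step k (here k = 1..3).
W = {k: rng.normal(scale=0.01, size=(d_z, d_c)) for k in (1, 2, 3)}

def f_k(z_future: np.ndarray, c_t: np.ndarray, k: int) -> float:
    """f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t): a positive score trained
    to be proportional to the density ratio p(x_{t+k} | c_t) / p(x_{t+k})."""
    return float(np.exp(z_future @ W[k] @ c_t))
```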
CPC (Contrastive Predictive Coding)
$$\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right]$$
• NCE (Noise-Contrastive Estimation) + Importance Sampling
• InfoNCE
$$p(d = i \mid X, c_t) = \frac{p(x_i \mid c_t)\prod_{l \neq i} p(x_l)}{\sum_{j=1}^{N} p(x_j \mid c_t)\prod_{l \neq j} p(x_l)} = \frac{p(x_i \mid c_t)/p(x_i)}{\sum_{j=1}^{N} p(x_j \mid c_t)/p(x_j)}$$
$$I(x_{t+k}; c_t) \geq \log N - \mathcal{L}_N$$
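Operationally, InfoNCE is a cross-entropy that classifies the one positive sample against $N-1$ negatives; a minimal log-domain sketch (function and variable names are illustrative), together with the resulting bound:

```python
import numpy as np

def info_nce_loss(log_scores: np.ndarray, pos_index: int = 0) -> float:
    """log_scores: (N,) log-critic values log f_k(x_j, c_t), one positive
    plus N-1 negatives. Returns -log softmax(log_scores)[pos_index]."""
    m = log_scores.max()  # stabilize the log-sum-exp
    log_softmax = log_scores - (m + np.log(np.exp(log_scores - m).sum()))
    return float(-log_softmax[pos_index])

# Bound I(x_{t+k}; c_t) >= log N - L_N: with N samples, any loss below
# log N certifies a positive lower bound on the mutual information.
scores = np.array([2.0, 0.1, -0.3, 0.5, 0.0, -1.2, 0.4, 0.2])  # toy values
loss = info_nce_loss(scores)
print(np.log(len(scores)) - loss)  # MI lower bound in nats
```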
ETC (Extended Transformer Construction)
Transient Global Attention (TGlobal)

Local attention
• $r = 127$ (sufficient in practice)
• Memory complexity is linear in the input sequence length $l$: $O(l \times r)$.

Transient Global Attention (TGlobal)
• $k = 16$ (sufficient in practice)
• Attending to the $l/k$ transient global tokens adds $O(l \times l/k)$.

Total memory complexity: $O\left(l\left(r + l/k\right)\right)$ (see the sketch below).
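A simplified single-head sketch of the TGlobal pattern, to make the $O(l(r + l/k))$ count concrete (a toy illustration of the mechanism, not the paper's implementation; the paper additionally sums block and position embeddings into the global tokens):

```python
import numpy as np

def transient_global_tokens(x: np.ndarray, k: int = 16) -> np.ndarray:
    """Transient global tokens: the mean of each block of k inputs,
    recomputed on the fly at every attention layer."""
    l, d = x.shape
    return x[: l - l % k].reshape(-1, k, d).mean(axis=1)  # (l // k, d)

def tglobal_attention_mask(l: int, r: int = 127, k: int = 16) -> np.ndarray:
    """Each query attends to a local window of r neighbors per side plus
    all l // k global tokens: O(l * (r + l / k)) attended pairs total."""
    n_glob = l // k
    mask = np.zeros((l, l + n_glob), dtype=bool)
    for i in range(l):
        mask[i, max(0, i - r): i + r + 1] = True  # local neighborhood
    mask[:, l:] = True                            # all transient global tokens
    return mask

l = 4096
print(tglobal_attention_mask(l).sum(), "attended pairs vs", l * l, "for full attention")
```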
Principle Sentences Generation Pre-training (PEGASUS)
• Principle Ind-Uniq strategy (sketched below)
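A rough sketch of the Ind-Uniq selection rule: each sentence is scored independently ("Ind") by ROUGE-1 against the remainder of the document, with n-grams treated as a set ("Uniq"), and the top-m sentences become the generation targets. The unigram-set F1 below is an approximation for illustration, not the exact PEGASUS scorer:

```python
def rouge1_uniq_f1(sentence: str, rest: str) -> float:
    """ROUGE-1 F1 with unigrams treated as sets (the 'Uniq' variant)."""
    s, r = set(sentence.lower().split()), set(rest.lower().split())
    overlap = len(s & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(s), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def select_principle_sentences(sentences: list[str], m: int) -> list[int]:
    """'Ind': score each sentence independently against the rest of the
    document; return the indices of the top-m principle sentences."""
    scores = [
        rouge1_uniq_f1(s, " ".join(sentences[:i] + sentences[i + 1:]))
        for i, s in enumerate(sentences)
    ]
    return sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:m]
```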
Pre-training and Fine-tuning

Pre-training
• Input sequence length: 4096
• Output sequence length: 910

Fine-tuning

Summarization tasks
• Input sequence length: 4096, 8192, 16384
• Output sequence length: 512

QA tasks
• Input sequence length: 512 ~ 36864
• Output sequence length: 128
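The same settings, gathered into a plain config mapping (values from the slide; the structure and names are illustrative):

```python
SEQUENCE_LENGTHS = {
    "pretraining":   {"inputs": 4096, "targets": 910},
    "summarization": {"inputs": (4096, 8192, 16384), "targets": 512},
    "qa":            {"inputs": "512 to 36864, task-dependent", "targets": 128},
}
```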
Summarization
QA

Datasets
• Natural Questions (focus only on the short answer)
• TriviaQA
Input length vs. Speed, Input length vs. Performance
Principle Sentence Generation vs. Span Corruption
Long-ke-t5 (small)

Pre-training data: Ko_corpora, En_subset

Small configuration
• Dim. model: 512
• Dim. feed-forward: 1024
• Num. heads: 6
• Num. enc. layers: 8
• Num. dec. layers: 8

Base configuration
• Dim. model: 768
• Dim. feed-forward: 2048
• Num. heads: 12
• Num. enc. layers: 12
• Num. dec. layers: 12
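For reference, comparable models can be instantiated with Hugging Face's LongT5Config; the attention hyperparameters below (local_radius, global_block_size) match the r and k discussed above, while mapping these exact long-ke-t5 settings onto that API is an assumption:

```python
from transformers import LongT5Config

# Small-like configuration from the slide (HF field mapping assumed).
small_config = LongT5Config(
    d_model=512,
    d_ff=1024,
    num_heads=6,
    num_layers=8,            # encoder layers
    num_decoder_layers=8,
    local_radius=127,        # r: local attention neighborhood per side
    global_block_size=16,    # k: inputs summarized per transient global token
    encoder_attention_type="transient-global",
)
```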