September 2020
Sujit Pal, Elsevier Labs
Transformer Mods for
Document Length Inputs
A survey of techniques to make long input
sequences practical to use with Transformers
About Me
• Work at Elsevier Labs
• Ex-search guy, Lucene and Solr mainly
• Got interested in NLP and ML when search started
using these techniques.
• Mostly focus on NLP problems nowadays.
Agenda
• Transformers
• Self-Attention and its limitations
• Approaches to address self-attention limitations
• Code walkthrough with Longformer
Seq2seq, Attention, and Transformer
Attention amplifies signal for specific terms
Transformer and Self-Attention
• Embeddings for terms in sequence are input
into encoder and decoder in parallel.
• Input paths mingle in self-attention layer.
• Again parallelized when input to FFN layer.
• Each term vector is projected into query (Q), key (K), and value (V)
vectors using trainable weights W_Q, W_K, W_V.
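The Q/K/V projection and the mixing of input paths described above can be sketched as single-head scaled dot-product self-attention. This is a minimal NumPy illustration, not code from the talk; the shapes, the random weights, and the helper names `softmax` and `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n, n) pairwise scores
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d = 6, 8, 4                          # illustrative sizes only
X = rng.normal(size=(n, d_model))                # stand-in term embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

The (n, n) `weights` matrix is the object whose size drives the cost discussed on the following slides.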
Self-Attention in depth
Self-Attention is O(n²)
Self-Attention is sparse
• Self-attention is O(n²) in the input length n, whether it is
used in a seq2seq model or a transformer.
• This precludes its use with large n (long input
sequences),
• even though the sequential-processing bottleneck
of RNNs is gone.
• But… the self-attention matrix is sparse.
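To make the quadratic growth concrete, the size of one dense attention matrix can be computed directly. This is a back-of-the-envelope sketch that assumes float32 values and counts a single head in a single layer; the helper name is made up for the example.

```python
def attention_matrix_bytes(n, bytes_per_value=4):
    """Bytes needed to hold one dense n x n attention matrix (float32 by default)."""
    return n * n * bytes_per_value

for n in (512, 4096, 16384):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB per head per layer")
```

Going from 512 to 4096 tokens multiplies the input length by 8 and the matrix size by 64.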
Sparse Transformers
• Autoregressive (left to right)
• Two-dimensional factorization of attention matrix
− Strided (center) – each position attends to its row and column
− Fixed (right) – each position attends to a fixed column and the elements after the latest
column element
• Algorithmic complexity O(n√n)
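The strided pattern can be sketched as a boolean mask. This is a simplified illustration, assuming a stride close to √n and merging the local and strided components into a single mask (the paper assigns them to separate heads); the function name is hypothetical.

```python
import math

def strided_mask(n, stride):
    """Causal strided pattern: position i attends to j <= i when j is within
    the previous `stride` positions or lies a multiple of `stride` behind i."""
    return [
        [j <= i and ((i - j) < stride or (i - j) % stride == 0) for j in range(n)]
        for i in range(n)
    ]

n = 64
stride = math.isqrt(n)                     # stride chosen close to sqrt(n)
mask = strided_mask(n, stride)
sparse_entries = sum(sum(row) for row in mask)
dense_causal_entries = n * (n + 1) // 2    # full autoregressive attention
```

Each row keeps at most about 2√n entries, which is where the O(n√n) total comes from.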
Transformer-XL
• Autoregressive, segments input into fixed size blocks
• Segment level recurrence with state reuse
• Analogous to truncated BPTT (Backpropagation Through Time): caches and reuses the
sequence of hidden states from previous segments.
• Better perplexity scores up to 900 tokens.
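Segment-level recurrence can be sketched for a single attention layer as follows. This is a toy NumPy version under several assumptions: no learned projections, no relative positional encoding, and a memory that holds only the previous segment; `attend_with_memory` is a name invented for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_with_memory(segment, memory):
    """Queries come from the current segment; keys and values also include
    the cached states of the previous segment (treated as constants)."""
    context = segment if memory is None else np.concatenate([memory, segment])
    seg_len = segment.shape[0]
    mem_len = context.shape[0] - seg_len
    scores = segment @ context.T / np.sqrt(segment.shape[-1])
    # Causal mask: token i sees all of memory plus tokens <= i of its own segment.
    local = np.tril(np.ones((seg_len, seg_len), dtype=bool))
    allowed = np.concatenate([np.ones((seg_len, mem_len), dtype=bool), local], axis=1)
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ context

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))          # 12 token vectors of width 8
segment_len, memory, outputs = 4, None, []
for start in range(0, len(tokens), segment_len):
    segment = tokens[start:start + segment_len]
    outputs.append(attend_with_memory(segment, memory))
    memory = segment                       # cache this segment's states for the next one
outputs = np.concatenate(outputs)
```

In a training framework the cached memory would be detached from the gradient graph, which is what makes the scheme resemble truncated BPTT.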
Reformer
• Uses Locality-Sensitive Hashing
(LSH) to convert the sparse attention
matrix into a set of dense matrices
− Hashing input tokens
− Sorting and chunking
• Reversible Residuals
• Algorithmic complexity O(n log n)
• Can handle 64k token inputs
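The hashing, sorting, and chunking steps can be sketched with angular LSH based on random projections. This is an illustrative NumPy version with a single hash round and made-up sizes; attention would then be computed only inside each chunk (and its neighbour), which is not shown here.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions R and take the argmax
    over [xR ; -xR], so nearby vectors tend to share a bucket."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    projected = x @ R
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

rng = np.random.default_rng(0)
n, d, n_buckets, chunk = 16, 8, 4, 4       # illustrative sizes only
x = rng.normal(size=(n, d))                # shared query/key vectors
buckets = lsh_buckets(x, n_buckets, rng)
order = np.argsort(buckets, kind="stable") # sort positions by bucket id
chunks = order.reshape(-1, chunk)          # fixed-size chunks of positions
```

Sorting by bucket places likely-similar positions next to each other, so each fixed-size chunk becomes a small dense attention problem.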
Routing Transformer
• Adds a sparse routing module based on online k-means to self-attention
• Groups the K and Q vectors into clusters using k-means
• Each attention step considers only context in the same cluster
• Retains some notion of non-local or global context
• Algorithmic complexity O(n^1.5 * d) when the number of clusters k is about √n, where d is the vector size
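The routing idea can be sketched as a mask built from cluster assignments. This is a simplification that assumes fixed random centroids and plain nearest-centroid assignment; in the paper the centroids are learned online and cluster sizes are balanced, neither of which is shown.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Index of the closest centroid for every row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4                         # illustrative sizes only
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
centroids = rng.normal(size=(k, d))        # stand-in for centroids learned by online k-means

q_cluster = nearest_centroid(Q, centroids)
k_cluster = nearest_centroid(K, centroids)
# Query i may attend to key j only if both fall in the same cluster.
routing_mask = q_cluster[:, None] == k_cluster[None, :]
```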
Sinkhorn Transformer
• Meta-sorting network that learns to
rearrange and sort input sequence
• Sequences blocked and attention
computed within each block.
• Memory complexity O(B² + N_B²), where
B is block size and N_B is number of
blocks, N_B << n.
• SortCut variant only looks at top k
nearest neighbors in block, reduces
complexity to O(n * k).
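The sorting step relies on Sinkhorn normalization, which turns a matrix of block-to-block scores into an approximately doubly stochastic (soft permutation) matrix. Below is a minimal NumPy sketch; the random scores stand in for the output of the learned sorting network, and the iteration count is an arbitrary choice.

```python
import numpy as np

def sinkhorn(logits, n_iters=200):
    """Alternate row and column normalization in log space so that
    exp(result) approaches a doubly stochastic matrix."""
    log_p = logits.copy()
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # columns
    return np.exp(log_p)

rng = np.random.default_rng(0)
n_blocks = 4
block_scores = rng.normal(size=(n_blocks, n_blocks))    # stand-in for sorting network output
P = sinkhorn(block_scores)
blocks = rng.normal(size=(n_blocks, 3, 8))              # 4 blocks of 3 tokens, width 8
sorted_blocks = np.einsum("ij,jtd->itd", P, blocks)     # soft re-ordering of blocks
```

Attention is then computed block by block, with each block also seeing the block that was softly sorted into its position.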
Linformer
• Observation: the self-attention matrix can be
approximated by a low-rank
matrix
• SVD could provide this approximation
using the top k singular components
• Instead introduces a new self-attention
pipeline that uses linear projections to
do the dimensionality reduction.
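The linear-projection pipeline can be sketched by compressing K and V along the sequence axis before attention. This is a minimal NumPy illustration; the projection matrices `E` and `F` are random here but would be learned, and the sizes are arbitrary.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """E and F have shape (k, n): they compress K and V along the
    sequence axis, so the score matrix is (n, k) instead of (n, n)."""
    K_proj, V_proj = E @ K, F @ V                     # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])      # (n, k)
    return softmax(scores) @ V_proj                   # (n, d)

rng = np.random.default_rng(0)
n, d, k = 128, 16, 8                                  # illustrative sizes, k << n
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))  # learned in practice
out = linformer_attention(Q, K, V, E, F)
```

With k fixed, the cost grows linearly in n because no (n, n) matrix is ever formed.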
Attention with Linear Complexity
• Uses the associativity of matrix multiplication to compute Q(KᵀV) instead of (QKᵀ)V.
• Algorithmic complexity changes from O(n² * d) to O(n * d²), where d is the
size of the K, Q, and V vectors and n is input length, d << n.
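The associativity argument can be checked numerically. This sketch deliberately leaves out the normalization step, so it demonstrates only the reordering of the matrix products, not the full method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8                              # illustrative sizes, d << n
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

quadratic = (Q @ K.T) @ V                  # builds an (n, n) intermediate: O(n^2 * d)
linear = Q @ (K.T @ V)                     # builds a (d, d) intermediate: O(n * d^2)
```

Both orderings give the same (n, d) result up to floating-point error; only the size of the intermediate matrix differs.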
Longformer
• Sparsify full attention matrix according to an "attention pattern"
• Sliding window of size k (k/2 left, k/2 right) to capture local context
• Optional dilated sliding window at different layers of multi-head attention to get bigger
receptive field
• Additional (task dependent) global attention – [CLS] for classification, question tokens for
QA
• Scales linearly with input length n and sliding window size k, i.e. O(n * k); the cost of
global attention is minimal; k << n
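The sliding-window plus global pattern can be sketched as a boolean mask. This is a token-level illustration without dilation; the function name and the choice of position 0 as the global token are assumptions for the example.

```python
def longformer_mask(n, window, global_positions=()):
    """mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    g = set(global_positions)
    return [
        # Local window, plus symmetric global attention to and from global tokens.
        [abs(i - j) <= half or i in g or j in g for j in range(n)]
        for i in range(n)
    ]

n, window = 32, 4
mask = longformer_mask(n, window, global_positions=[0])  # position 0 plays the role of [CLS]
allowed = sum(sum(row) for row in mask)
```

The number of allowed pairs grows roughly as n * (window + 1) plus a term linear in n for each global token, instead of n².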
Big Bird
• Consider self-attention as a directed graph and apply graph sparsification principles
• Composed of
• Set of global tokens g that attend to all parts of the sequence (ITC – subset of input
tokens, ETC – additional tokens such as [CLS])
• For each Q, a set of r random keys it will attend to
• Local neighborhood of size w for each input token
• Complexity is O(n * w) because the effects of g and r are negligible, w << n
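The three components (global, random, local) can be combined into one mask. This is a token-level sketch with a fixed seed for the random keys; the actual model works on blocks of tokens, and the function name and parameter values are made up for the example.

```python
import random

def bigbird_mask(n, window, n_random, global_positions, seed=0):
    """mask[i][j] is True when query i may attend to key j."""
    rng = random.Random(seed)
    half = window // 2
    g = set(global_positions)
    # Local neighborhood plus global tokens (which attend everywhere and are attended by all).
    mask = [[abs(i - j) <= half or i in g or j in g for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), n_random):   # r random keys per query
            mask[i][j] = True
    return mask

n, window, n_random = 32, 4, 2
mask = bigbird_mask(n, window, n_random, global_positions=[0])
```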
References
• Attention Is All You Need (Vaswani et al., 2017)
• Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (Alammar,
2018)
• The Illustrated Transformer (Alammar, 2018)
• Generating Long Sequences with Sparse Transformers (Child, Gray, Radford, and Sutskever, 2019)
• Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
• Reformer: The Efficient Transformer (Kitaev, Kaiser, and Levskaya, 2020)
• Efficient Content-Based Sparse Attention with Routing Transformers (Roy, Saffar, Vaswani, and Grangier, 2020)
• Sparse Sinkhorn Attention (Tay et al., 2020)
• Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, and Ma, 2020)
• Efficient Attention: Attention with Linear Complexities (Shen et al., 2020)
• Longformer: The Long-Document Transformer (Beltagy, Peters, and Cohan, 2020)
• Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
• A Survey of Long-Term Context in Transformers (May, 2020)
Notebooks
• lf1_longformer_pretrained.ipynb – using pre-trained and fine-tuned
Longformer models for document embedding and question answering
respectively.
• lf2_longformer_sentiment_training.ipynb – training a Longformer model for
sentiment classification.
The huggingface/transformers project provides (PyTorch and
TensorFlow) implementations for some of the transformers discussed here.
• Transformer-XL
• Reformer
• Longformer
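As a rough idea of what using the pre-trained Longformer from huggingface/transformers for document embedding can look like, here is a hedged sketch. It is not the notebook code: the checkpoint name `allenai/longformer-base-4096`, the choice of the first token as the document vector, and both helper names are assumptions. The heavy imports are kept inside the function because running it requires `torch`, `transformers`, and a model download.

```python
def build_global_attention_mask(seq_len, global_positions):
    """1 marks tokens with global attention, 0 marks sliding-window-only tokens."""
    mask = [0] * seq_len
    for position in global_positions:
        mask[position] = 1
    return mask

def embed_document(text, model_name="allenai/longformer-base-4096"):
    """Return the final hidden state of the first token as a document vector."""
    import torch
    from transformers import LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained(model_name)
    model = LongformerModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    seq_len = inputs["input_ids"].shape[1]
    # Give only the first token (the [CLS] equivalent) global attention.
    global_attention_mask = torch.tensor([build_global_attention_mask(seq_len, [0])])
    with torch.no_grad():
        outputs = model(**inputs, global_attention_mask=global_attention_mask)
    return outputs.last_hidden_state[0, 0]
```

Calling `embed_document("some long text ...")` would return a single vector; for question answering the global positions would instead cover the question tokens.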
Thank you

More Related Content

What's hot

Designing a machine learning algorithm for Apache Spark
Designing a machine learning algorithm for Apache SparkDesigning a machine learning algorithm for Apache Spark
Designing a machine learning algorithm for Apache SparkMarco Gaido
 
Design and Implementation of the Security Graph Language
Design and Implementation of the Security Graph LanguageDesign and Implementation of the Security Graph Language
Design and Implementation of the Security Graph LanguageAsankhaya Sharma
 
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time CompilerA Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time Compilerkeanumit
 
Tutorial ns 3-tutorial-slides
Tutorial ns 3-tutorial-slidesTutorial ns 3-tutorial-slides
Tutorial ns 3-tutorial-slidesVinayagam D
 
Building Topology in NS3
Building Topology in NS3Building Topology in NS3
Building Topology in NS3Rahul Hada
 
Understanding Large Social Networks | IRE Major Project | Team 57
Understanding Large Social Networks | IRE Major Project | Team 57 Understanding Large Social Networks | IRE Major Project | Team 57
Understanding Large Social Networks | IRE Major Project | Team 57 Raj Patel
 
Applications of the Reverse Engineering Language REIL
Applications of the Reverse Engineering Language REILApplications of the Reverse Engineering Language REIL
Applications of the Reverse Engineering Language REILzynamics GmbH
 
Optimized Floating-point Complex number multiplier on FPGA
Optimized Floating-point Complex number multiplier on FPGAOptimized Floating-point Complex number multiplier on FPGA
Optimized Floating-point Complex number multiplier on FPGADr. Pushpa Kotipalli
 
Design and implementation of low power
Design and implementation of low powerDesign and implementation of low power
Design and implementation of low powerSurendra Bommavarapu
 
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierCrai Macdonald
 
QVT Traceability: What does it really mean?
QVT Traceability: What does it really mean?QVT Traceability: What does it really mean?
QVT Traceability: What does it really mean?Edward Willink
 
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)TanvirAhammed22
 
ETA Prediction with Graph Neural Networks in Google Maps
ETA Prediction with Graph Neural Networks in Google MapsETA Prediction with Graph Neural Networks in Google Maps
ETA Prediction with Graph Neural Networks in Google Mapsivaderivader
 
Service Mesh with Envoy and Istio
Service Mesh with Envoy and IstioService Mesh with Envoy and Istio
Service Mesh with Envoy and IstioArvind Thangamani
 
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...Edward Willink
 
Evolution of JDK Tools for Multithreaded Programming
Evolution of JDK Tools for Multithreaded ProgrammingEvolution of JDK Tools for Multithreaded Programming
Evolution of JDK Tools for Multithreaded ProgrammingGlobalLogic Ukraine
 

What's hot (20)

Designing a machine learning algorithm for Apache Spark
Designing a machine learning algorithm for Apache SparkDesigning a machine learning algorithm for Apache Spark
Designing a machine learning algorithm for Apache Spark
 
Design and Implementation of the Security Graph Language
Design and Implementation of the Security Graph LanguageDesign and Implementation of the Security Graph Language
Design and Implementation of the Security Graph Language
 
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time CompilerA Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
 
Tutorial ns 3-tutorial-slides
Tutorial ns 3-tutorial-slidesTutorial ns 3-tutorial-slides
Tutorial ns 3-tutorial-slides
 
Building Topology in NS3
Building Topology in NS3Building Topology in NS3
Building Topology in NS3
 
Understanding Large Social Networks | IRE Major Project | Team 57
Understanding Large Social Networks | IRE Major Project | Team 57 Understanding Large Social Networks | IRE Major Project | Team 57
Understanding Large Social Networks | IRE Major Project | Team 57
 
Applications of the Reverse Engineering Language REIL
Applications of the Reverse Engineering Language REILApplications of the Reverse Engineering Language REIL
Applications of the Reverse Engineering Language REIL
 
Optimized Floating-point Complex number multiplier on FPGA
Optimized Floating-point Complex number multiplier on FPGAOptimized Floating-point Complex number multiplier on FPGA
Optimized Floating-point Complex number multiplier on FPGA
 
8 Bit A L U
8 Bit  A L U8 Bit  A L U
8 Bit A L U
 
Design and implementation of low power
Design and implementation of low powerDesign and implementation of low power
Design and implementation of low power
 
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)
Neural Machine Translation (D2L10 Insight@DCU Machine Learning Workshop 2017)
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
 
QVT Traceability: What does it really mean?
QVT Traceability: What does it really mean?QVT Traceability: What does it really mean?
QVT Traceability: What does it really mean?
 
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
 
ETA Prediction with Graph Neural Networks in Google Maps
ETA Prediction with Graph Neural Networks in Google MapsETA Prediction with Graph Neural Networks in Google Maps
ETA Prediction with Graph Neural Networks in Google Maps
 
Service Mesh with Envoy and Istio
Service Mesh with Envoy and IstioService Mesh with Envoy and Istio
Service Mesh with Envoy and Istio
 
Lecture verilog ii_c
Lecture verilog ii_cLecture verilog ii_c
Lecture verilog ii_c
 
NS3 Overview
NS3 OverviewNS3 Overview
NS3 Overview
 
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
 
Evolution of JDK Tools for Multithreaded Programming
Evolution of JDK Tools for Multithreaded ProgrammingEvolution of JDK Tools for Multithreaded Programming
Evolution of JDK Tools for Multithreaded Programming
 

Similar to Transformer Mods for Document Length Inputs

Seq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) modelSeq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) model佳蓉 倪
 
Deformable DETR Review [CDM]
Deformable DETR Review [CDM]Deformable DETR Review [CDM]
Deformable DETR Review [CDM]Dongmin Choi
 
Week9_Seq2seq.pptx
Week9_Seq2seq.pptxWeek9_Seq2seq.pptx
Week9_Seq2seq.pptxKhngNguyn81
 
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptxNibrasulIslam
 
A Generalization of Transformer Networks to Graphs.pptx
A Generalization of Transformer Networks to Graphs.pptxA Generalization of Transformer Networks to Graphs.pptx
A Generalization of Transformer Networks to Graphs.pptxssuser2624f71
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptx[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptxthanhdowork
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Universitat Politècnica de Catalunya
 
Basic electronics
Basic electronicsBasic electronics
Basic electronicspavi1234
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelKoichi Shirahata
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizerVenkat Malai Avichi
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)H K Yoon
 
Natural Language to Visualization by Neural Machine Translation
Natural Language to Visualization by Neural Machine TranslationNatural Language to Visualization by Neural Machine Translation
Natural Language to Visualization by Neural Machine Translationivaderivader
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with PerformersJoonhyung Lee
 
Queuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depthQueuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depthIdcIdk1
 
Term paper presentation
Term paper presentationTerm paper presentation
Term paper presentationmariam mehreen
 

Similar to Transformer Mods for Document Length Inputs (20)

Seq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) modelSeq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) model
 
Deformable DETR Review [CDM]
Deformable DETR Review [CDM]Deformable DETR Review [CDM]
Deformable DETR Review [CDM]
 
Week9_Seq2seq.pptx
Week9_Seq2seq.pptxWeek9_Seq2seq.pptx
Week9_Seq2seq.pptx
 
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
 
DaViT.pdf
DaViT.pdfDaViT.pdf
DaViT.pdf
 
A Generalization of Transformer Networks to Graphs.pptx
A Generalization of Transformer Networks to Graphs.pptxA Generalization of Transformer Networks to Graphs.pptx
A Generalization of Transformer Networks to Graphs.pptx
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptx[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptx
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
 
Switching units
Switching unitsSwitching units
Switching units
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
Basic electronics
Basic electronicsBasic electronics
Basic electronics
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming Model
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizer
 
Nlp and transformer (v3s)
Nlp and transformer (v3s)Nlp and transformer (v3s)
Nlp and transformer (v3s)
 
Natural Language to Visualization by Neural Machine Translation
Natural Language to Visualization by Neural Machine TranslationNatural Language to Visualization by Neural Machine Translation
Natural Language to Visualization by Neural Machine Translation
 
Rethinking Attention with Performers
Rethinking Attention with PerformersRethinking Attention with Performers
Rethinking Attention with Performers
 
Queuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depthQueuing theory and traffic analysis in depth
Queuing theory and traffic analysis in depth
 
Term paper presentation
Term paper presentationTerm paper presentation
Term paper presentation
 

More from Sujit Pal

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSujit Pal
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Sujit Pal
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question AnsweringSujit Pal
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and TestSujit Pal
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Sujit Pal
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringSujit Pal
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop VisualizationSujit Pal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudSujit Pal
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Sujit Pal
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubSujit Pal
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalSujit Pal
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesSujit Pal
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingSujit Pal
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSujit Pal
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSujit Pal
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchSujit Pal
 

More from Sujit Pal (20)

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestring
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based Retrieval
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
 

Recently uploaded

Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Asset Management Software - Infographic
Transformer Mods for Document Length Inputs

  • 1. Transformer Mods for Document Length Inputs: a survey of techniques to make long input sequences practical to use with Transformers. Sujit Pal, Elsevier Labs, September 2020.
  • 2. About Me
      • Work at Elsevier Labs
      • Ex-search guy, Lucene and Solr mainly
      • Started with NLP and ML as search started using these techniques, got interested.
      • Mostly focus on NLP problems nowadays.
  • 3. Agenda
      • Transformers
      • Self-Attention and its limitations
      • Approaches to address self-attention limitations
      • Code walkthrough with Longformer
  • 4. Seq2seq, Attention, and Transformer: attention amplifies the signal for specific terms.
  • 5. Transformer and Self-Attention
      • Embeddings for the terms in a sequence are input into the encoder and decoder in parallel.
      • Input paths mingle in the self-attention layer.
      • Paths are parallel again when input to the feed-forward network (FFN) layer.
      • Each term vector is projected into query (Q), key (K), and value (V) vectors using trainable weights WQ, WK, WV.
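The Q, K, V projection described on this slide can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the deck's own code: the function name `self_attention`, the random weights, and the toy sizes are all made up for the example.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention for X of shape (n, d_model)."""
    Q = X @ W_q                      # queries, shape (n, d)
    K = X @ W_k                      # keys,    shape (n, d)
    V = X @ W_v                      # values,  shape (n, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (n, n): every term scored against every term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V, weights

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n, d_model, d = 6, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

The (n, n) `weights` matrix is the object that the rest of the deck tries to shrink or sparsify.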
  • 8. Self-Attention is sparse
      • Self-attention is O(n^2) in the sequence length n, whether it is used in a seq2seq model or in a Transformer.
      • This precludes its use with large n (long input sequences),
      • even though the Transformer no longer has the sequential processing issue of the RNN.
      • But the self-attention matrix is sparse.
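To make the quadratic growth concrete, the following sketch counts the scores in one full attention matrix and the memory they would take. The 4 bytes per score (float32) and the helper name `attention_matrix_bytes` are assumptions for illustration; real memory use also depends on heads, layers, batch size, and activations kept for backpropagation.

```python
def attention_matrix_bytes(n, bytes_per_value=4):
    """Bytes needed to store one dense n x n attention matrix (float32 assumed)."""
    return n * n * bytes_per_value

for n in (512, 4096, 65536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {n * n:>13,} scores, {mib:,.0f} MiB per head per layer")
```

Going from 512 to 4096 tokens is an 8x increase in length but a 64x increase in matrix size.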
  • 9. Sparse Transformers
      • Autoregressive (left to right)
      • Two-dimensional factorization of attention matrix
          − Strided (center): each position attends to its row and column
          − Fixed (right): each position attends to a fixed column and the elements after the latest column element
      • Algorithmic complexity O(n√n)
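The two patterns can be sketched as boolean masks. This is a simplified illustration, assuming a stride of about √n and merging the two factorized heads of each pattern into one mask; the function names are hypothetical and the "fixed" variant here uses a single summary column per block.

```python
import math

def strided_mask(n, stride):
    """Causal mask: attend to the previous `stride` positions and every stride-th one before."""
    return [[j <= i and (i - j < stride or (i - j) % stride == 0)
             for j in range(n)] for i in range(n)]

def fixed_mask(n, stride):
    """Causal mask: attend within the current block and to the last column of each block."""
    return [[j <= i and (j // stride == i // stride or j % stride == stride - 1)
             for j in range(n)] for i in range(n)]

def count(mask):
    return sum(sum(row) for row in mask)

n = 64
stride = math.isqrt(n)             # stride close to sqrt(n) gives roughly n * sqrt(n) entries
full_causal = n * (n + 1) // 2     # entries in dense causal attention
print(count(strided_mask(n, stride)), count(fixed_mask(n, stride)), full_causal)
```

Each row keeps at most about 2√n entries, which is where the O(n√n) total comes from.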
  • 10. Transformer-XL
      • Autoregressive, segments input into fixed-size blocks
      • Segment-level recurrence with state reuse
      • Analogous to BPTT (Backpropagation Through Time): caches and applies the sequence of hidden states from previous segments.
      • Better perplexity scores up to 900 tokens.
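Segment-level recurrence can be sketched as attention whose keys and values include a cache from the previous segment. The sketch below is a single-layer toy with no learned weights and without Transformer-XL's relative positional encoding; `attend_with_memory` and the sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_with_memory(segment, memory):
    """Queries come from the current segment; keys/values also cover the cached states."""
    s, m = segment.shape[0], memory.shape[0]
    context = np.concatenate([memory, segment], axis=0)          # (m + s, d)
    scores = segment @ context.T / np.sqrt(segment.shape[-1])    # (s, m + s)
    # Causal rule: all cached positions are visible, future positions in the segment are not.
    future = np.triu(np.ones((s, s), dtype=bool), k=1)
    blocked = np.concatenate([np.zeros((s, m), dtype=bool), future], axis=1)
    scores = np.where(blocked, -np.inf, scores)
    return softmax(scores) @ context

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 4))      # toy sequence of 12 term vectors
seg_len = 4
memory = np.zeros((0, 4))              # empty cache before the first segment
outputs = []
for start in range(0, len(tokens), seg_len):
    segment = tokens[start:start + seg_len]
    outputs.append(attend_with_memory(segment, memory))
    memory = segment                   # cache this layer's inputs; no gradient flows into it
print(np.concatenate(outputs).shape)
```

Each segment costs attention over only its own length plus the cache, yet information can still flow across segment boundaries through the cached states.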
  • 11. Reformer
      • Uses Locality-Sensitive Hashing (LSH) to convert the sparse attention matrix into a set of dense matrices
          − Hashing input tokens
          − Sorting and chunking
      • Reversible Residuals
      • Algorithmic complexity O(n log n)
      • Can handle 64k token inputs
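The hash, sort, and chunk steps can be sketched as follows. This is a rough illustration using random-projection (angular) hashing with a single hash round; the helper name `lsh_buckets` and the toy sizes are assumptions, and the attention computation inside each chunk is left out.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions, take argmax over [proj, -proj]."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))   # n_buckets must be even
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
n, d, n_buckets, chunk = 16, 8, 4, 4
qk = rng.normal(size=(n, d))                 # Reformer ties queries and keys together
buckets = lsh_buckets(qk, n_buckets, rng)    # 1. hash every position to a bucket
order = np.argsort(buckets, kind="stable")   # 2. sort positions by bucket id
chunks = order.reshape(-1, chunk)            # 3. cut the sorted order into equal chunks
print(buckets[order])
print(chunks)
```

Attention is then computed densely inside each small chunk (in the paper, also looking back one chunk) instead of over the full n x n matrix.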
  • 12. Routing Transformer
      • Adds a sparse routing module based on online K-Means to self-attention
      • Groups the rows of the K and Q matrices into clusters using K-Means
      • Each attention step considers only context in the same cluster
      • Some notion of non-local or global context
      • Algorithmic complexity O(n^1.5 * k), where k = number of clusters
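The "same cluster only" rule can be sketched as a mask built from cluster assignments. In this toy version the centroids are fixed random unit vectors rather than being learned online, and `routing_mask` is a hypothetical helper, so it only illustrates the routing idea, not the training procedure.

```python
import numpy as np

def routing_mask(Q, K, centroids):
    """Allow query i to attend to key j only if both fall in the same cluster."""
    def assign(x):
        x = x / np.linalg.norm(x, axis=-1, keepdims=True)
        return np.argmax(x @ centroids.T, axis=-1)    # nearest centroid by cosine similarity
    q_cluster, k_cluster = assign(Q), assign(K)
    return q_cluster[:, None] == k_cluster[None, :], q_cluster, k_cluster

rng = np.random.default_rng(0)
n, d, n_clusters = 32, 8, 4
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
centroids = rng.normal(size=(n_clusters, d))
centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
mask, q_cluster, k_cluster = routing_mask(Q, K, centroids)
print(mask.sum(), "of", n * n, "query-key pairs kept")
```

Because cluster membership depends on content rather than position, a query can reach keys anywhere in the sequence, which is the non-local context the slide mentions.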
  • 13. Sinkhorn Transformer
      • Meta-sorting network that learns to rearrange and sort the input sequence
      • Sequences are split into blocks and attention is computed within each block.
      • Memory complexity O(B^2 + N_B^2), where B is the block size and N_B is the number of blocks, N_B << n.
      • SortCut variant only looks at the top k nearest neighbors in a block, which reduces complexity to O(n * k).
  • 14. Linformer
      • Observation: self-attention can be approximated by a low-rank matrix
      • Decompose with SVD and keep the top k principal components
      • Introduces a new self-attention pipeline that uses linear projection to do dimensionality reduction.
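The linear-projection idea can be sketched by shrinking K and V along the sequence axis before attention. The projection matrices `E` and `F` are random here purely for illustration (in the model they are learned), and the sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V from n rows down to k rows, then attend as usual."""
    K_proj = E @ K                                    # (k, d)
    V_proj = F @ V                                    # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])      # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj                   # (n, d)

rng = np.random.default_rng(0)
n, d, k = 128, 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)
```

The score matrix has n * k entries, so cost grows linearly in n for a fixed k.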
  • 15. Attention with Linear Complexity
      • Uses the associativity of matrix multiplication to convert (Q K^T) V into Q (K^T V).
      • Algorithmic complexity changes from O(n^2 * d) to O(n * d^2), where d is the size of the K, Q, and V vectors and n is the input length, d << n.
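The associativity argument can be checked numerically. The sketch below leaves out the softmax normalization entirely (the actual method has to normalize Q and K separately for the reordering to be valid), so it only demonstrates the reordering and the multiplication counts; the helper names are made up for the example.

```python
import numpy as np

def quadratic_order(Q, K, V):
    return (Q @ K.T) @ V        # builds an (n, n) intermediate

def linear_order(Q, K, V):
    return Q @ (K.T @ V)        # builds a (d, d) intermediate

def multiplications(n, d):
    """Scalar multiplications for each ordering: (quadratic, linear)."""
    return 2 * n * n * d, 2 * n * d * d

rng = np.random.default_rng(0)
n, d = 256, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(np.allclose(quadratic_order(Q, K, V), linear_order(Q, K, V)))
print(multiplications(n, d))
```

With d << n the second ordering never materializes the n x n matrix at all.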
  • 16. Longformer
      • Sparsify the full attention matrix according to an "attention pattern"
      • Sliding window of size k (k/2 left, k/2 right) to capture local context
      • Optional dilated sliding window at different layers of multi-head attention to get a bigger receptive field
      • Additional (task-dependent) global attention: [CLS] for classification, question tokens for QA
      • Scales linearly with input length n and sliding window length k (O(n * k)); the effect of global context is minimal, k << n
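The sliding-window-plus-global pattern can be sketched as a boolean mask. This is a dense toy mask for illustration only (a real implementation never materializes the full matrix), without dilation, and `longformer_mask` is a hypothetical helper name.

```python
def longformer_mask(n, window, global_positions=()):
    """True where position i may attend to position j."""
    half = window // 2
    g = set(global_positions)
    return [[abs(i - j) <= half      # local sliding window
             or i in g               # a global token attends everywhere
             or j in g               # every token attends to global tokens
             for j in range(n)] for i in range(n)]

n, window = 32, 4
mask = longformer_mask(n, window, global_positions=(0,))   # position 0 plays the role of [CLS]
kept = sum(sum(row) for row in mask)
print(kept, "of", n * n, "entries kept")
```

The number of kept entries is bounded by roughly n * (k + 1) plus 2n per global token, so it grows linearly in n.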
  • 17. Big Bird
      • Consider self-attention as a DAG and apply graph sparsification principles
      • Composed of
          − Set of global tokens g that attend to all parts of the sequence (ITC: subset of input tokens; ETC: additional tokens such as [CLS])
          − For each Q, a set of r random keys it will attend to
          − Local neighborhood of size w for each input token
      • Complexity is O(n * w) because the effects of g and r are negligible, w << n
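The three components can be combined into one mask as a sketch. This token-level version is an assumption made for readability (the published model works on blocks of tokens), and `bigbird_mask` with its seed parameter is a hypothetical helper.

```python
import random

def bigbird_mask(n, window, n_random, global_positions, seed=0):
    """Union of global, random, and local-window attention; True means 'may attend'."""
    rng = random.Random(seed)
    half = window // 2
    g = set(global_positions)
    mask = [[abs(i - j) <= half or i in g or j in g
             for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), n_random):   # r random keys per query
            mask[i][j] = True
    return mask

n, w, r = 64, 6, 3
globals_ = (0, 1)
mask = bigbird_mask(n, w, r, globals_)
per_row = [sum(row) for i, row in enumerate(mask) if i not in globals_]
print(max(per_row), "entries at most per non-global row, out of", n)
```

Each non-global row keeps at most w + 1 + r + g entries regardless of n, which is why the total stays linear in sequence length.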
  • 18. References
      • Attention Is All You Need (Vaswani, et al, 2017)
      • Visualizing a Neural Machine Translation Model (Mechanics of Seq2Seq Models with Attention) (Alammar, 2018)
      • The Illustrated Transformer (Alammar, 2018)
      • Generating Long Sequences with Sparse Transformers (Child, Gray, Radford, and Sutskever, 2019)
      • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai, et al, 2019)
      • Reformer: The Efficient Transformer (Kitaev, Kaiser, and Levskaya, 2020)
      • Efficient Content-Based Sparse Attention with Routing Transformers (Roy, Saffar, Vaswani, and Grangier, 2020)
      • Sparse Sinkhorn Attention (Tay, et al, 2020)
      • Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, and Ma, 2020)
      • Efficient Attention: Attention with Linear Complexities (Shen, et al, 2020)
      • Longformer: The Long-Document Transformer (Beltagy, Peters, and Cohan, 2020)
      • Big Bird: Transformers for Longer Sequences (Zaheer, et al, 2020)
      • A Survey of Long-Term Context in Transformers (May, 2020)
  • 19. Notebooks
      • lf1_longformer_pretrained.ipynb: using pre-trained and fine-tuned Longformer models for document embedding and question answering respectively.
      • lf2_longformer_sentiment_training.ipynb: training a Longformer model for sentiment classification.
      • The huggingface/transformers project provides (PyTorch and TensorFlow) implementations for these transformers discussed here:
          − Transformer-XL
          − Reformer
          − Longformer