XLNET: GENERALIZED
AUTOREGRESSIVE PRETRAINING
FOR LANGUAGE
UNDERSTANDING
ZHILIN YANG, ZIHANG DAI, YIMING YANG, JAIME CARBONELL,
RUSLAN SALAKHUTDINOV, QUOC V. LE
AR and AE
Autoregressive (AR) language modeling:
An autoregressive model's output h_t at time t
depends not just on x_t, but also on all
x_s from previous time steps.
Given a text sequence x = (x1, · · ·, xT),
AR language modeling factorizes the
likelihood into a forward product:
p(x) = ∏_{t=1}^{T} p(x_t | x_{<t})
Examples:
GPT, ELMo (a shallow combination of two
AR models, one forward and one backward)
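The forward factorization above can be sketched with a toy conditional model; the probability table here is hypothetical, standing in for a real language model, just to make the product concrete:

```python
import math

# log p(x) = sum_t log p(x_t | x_<t), evaluated against a hand-made
# table of conditional probabilities (a stand-in, not a trained model).
def ar_log_likelihood(tokens, cond_prob):
    total = 0.0
    for t, tok in enumerate(tokens):
        context = tuple(tokens[:t])          # x_<t: everything before x_t
        total += math.log(cond_prob[(context, tok)])
    return total

cond_prob = {
    ((), "I"): 0.5,
    (("I",), "am"): 0.8,
    (("I", "am"), "happy"): 0.25,
}
print(ar_log_likelihood(["I", "am", "happy"], cond_prob))
# ≈ -2.3026 = log(0.5 * 0.8 * 0.25)
```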
AR and AE
Autoencoding (AE) language modeling:
The AE language model aims to
reconstruct the original data from a
corrupted input.
Corrupted input: the corrupted input
replaces some of the original tokens
with the [MASK] symbol.
Example:
BERT
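BERT-style corruption can be sketched as below; the per-token sampling here is a simplification of BERT's actual scheme (which masks about 15% of tokens and sometimes keeps or randomly swaps them instead of using [MASK]):

```python
import random

# Replace a random subset of tokens with [MASK]; the model is then
# trained to reconstruct the originals at the masked positions.
def corrupt(tokens, mask_rate=0.15, rng=None):
    rng = rng or random.Random(0)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok          # positions the model must predict
        else:
            corrupted.append(tok)
    return corrupted, targets
```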
Pros and Cons of AR and AE
AR:
Advantages: good at generative NLP
tasks (when generating text, we
usually proceed forward).
Disadvantages: it models only one
direction of context (forward or
backward).
AE:
Advantages: bi-directional.
Downstream language understanding
tasks often require bidirectional
context information.
Disadvantages: 1. The artificial
symbols like [MASK] used by BERT
during pretraining are absent from real
data at finetuning time, resulting in a
pretrain-finetune discrepancy.
2. It assumes the predicted
(masked) tokens are independent of
each other given the unmasked tokens.
Example: "I am [MASK] in the [MASK] of
Waterloo." (the two masked words clearly
depend on each other).
Can we combine the two methods so that we
keep their pros and avoid their cons?
Yes!
XLNet, a generalized autoregressive method
that leverages the best of both AR language
modeling and AE while avoiding their
limitations.
XLNet
Recall: in AR methods, we maximize the likelihood under a fixed forward
or backward factorization order.
Idea:
XLNet maximizes the expected log likelihood of a sequence w.r.t.
all possible permutations of the factorization order.
Permutations:
For example, we have a sentence with 4 tokens [x1 x2 x3 x4], and
we want to predict x3.
Then we will have 4! (= 24) permutations:
[x1 x2 x3 x4], [x1 x3 x2 x4], [x1 x4 x3 x2], [x2 x1 x3 x4], …
Every token can appear before x3 in some permutation, so when we apply
the forward maximum-likelihood objective across permutations, the
prediction of x3 draws on every other token in the sentence.
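The counting claim can be checked directly with itertools: every other token precedes x3 in exactly half of the 24 orders, so in expectation the model sees context on both sides of x3.

```python
from itertools import permutations

tokens = ["x1", "x2", "x3", "x4"]
orders = list(permutations(tokens))
print(len(orders))  # 24 = 4!

for other in ["x1", "x2", "x4"]:
    # number of factorization orders in which `other` precedes x3
    before = sum(1 for o in orders if o.index(other) < o.index("x3"))
    print(other, before)  # each precedes x3 in 12 of the 24 orders
```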
XLNet
Maximize the expected log likelihood over
factorization orders sampled from all
permutations.
Also, XLNet does not rely on data
corruption (which means no masks in XLNet).
FYI: for BERT, since BERT introduces
masks, m_t indicates whether position t is masked.
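For reference, the permutation-LM objective from the paper, where Z_T denotes the set of all permutations of [1, …, T], and z_t and z_<t are the t-th element and the first t−1 elements of a sampled order z:

```latex
\max_{\theta}\;
\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}
\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid
\mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
```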
Problems:
These two requirements are contradictory in a standard Transformer architecture:
1. To predict the token x_t, the model should see only
the position of x_t, not the content of x_t.
2. To predict the token x_t, the model should encode
the content of all tokens before x_t.
(In Transformers, the word embedding and the position
information are combined.)
Solution: two-stream self-attention.
The model will only encounter text sequences in their
natural order during finetuning, which means we cannot
reorder the sentences; the permutation has to be
implemented inside the encoder.
Solution: Attention Mask
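One way to sketch how a permutation is realized purely through attention masks rather than by reordering the input; `permutation_mask` is a hypothetical helper (real implementations build these masks as tensors), using the order [x3, x2, x4, x1] from the next slide:

```python
# For factorization order z, position z_t may attend to positions
# z_1..z_{t-1} (plus itself in the content stream, but never in the
# query stream). mask[i][j] = True means "position i may attend to j".
def permutation_mask(order, include_self):
    rank = {pos: t for t, pos in enumerate(order)}  # step at which each position is predicted
    n = len(order)
    return [[rank[j] < rank[i] or (include_self and i == j)
             for j in range(n)]
            for i in range(n)]

order = [2, 1, 3, 0]  # 0-indexed version of [x3, x2, x4, x1]
content_mask = permutation_mask(order, include_self=True)   # may see own content
query_mask   = permutation_mask(order, include_self=False)  # must not see own content
```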
Two-Stream Self-Attention
• Content stream attention: the standard self-attention in
transformers.
• Query stream attention: used for predicting x_t.
Original sequence order:
[x1,x2,x3,x4]
Sample a random
factorization order:
[x3,x2,x4,x1]
Calculate content stream
attention:
KV = [h1, h2, h3, h4] Q=h1
Calculate query stream
attention:
KV = [h2, h3, h4] Q=g1
The initial value for h_i =
e(x_i) and
g_i = w
Recall:
where g is from the last layer of the
query representation.
In this graph, the other parts of the
encoder are omitted; the actual model
uses the same structure as Transformer-XL.
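A minimal sketch of the two streams' initial values and the attention inputs for this example; e(x_i) and w are placeholder vectors here, standing in for learned embeddings and the learned shared query-stream initialization:

```python
# Sampled order [x3, x2, x4, x1]; we predict x1, the last token in the order.
d = 3
e = {i: [float(i)] * d for i in range(1, 5)}  # placeholder token embeddings e(x_i)
w = [0.0] * d                                  # shared init vector for the query stream

h = {i: e[i] for i in range(1, 5)}  # content stream: h_i^(0) = e(x_i), carries x_i's content
g = {i: w for i in range(1, 5)}     # query stream:   g_i^(0) = w, knows only x_i's position

# Attention inputs for predicting x1 (last in order [x3, x2, x4, x1]):
content_KV = [h[1], h[2], h[3], h[4]]  # content stream may include h1 itself
content_Q  = h[1]
query_KV   = [h[2], h[3], h[4]]        # query stream excludes x1's own content
query_Q    = g[1]
```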
Partial prediction
Optimizing over full permutations is expensive, since N! is
very large, and it causes slow convergence.
Formally, we split z into a non-target subsequence z≤c
and a target subsequence z>c, where c is the cutting
point, and we predict only the tokens after c.
Generally, only 1/K of the tokens are selected for
prediction (K = 6 or 7 is recommended).
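The split can be sketched as follows; `split_targets` is a hypothetical helper, and taking exactly the last ⌊|z|/K⌋ positions as targets is a simplification of the paper's bookkeeping:

```python
# For a sampled factorization order z, cut at c so that only the last
# len(z)//K tokens are prediction targets; the earlier tokens serve
# purely as context. K ≈ 6-7 per the paper's recommendation.
def split_targets(z, K):
    n_targets = max(1, len(z) // K)
    c = len(z) - n_targets          # cutting point
    return z[:c], z[c:]             # (non-target z<=c, target z>c)

context, targets = split_targets([3, 2, 4, 1, 6, 5], K=3)
print(context, targets)  # [3, 2, 4, 1] [6, 5]
```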
Some methods from Transformer-XL are
incorporated, such as the relative
positional encoding scheme and the
segment recurrence mechanism (both are
helpful for long contexts).
Results:
XLNet outperforms BERT.
RoBERTa was released after XLNet,
and it is hard to tell which one is
better; XLNet may be better on
reading-comprehension tasks
(especially with longer contexts).
References:
https://arxiv.org/pdf/1906.08237.pdf
https://eigenfoo.xyz/deep-autoregressive-models/
https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335
