Breaking the Softmax Bottleneck
via Learnable Monotonic
Pointwise Non-linearities
Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul,
Aliaksei Severyn
ICML 2019
2019/9/28
1
• [Yang+ 18]
Softmax bottleneck
– Softmax bottleneck
logit
–
• [Yang+ 18]
–
–
2
•
– P(I have a dream) > P(a have I dream)
–
• perplexity
–
– →
NTTW
2i
Encoder-Decoder 2
2RNN Encoder-Decoder
P(I have a dream) > P(a have I dream) > P(fuga spam hoge)
:
•  2RNN e
•  2 P
P(I have a dream)
= P(I)P(have | I)P(a | I have)P(dream | I have a)
I have a dream
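The chain-rule factorization above can be sketched in a few lines. This is a minimal illustration only: the conditional probabilities in `COND_PROB` are made-up values (an assumption, not taken from the slides or from any trained model), and `sentence_log_prob` is a hypothetical helper name.

```python
import math

# Hypothetical conditional probabilities P(word | history).
# The numbers are invented for illustration only.
COND_PROB = {
    ("<BOS>",): {"I": 0.2},
    ("<BOS>", "I"): {"have": 0.3},
    ("<BOS>", "I", "have"): {"a": 0.4},
    ("<BOS>", "I", "have", "a"): {"dream": 0.05},
}


def sentence_log_prob(words):
    """Sum log P(w_t | w_1 .. w_{t-1}) over the sentence (chain rule)."""
    history = ("<BOS>",)
    total = 0.0
    for word in words:
        total += math.log(COND_PROB[history][word])
        history = history + (word,)
    return total


log_p = sentence_log_prob(["I", "have", "a", "dream"])
prob = math.exp(log_p)  # equals 0.2 * 0.3 * 0.4 * 0.05
```

Working in log space avoids numerical underflow when the product runs over long sentences.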
3
• Noisy channel model
– P(T)
–
•
–
•
4
RNN
• RNN
•
RNN RNN RNN RNN
<BOS> I have a
P(I) P(have | I) P(a | I have) P(dream | I have a)
5
RNN 1/2
• c
P
RNN RNN
<BOS> I
hc
c
[P (x1 | c ), P(x2 | c ), …, ]
hc W
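A minimal sketch of how the vector [P(x1 | c), P(x2 | c), …] can be obtained from hc and W, assuming the standard softmax output layer. The toy values of `h_c` and `W` and the function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(h_c, W):
    """Return [P(x_1 | c), P(x_2 | c), ...].

    h_c: context vector of dimension d.
    W:   one word embedding w_x (dimension d) per vocabulary word.
    """
    # Each logit is the dot product between h_c and one word embedding.
    logits = [sum(h * w for h, w in zip(h_c, w_x)) for w_x in W]
    return softmax(logits)


# Toy example: d = 2, vocabulary of 3 words (values are made up).
h_c = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = next_word_distribution(h_c, W)
```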
6
RNN 1/2
•
→ (N) M A
RNN RNN
<BOS> I
[P (x1 | c1 ), P(x2 | c1 ), …, ]
RNN
have
[P (x1 | c2 ), P(x2 | c2 ), …, ]
……
A
Hθ HθWT
A HθWT = log A
… we are asking the following … such that $P_\theta(X \mid c) = P^*(X \mid c)$ …
We start by looking at a Softm…
2.1 SOFTMAX
The majority of parametric l… (or hidden state) $h_c$ and a wo… specifically, the model distrib… where $h_c$ is a function of $c$, a… the context vector $h_c$ and th… $h_c^\top w_x$ is called a logit.
To help discuss the expressiv…
$$
H_\theta = \begin{bmatrix} h_{c_1}^\top \\ h_{c_2}^\top \\ \cdots \\ h_{c_N}^\top \end{bmatrix}; \quad
W_\theta = \begin{bmatrix} w_{x_1}^\top \\ w_{x_2}^\top \\ \cdots \\ w_{x_M}^\top \end{bmatrix}
$$
where $H_\theta \in \mathbb{R}^{N \times d}$, $W_\theta \in$ … context vectors, word embed…
We use the subscript $\theta$ becaus…
7
Softmax bottleneck [Yang+ 18]
• RNN A
A*
• HθWT = log A ≒ log A*
• Softmax bottleneck log A*
RNN
HθWT
• rank(HθWT) ≤ d, where d is typically 500 ~ 1000
• rank(log A*) can be as large as the vocabulary size, around 10k
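The rank argument behind the softmax bottleneck can be checked on a toy example: a logit matrix built as the product of an N×d matrix and a d×M matrix has rank at most d. The sketch below uses small, randomly chosen integer matrices (the sizes and values are assumptions for illustration) and a hypothetical helper `matrix_rank` that computes the exact rank over the rationals.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


random.seed(0)
N, M, d = 8, 10, 3  # toy sizes: contexts, vocabulary words, hidden dimension
H = [[random.randint(-3, 3) for _ in range(d)] for _ in range(N)]
W = [[random.randint(-3, 3) for _ in range(d)] for _ in range(M)]

# Logit matrix H W^T of shape N x M.
logits = [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in W] for h_row in H]
logit_rank = matrix_rank(logits)  # never exceeds d

# A hypothetical high-rank target (ones on the diagonal) for comparison.
target = [[1 if i == j else 0 for j in range(M)] for i in range(N)]
target_rank = matrix_rank(target)  # min(N, M) = 8
```

Because `logit_rank` is bounded by d while `target_rank` is not, no choice of H and W with d = 3 reproduces such a target exactly, which is the bottleneck in miniature.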
8
Softmax bottleneck
• HθWT
– [Yang+ 18,
Takase+ 18]
– Sigmoid Sigsoftmax
[Kanai+ 18]
•
– [Yang+ 18, Takase+ 18]
Softmax Sigsoftmax
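For comparison, softmax and sigsoftmax can be written side by side. The sketch assumes the sigsoftmax definition of [Kanai+ 18], in which each exponentiated logit is multiplied by its sigmoid before normalization; the function names and toy logits are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigsoftmax(logits):
    """Normalize exp(z_i) * sigmoid(z_i) instead of exp(z_i)."""
    m = max(logits)
    # Subtracting m inside exp cancels in the normalization;
    # the sigmoid is applied to the raw logit.
    weights = [math.exp(z - m) * sigmoid(z) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


toy_logits = [2.0, 0.0, -2.0]  # made-up values
p_soft = softmax(toy_logits)
p_sig = sigsoftmax(toy_logits)
```

Both functions preserve the ordering of the logits, but sigsoftmax is not a plain exponential of the logit, so its log-output is no longer a shifted copy of a low-rank matrix.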
9
Piecewise Linear Increasing Function (PLIF)
• f
– si bi x
– T K
x Vi b0
si 0
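A minimal sketch of a piecewise linear increasing function, under stated assumptions: K equal-width pieces on [-T, T], one slope s_i per piece constrained to be non-negative, a starting value b0 at x = -T, and linear extrapolation outside the interval using the edge slopes. The class name `PLIF`, the equal-width knot layout, and the extrapolation rule are choices made here for illustration and may differ from the paper's exact parameterization.

```python
class PLIF:
    """Continuous piecewise linear function with non-negative slopes."""

    def __init__(self, slopes, T, b0=0.0):
        if any(s < 0 for s in slopes):
            raise ValueError("slopes must be non-negative")
        self.slopes = list(slopes)
        self.T = float(T)
        self.K = len(self.slopes)
        self.width = 2.0 * self.T / self.K
        # values[i] is f at the left edge of piece i; continuity fixes
        # every later offset once b0 and the slopes are given.
        self.values = [float(b0)]
        for s in self.slopes:
            self.values.append(self.values[-1] + s * self.width)

    def __call__(self, x):
        i = int((x + self.T) // self.width)
        i = min(max(i, 0), self.K - 1)  # clamp: extrapolate with edge slopes
        left = -self.T + i * self.width
        return self.values[i] + self.slopes[i] * (x - left)


# Toy function with K = 4 pieces on [-2, 2] (values are made up).
f = PLIF(slopes=[0.5, 2.0, 0.0, 1.0], T=2.0, b0=-1.0)
samples = [f(-3.0 + 0.25 * i) for i in range(25)]
```

In a language model, such a function would be applied pointwise to each logit before the softmax; in training, the slopes would be learnable parameters kept non-negative, for example through a reparameterization.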
10
•
–
• PTB WT2
– unk
• 3 LSTM [Merity+ 17]
                 PTB        WikiText-2
Vocab            10,000     33,278
#Token  Train    929,590    2,088,628
        Valid    73,761     217,646
        Test     82,431     245,569
Table 1: Statistics of PTB and WikiText-2.
11
• Softmax Linear-softmax
• [Yang+ 18]
• [Yang+ 18]
•
– [Yang+ 18] PPL PTB 54.44 WT2 61.45
• PTB 22M
– [Takase+ 18] PPL PTB 52.38 WT2 58.03
– Transformer-XL [Dai+ 19] PTB 54.52
Single model perplexities on validation and test sets on Penn Treebank and WikiText-2 datasets. For a fair comparison, … re obtained by running the respective open-source implementations locally, however, being comparable to the published … show the training time per epoch when using a single Tesla P100 GPU.

                                                                  Penn Treebank                           WikiText-2
Model                                                             #Param  Valid PPL  Test PPL  #Sec/Ep    #Param  Valid PPL  Test PPL  #Sec/Ep
Linear-Softmax w/ AWD-LSTM, w/o finetune (Merity et al., 2017)    24.2M   60.83      58.37     ~60        33M     68.11      65.22     ~120
Ours LMS-PLIF, 10^5 knots, w/ AWD-LSTM, w/o finetune              24.4M   59.45      57.25     ~70        33.2M   67.87      64.86     ~150
MoS, K = 15, w/ AWD-LSTM, w/o finetune (Yang et al., 2017)        26.6M   58.58      56.43     ~150       33M     66.01      63.33     ~550
MoS (15 comp) + our PLIF (10^6 knots), w/ AWD-LSTM, w/o finetune  28.6M   58.20      56.02     ~220       -       -          -         -
… and WikiText-2 (Merity et al., 2016). These datasets … word vocabulary sizes of 10,000 and 33,000.
… We integrate our PLIF layer on top of the state of the art language models of AWD-LSTM (Merity et al., …) and AWD-LSTM+MoS (Yang et al., 2017). Additionally, our PLIF architecture can also be combined with MoS … of standard Softmax. We call this model "MoS + …". We use the AWD-LSTM open source implementation … All the models in table 1 were ran locally and we … these results; we did this to understand how differ…
Figure 4. Learned function for the model in table 2
12
Perplexity PPL
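As a reminder of what the PPL columns measure, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, with a hypothetical helper `perplexity` and made-up token probabilities:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability per token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among 4 words.
ppl_uniform = perplexity([0.25] * 8)
```

Lower perplexity means the model assigns higher probability to the held-out text.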
• Softmax bottleneck
–
– [Yang+ 18, Takase+ 18]
–
• [Yang+ 18, Takase+ 18]
•
13
