Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

•

0 likes•303 views

Sho Takase

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities（ICML 2019）の論文紹介、第11回最先端NLP勉強会にて

Engineering

Breaking the Softmax Bottleneck
via Learnable Monotonic
Pointwise Non-linearities
Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul,
Aliaksei Severyn
ICML 2019
2019/9/28
1

• [Yang+ 18]
Softmax bottleneck
– Softmax bottleneck
logit
–
• [Yang+ 18]
–
–
2

•
– P(I have a dream) > P(a have I dream)
–
• perplexity
–
– →
NTTW
2i
Encoder-Decoder 2
2RNN Encoder-Decoder
P(I have a dream) > P(a have I dream) > P(fuga spam hoge
:
•  2RNN e
•  2 P
P(I have a dream)
= P(I)P(have | I)P(a | I have)P(dream | I have a)
I have a dream
3

RNN
• RNN
•
RNN RNN RNN RNN
<BOS> I have a
P(I) P(have | I) P(a | I have) P(dream | I have a)
5

RNN 1/2
• c
P
RNN RNN
<BOS> I
hc
c
[P (x1 | c ), P(x2 | c ), …, ]
hc W
6

RNN 1/2
•
→ (N) M A
RNN RNN
<BOS> I
[P (x1 | c1 ), P(x2 | c1 ), …, ]
RNN
have
[P (x1 | c2 ), P(x2 | c2 ), …, ]
……
A
Hθ HθWT
A HθWT = log A
we are asking the following
such that P✓(X|c) = P⇤
(X|c
We start by looking at a Softm
2.1 SOFTMAX
The majority of parametric l
(or hidden state) hc and a wo
speciﬁcally, the model distrib
where hc is a function of c, a
the context vector hc and th
h>
c wx is called a logit.
To help discuss the expressiv
H✓ =
2
6
6
4
h>
c1
h>
c2
· · ·
h>
cN
3
7
7
5 ; W✓ =
2
6
6
4
w>
x
w>
x
· ·
w>
x
where H✓ 2 RN⇥d
, W✓ 2
context vectors, word embed
We use the subscript ✓ becaus7

Softmax bottleneck [Yang+ 18]
• RNN A
A*
• HθWT = log A ≒ log A*
• Softmax bottleneck log A*
RNN
HθWT
• HθWT
500 ~ 1000
• log A* 10k
8

Softmax bottleneck
• HθWT
– [Yang+ 18,
Takase+ 18]
– Sigmoid Sigsoftmax
[Kanai+ 18]
•
– [Yang+ 18, Takase+ 18]
Softmax Sigsoftmax
9

Piecewise Linear Increasing Function (PLIF)
• f
– si bi x
– T K
x Vi b0
si 0
10

•
–
• PTB WT2
– unk
• 3 LSTM [Merity+ 17]
1)
on,
ne
PTB WikiText-2
Vocab 10,000 33,278
Train 929,590 2,088,628
#Token Valid 73,761 217,646
Test 82,431 245,569
Table 1: Statistics of PTB and WikiText-2. 11

• Softmax Linear-softmax
• [Yang+ 18]
• [Yang+ 18]
•
– [Yang+ 18] PPL PTB 54.44 WT2 61.45
• PTB 22M
– [Takase+ 18] PPL PTB 52.38 WT2 58.03
– Transformer-XL [Dai+ 19] PTB 54.52
Single model perplexities on validation and test sets on Penn Treebank and WikiText-2 datasets. For a fair comparison,
re obtained by running the respective open-source implementations locally, however, being comparable to the published
show the training time per epoch when using a single Tesla P100 GPU.
PENN TREEBANK WIKITEXT-2
#PARAM VALID PPL TEST PPL #SEC/EP #PARAM VALID PPL TEST PPL #SEC/EP
LINEAR-SOFTMAX
W/ AWD-LSTM, W/O FINETUNE
(MERITY ET AL., 2017)
24.2M 60.83 58.37 ⇠60 33M 68.11 65.22 ⇠120
OURS LMS-PLIF, 105
KNOTS
W/ AWD-LSTM, W/O FINETUNE
24.4M 59.45 57.25 ⇠70 33.2M 67.87 64.86 ⇠150
MOS, K = 15
W/ AWD-LSTM, W/O FINETUNE
(YANG ET AL., 2017)
26.6M 58.58 56.43 ⇠150 33M 66.01 63.33 ⇠550
MOS(15 COMP) +
OUR PLIF (106
KNOTS)
W/ AWD-LSTM, W/O FINETUNE
28.6M 58.20 56.02 ⇠220 - - - -
nd WikiText-2 (Merity et al., 2016). These datasets
ord vocabulary sizes of 10,000 and 33,000.
nes. We integrate our PLIF layer on top of the state
art language models of AWD-LSTM (Merity et al.,
nd AWD-LSTM+MoS (Yang et al., 2017). Addition-
r PLIF architecture can also be combined with MoS
of standard Softmax. We call this model "MoS +
We use the AWD-LSTM open source implementa-
All the models in table 1 5
were ran locally and we
hese results; we did this to understand how differ-
Figure 4. Learned function for the model in table 2
12
Perplexity PPL

• Softmax bottleneck
–
– [Yang+ 18, Takase+ 18]
–
• [Yang+ 18, Takase+ 18]
•
13

What's hot

VLSI Implementation of 32-Bit Unsigned Multiplier Using CSLA & CLAAIJMTST Journal

Graph 500 DISLIB powered optimized versionAnton Korzh

Synchronization Issues in OFDM SystemsDeeptanu Datta

Ripple look-ahead-headerAbid Ali

Compilation of COSMO for GPU using LLVMLinaro

carry maskable look ahead adder for approximate computingMahesh Dhava

Modulation techniques matlab_codeВахидреза Мохсени

Travelling salesman problemDimitris Mavrommatis

Pilot Contamination Mitigation for Wideband Massive MIMO: Number of Cells Vs ...T. E. BOGALE

Arm tools and roadmap for SVE compiler supportLinaro

04 -ece_3125_~_ece_3242_-_oct_10_202_-_assignment_1_-_due_oct_17_2012Emad ALmarday

Adaptive Channel Prediction, Beamforming and Scheduling Design for 5G V2I Net...T. E. BOGALE

Poster SCGlowTTS Interspeech 2021Bilkent University

CNN Attention NetworksTaeoh Kim

Chennai python augustmeetupElectronics and Communication Engineering, Institute of Road and Transport Technology

Iisrt zzzz shamili chIISRT

A Generate-Test-Aggregate Parallel Programming Library on SparkYu Liu

Lecture set 5Gopi Saiteja

A comparative study of different multiplier designsHoopeer Hoopeer

Using R in remote computer clustersBurak Himmetoglu

What's hot (20)

VLSI Implementation of 32-Bit Unsigned Multiplier Using CSLA & CLAA

Graph 500 DISLIB powered optimized version

Synchronization Issues in OFDM Systems

Ripple look-ahead-header

Compilation of COSMO for GPU using LLVM

carry maskable look ahead adder for approximate computing

Modulation techniques matlab_code

Travelling salesman problem

Pilot Contamination Mitigation for Wideband Massive MIMO: Number of Cells Vs ...

Arm tools and roadmap for SVE compiler support

04 -ece_3125_~_ece_3242_-_oct_10_202_-_assignment_1_-_due_oct_17_2012

Adaptive Channel Prediction, Beamforming and Scheduling Design for 5G V2I Net...

Poster SCGlowTTS Interspeech 2021

CNN Attention Networks

Chennai python augustmeetup

Iisrt zzzz shamili ch

A Generate-Test-Aggregate Parallel Programming Library on Spark

Lecture set 5

A comparative study of different multiplier designs

Using R in remote computer clusters

Similar to Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを目指したミックスドパラダイム型高位合成 (FPGAX 201...Shinya Takamaeda-Y

Dimensioning of IP BackboneEM Archieve

A CGRA-based Approachfor Accelerating Convolutional Neural NetworksShinya Takamaeda-Y

PG-Strom - GPGPU meets PostgreSQL, PGcon2015Kohei KaiGai

Accelerating microbiome research with OpenACCIgor Sfiligoi

“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...Edge AI and Vision Alliance

zkStudy Club: Subquadratic SNARGs in the Random Oracle ModelAlex Pruden

pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"YeChan(Paul) Kim

Similar to Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities (8)

Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを目指したミックスドパラダイム型高位合成 (FPGAX 201...

Dimensioning of IP Backbone

A CGRA-based Approachfor Accelerating Convolutional Neural Networks

PG-Strom - GPGPU meets PostgreSQL, PGcon2015

Accelerating microbiome research with OpenACC

“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...

zkStudy Club: Subquadratic SNARGs in the Random Oracle Model

pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"

Recently uploaded

Digital Communication Essentials: DPCM, DM, and ADM .pptxpritamlangde

data_management_and _data_science_cheat_sheet.pdfJiananWang21

Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Call Girls Mumbai

Hospital management system project report.pdfKamal Acharya

Hostel management system project report..pdfKamal Acharya

Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X79953056974 Low Rate Call Girls In Saket, Delhi NCR

Introduction to Serverless with AWS LambdaOmar Fathy

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture

AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248

Employee leave management system project.Kamal Acharya

Design For Accessibility: Getting it right from the startQuintin Balsdon

Theory of Time 2024 (Universal Theory for Everything)Ramkumar k

Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Servicemeghakumariji156

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30

💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...vershagrag

scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2

HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR

Moment Distribution Method For Btech CivilVinayVitekari

A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1

Thermal Engineering-R & A / C - unit - VDineshKumar4165

Recently uploaded (20)

Digital Communication Essentials: DPCM, DM, and ADM .pptx

data_management_and _data_science_cheat_sheet.pdf

Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...

Hospital management system project report.pdf

Hostel management system project report..pdf

Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7

Introduction to Serverless with AWS Lambda

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx

AIRCANVAS[1].pdf mini project for btech students

Employee leave management system project.

Design For Accessibility: Getting it right from the start

Theory of Time 2024 (Universal Theory for Everything)

Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf

💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...

scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...

HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR

Moment Distribution Method For Btech Civil

A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx

Thermal Engineering-R & A / C - unit - V

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

1. Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities Octavian-Eugen Ganea, Sylvain Gelly, Gary Bécigneul, Aliaksei Severyn ICML 2019 2019/9/28 1

2. • [Yang+ 18] Softmax bottleneck – Softmax bottleneck logit – • [Yang+ 18] – – 2

3. • – P(I have a dream) > P(a have I dream) – • perplexity – – → NTTW 2i Encoder-Decoder 2 2RNN Encoder-Decoder P(I have a dream) > P(a have I dream) > P(fuga spam hoge : •  2RNN e •  2 P P(I have a dream) = P(I)P(have | I)P(a | I have)P(dream | I have a) I have a dream 3

4. • Noisy channel model – P(T) – • – • 4

5. RNN • RNN • RNN RNN RNN RNN <BOS> I have a P(I) P(have | I) P(a | I have) P(dream | I have a) 5

6. RNN 1/2 • c P RNN RNN <BOS> I hc c [P (x1 | c ), P(x2 | c ), …, ] hc W 6

7. RNN 1/2 • → (N) M A RNN RNN <BOS> I [P (x1 | c1 ), P(x2 | c1 ), …, ] RNN have [P (x1 | c2 ), P(x2 | c2 ), …, ] …… A Hθ HθWT A HθWT = log A we are asking the following such that P✓(X|c) = P⇤ (X|c We start by looking at a Softm 2.1 SOFTMAX The majority of parametric l (or hidden state) hc and a wo speciﬁcally, the model distrib where hc is a function of c, a the context vector hc and th h> c wx is called a logit. To help discuss the expressiv H✓ = 2 6 6 4 h> c1 h> c2 · · · h> cN 3 7 7 5 ; W✓ = 2 6 6 4 w> x w> x · · w> x where H✓ 2 RN⇥d , W✓ 2 context vectors, word embed We use the subscript ✓ becaus7

8. Softmax bottleneck [Yang+ 18] • RNN A A* • HθWT = log A ≒ log A* • Softmax bottleneck log A* RNN HθWT • HθWT 500 ~ 1000 • log A* 10k 8

9. Softmax bottleneck • HθWT – [Yang+ 18, Takase+ 18] – Sigmoid Sigsoftmax [Kanai+ 18] • – [Yang+ 18, Takase+ 18] Softmax Sigsoftmax 9

10. Piecewise Linear Increasing Function (PLIF) • f – si bi x – T K x Vi b0 si 0 10

11. • – • PTB WT2 – unk • 3 LSTM [Merity+ 17] 1) on, ne PTB WikiText-2 Vocab 10,000 33,278 Train 929,590 2,088,628 #Token Valid 73,761 217,646 Test 82,431 245,569 Table 1: Statistics of PTB and WikiText-2. 11

12. • Softmax Linear-softmax • [Yang+ 18] • [Yang+ 18] • – [Yang+ 18] PPL PTB 54.44 WT2 61.45 • PTB 22M – [Takase+ 18] PPL PTB 52.38 WT2 58.03 – Transformer-XL [Dai+ 19] PTB 54.52 Single model perplexities on validation and test sets on Penn Treebank and WikiText-2 datasets. For a fair comparison, re obtained by running the respective open-source implementations locally, however, being comparable to the published show the training time per epoch when using a single Tesla P100 GPU. PENN TREEBANK WIKITEXT-2 #PARAM VALID PPL TEST PPL #SEC/EP #PARAM VALID PPL TEST PPL #SEC/EP LINEAR-SOFTMAX W/ AWD-LSTM, W/O FINETUNE (MERITY ET AL., 2017) 24.2M 60.83 58.37 ⇠60 33M 68.11 65.22 ⇠120 OURS LMS-PLIF, 105 KNOTS W/ AWD-LSTM, W/O FINETUNE 24.4M 59.45 57.25 ⇠70 33.2M 67.87 64.86 ⇠150 MOS, K = 15 W/ AWD-LSTM, W/O FINETUNE (YANG ET AL., 2017) 26.6M 58.58 56.43 ⇠150 33M 66.01 63.33 ⇠550 MOS(15 COMP) + OUR PLIF (106 KNOTS) W/ AWD-LSTM, W/O FINETUNE 28.6M 58.20 56.02 ⇠220 - - - - nd WikiText-2 (Merity et al., 2016). These datasets ord vocabulary sizes of 10,000 and 33,000. nes. We integrate our PLIF layer on top of the state art language models of AWD-LSTM (Merity et al., nd AWD-LSTM+MoS (Yang et al., 2017). Addition- r PLIF architecture can also be combined with MoS of standard Softmax. We call this model "MoS + We use the AWD-LSTM open source implementa- All the models in table 1 5 were ran locally and we hese results; we did this to understand how differ- Figure 4. Learned function for the model in table 2 12 Perplexity PPL

13. • Softmax bottleneck – – [Yang+ 18, Takase+ 18] – • [Yang+ 18, Takase+ 18] • 13

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities

Similar to Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities (8)

More from Sho Takase

More from Sho Takase (14)

Recently uploaded

Recently uploaded (20)

Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities