Conformer: Convolution-augmented Transformer for Speech Recognition
Presented by: June-Woo Kim
Kyungpook National University
6, Nov. 2020.
Gulati, Anmol, et al
Abstract
• This paper achieved the best of both worlds by studying how to combine convolutional neural networks (CNNs) and
Transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way.
• This paper proposed the convolution-augmented transformer for speech recognition, named Conformer, and it
significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.
• On the LibriSpeech benchmark, they report the following Word Error Rate (WER) on the test/test-other sets:
• 2.1%/4.3% without using a language model.
• 1.9%/3.9% with an external language model.
Background
• End-to-end automatic speech recognition (ASR) systems based on neural networks have seen large improvements
in recent years.
• RNNs have been the de facto choice for ASR, as they can model the temporal dependencies in audio sequences
effectively.
• Recently, Transformer [1] architecture based on self-attention has enjoyed widespread adoption for modeling sequences
due to its ability to capture long distance interactions and the high training efficiency.
• Alternatively, CNNs have also been successful for ASR; they capture local context progressively via local receptive fields,
layer by layer.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Background
• However, models with self-attention or convolution each have their limitations.
• While Transformers are good at modeling long-range global context, they are less capable of extracting fine-grained local
feature patterns.
• CNNs, on the other hand, exploit local information and are used as the de-facto computational block in vision.
• One limitation of using local connectivity is that we need many more layers or parameters to capture global information.
• To combat this issue, contemporary work ContextNet [2] adopts the squeeze-and-excitation module [3] in each
residual block to capture longer context.
• Squeeze-and-excitation module’s goal is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
[2] Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
[3] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
Background: SENet (Squeeze-and-Excitation Networks)
• The goal of SENet is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
• Contributions of this paper:
• SE blocks can be attached directly to existing architectures such as VGG, GoogLeNet, and ResNet.
• The improvement in model performance is significant relative to the increase in parameters.
• As a result, model complexity and computational burden do not increase significantly.
Background: SENet (Squeeze-and-Excitation Networks)
• Squeeze: Global information embedding
• This paper uses GAP (Global Average Pooling), one of the most common operations for summarizing the important
information.
• The paper describes GAP as compressing global spatial information into a per-channel descriptor.
• Excitation: Adaptive recalibration
• Captures the channel-wise dependencies.
• In this paper, this is computed simply with fully connected layers and nonlinear activations (see the formula below).
• Finally, 𝑠 is multiplied channel-wise with each of the 𝐶 feature maps that preceded GAP in the original network.
$s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right)$
where $\delta$ = ReLU, $\sigma$ = sigmoid, and $W_1, W_2$ are the weights of the fully connected layers.
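To make the squeeze/excitation steps concrete, here is a minimal PyTorch sketch of an SE block (not code from the paper); the reduction ratio r = 16 is the SENet default, and the input shape (batch, channels, H, W) is an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: GAP -> FC -> ReLU -> FC -> Sigmoid -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),    # W1
            nn.ReLU(inplace=True),                         # delta
            nn.Linear(channels // reduction, channels),    # W2
            nn.Sigmoid(),                                  # sigma
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)        # squeeze: (B, C)
        s = self.excite(z).view(b, c, 1, 1)   # excitation: channel weights in (0, 1)
        return x * s                          # recalibrate each channel


# usage: recalibrate a 64-channel feature map
feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```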
Background: SENet (Squeeze-and-Excitation Networks)
• Exemplars: SE-Inception and SE-ResNet
Background
• However, the squeeze-and-excitation module is still limited in capturing dynamic global context, as it only applies a
global average over the entire sequence.
• Recent works have shown that combining convolution and self-attention improves over using them individually.
• Papers like [4], [5] have augmented self-attention with relative position based information that maintains equivariance.
[4] Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
[5] Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." International Conference on Learning Representations. 2018.
Method
• This paper showed how to organically combine convolutions with self-attention in ASR models.
• The distinctive feature of their model is the use of Conformer blocks in place of Transformer blocks.
• A Conformer block is composed of four modules stacked together: a feed-forward module, a self-attention module, a convolution module, and a second feed-forward
module at the end.
Method: Multi-Headed Self-Attention Module
• This paper employed multi-headed self-attention (MHSA) while integrating an important technique from
Transformer-XL, the relative sinusoidal positional encoding scheme.
• The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the
resulting encoder is more robust to variance in utterance length.
• This paper used pre-norm residual units [6, 7] with dropout, which helps in training and regularizing deeper models (a simplified sketch of this module follows below).
[6] Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[7] Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).
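A simplified sketch of the pre-norm residual MHSA module. It uses PyTorch's standard nn.MultiheadAttention with absolute positions rather than the Transformer-XL relative sinusoidal positional encoding, so it only illustrates the pre-norm/dropout/residual structure, not the paper's exact attention; d_model, num_heads, and dropout values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MHSAModule(nn.Module):
    """Pre-norm residual MHSA: y = x + Dropout(SelfAttention(LayerNorm(x)))."""
    def __init__(self, d_model: int = 256, num_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # NOTE: plain absolute-position self-attention; the Conformer paper uses
        # Transformer-XL style relative positional encoding instead.
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)                       # pre-norm on the residual branch
        y, _ = self.attn(y, y, y, need_weights=False)
        return x + self.dropout(y)             # residual connection


x = torch.randn(2, 100, 256)                   # (batch, time, d_model)
print(MHSAModule()(x).shape)                   # torch.Size([2, 100, 256])
```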
Method: Convolution module
• The convolution module starts with a gating mechanism [8]—a pointwise convolution and a gated linear unit
(GLU).
• This is followed by a single 1-D depthwise convolution layer.
• Batchnorm is deployed just after the convolution to aid in training deep models.
• [8] shows this mechanism to be useful for language modeling, as it allows the model to select which words or
features are relevant for predicting the next word (see the sketch below).
[8] Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. 2017.
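A sketch of the convolution module as described above: pointwise convolution with GLU gating, a 1-D depthwise convolution, batch norm, Swish, a final pointwise convolution, and a residual connection. The d_model = 256, kernel_size = 31, dropout value, and the pre-norm LayerNorm placement are illustrative assumptions rather than details taken from the slide.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Conformer convolution module (sketch): pointwise conv + GLU gating,
    1-D depthwise conv, batch norm, Swish, pointwise conv, residual."""
    def __init__(self, d_model: int = 256, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)                       # gating over the channel dim
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()                         # Swish with beta = 1
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) -> conv layers expect (batch, channels, time)
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise1(y))               # (B, d_model, T) after gating
        y = self.swish(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y                                   # residual connection


x = torch.randn(2, 100, 256)
print(ConvModule()(x).shape)                           # torch.Size([2, 100, 256])
```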
Method: Feed Forward Module
• The Transformer architecture deploys a feed-forward module after the MHSA layer, composed of two linear
transformations with a nonlinear activation in between.
• A residual connection is added over the feed-forward layers, followed by layer normalization.
Method: Feed Forward Module
• However, this paper follows pre-norm residual units and applies layer normalization within the residual unit and
on the input before the first linear layer.
• This paper also applies the Swish activation [9] and dropout, which help regularize the network.
[9] Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017).
[9] proposed leveraging automatic search techniques to discover new activation functions, which yielded Swish:
$f(x) = x \cdot \sigma(\beta x)$
where $\sigma(z) = \left(1 + \exp(-z)\right)^{-1}$ and $\beta$ is either a constant or a trainable parameter.
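A sketch of the pre-norm feed-forward module with Swish and dropout as described above. The 4x hidden expansion is an assumption carried over from the standard Transformer FFN, and the half-step residual is applied by the caller (see the Conformer block on the next slide).

```python
import torch
import torch.nn as nn

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Swish: f(x) = x * sigmoid(beta * x); beta = 1 is PyTorch's nn.SiLU."""
    return x * torch.sigmoid(beta * x)

class FeedForwardModule(nn.Module):
    """Pre-norm FFN: LayerNorm -> Linear -> Swish -> Dropout -> Linear -> Dropout."""
    def __init__(self, d_model: int = 256, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),                     # pre-norm inside the residual
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),                                 # Swish activation
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)                             # residual added by the caller


x = torch.randn(2, 100, 256)
print(FeedForwardModule()(x).shape)                    # torch.Size([2, 100, 256])
```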
Method: Conformer Block
• Their proposed Conformer block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention
module and the Convolution module.
$\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i)$
$x_i' = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$
$x_i'' = x_i' + \mathrm{Conv}(x_i')$
$y_i = \mathrm{Layernorm}\left(x_i'' + \tfrac{1}{2}\,\mathrm{FFN}(x_i'')\right)$
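Putting the pieces together, a sketch of one Conformer block that follows the equations above, reusing the hypothetical FeedForwardModule, MHSAModule, and ConvModule sketches from the previous slides (the MHSA and convolution sketches already include their own residual connections, so only the two half-step FFN residuals and the final LayerNorm appear here).

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    """FFN (1/2) -> MHSA -> Conv -> FFN (1/2) -> LayerNorm, per the equations above."""
    def __init__(self, d_model: int = 256, num_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model)   # sketches defined on earlier slides
        self.mhsa = MHSAModule(d_model, num_heads)
        self.conv = ConvModule(d_model, kernel_size)
        self.ffn2 = FeedForwardModule(d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)    # x_tilde = x + 1/2 FFN(x)
        x = self.mhsa(x)              # x'  = x_tilde + MHSA(x_tilde)  (residual inside)
        x = self.conv(x)              # x'' = x' + Conv(x')            (residual inside)
        return self.norm(x + 0.5 * self.ffn2(x))   # y = Layernorm(x'' + 1/2 FFN(x''))
```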
Experiments
• Dataset: LibriSpeech
• 970 hours of labeled speech and an additional 800M-word-token text-only corpus for building the language model.
• Preprocessing
• 80-channel filterbank features computed from a 25 ms window with a stride of 10 ms.
• SpecAugment is also used, with frequency mask parameter F = 27.
• Ten time masks with maximum time-mask ratio p_S = 0.05, where the maximum size of each time mask is p_S times the
length of the utterance (sketched below).
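A rough numpy sketch of the masking described above, assuming a (time, 80) log-mel spectrogram: one frequency mask with F = 27 and ten time masks, each at most p_S x T frames wide. This is a simplification of the adaptive SpecAugment policy, not the paper's exact implementation.

```python
import numpy as np

def spec_augment(spec: np.ndarray, F: int = 27, n_time_masks: int = 10,
                 p_s: float = 0.05) -> np.ndarray:
    """Apply one frequency mask and several time masks to a (time, freq) spectrogram."""
    spec = spec.copy()
    T, n_mels = spec.shape

    # frequency mask: zero out f consecutive mel channels, f ~ U[0, F]
    f = np.random.randint(0, F + 1)
    f0 = np.random.randint(0, n_mels - f + 1)
    spec[:, f0:f0 + f] = 0.0

    # time masks: each at most p_s * T frames wide
    max_t = max(1, int(p_s * T))
    for _ in range(n_time_masks):
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, T - t + 1)
        spec[t0:t0 + t, :] = 0.0
    return spec


masked = spec_augment(np.random.randn(1000, 80))
print(masked.shape)  # (1000, 80)
```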
Hyper-parameters for Conformer
• Decoder: a single LSTM layer.
• Dropout is applied in each residual unit of the Conformer.
• P_drop = 0.1
• ℓ2 regularization with weight 1e-6 is also added to all trainable weights.
• Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10⁻⁹.
• Transformer learning rate schedule with 10k warm-up steps (see the sketch below).
• LM
• 3-layer LSTM with width 4096, trained on the LibriSpeech language-model corpus with the LibriSpeech 960h transcripts
added, tokenized with the 1k WPM (word-piece model) built from LibriSpeech 960h.
• The LM has a word-level perplexity of 63.9 on the dev-set transcripts.
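The "Transformer learning rate schedule" is the warm-up/inverse-square-root schedule from Vaswani et al. [1]; below is a small sketch with the 10k warm-up steps mentioned above. The d_model-based peak scaling is the original formulation, and the paper's actual peak learning rate is not specified on this slide.

```python
def transformer_lr(step: int, d_model: int = 256, warmup_steps: int = 10_000) -> float:
    """lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# linear warm-up for 10k steps, then inverse-square-root decay
for s in (1_000, 10_000, 100_000):
    print(s, round(transformer_lr(s), 6))
```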
Results
Ablation Studies
• Conformer Block vs. Transformer Block
• Among all the differences, the convolution sub-block is the most important feature, while having a Macaron-style FFN pair is also
more effective than a single FFN with the same number of parameters.
• They also found that using Swish activations led to faster convergence in the Conformer models.
Ablation Studies
• Combinations of Convolution and Transformer Modules
• First, replacing the depthwise convolution in the convolution module with a lightweight convolution significantly degrades
performance, especially on the dev-other set.
• They also found that splitting the input into parallel branches of a multi-headed self-attention module and a convolution module,
with their outputs concatenated, worsens performance compared to the proposed architecture.
Ablation Studies
• Macaron FFN
• Table 5 shows the impact of changing the Conformer block to use a single FFN or full-step residuals.
Ablation Studies
• # of Attention Heads
• They performed experiments to study the effect of varying the number of attention heads from 4 to 32 in their large model,
using the same number of heads in all layers.
• They found that increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other set.
Ablation Studies
• Convolution Kernel Sizes
• They swept the kernel size over {3, 7, 17, 32, 65} in the large model, using the same kernel size for all layers.
• They found that performance improves with larger kernel sizes up to 17 and 32,
but worsens with kernel size 65.
Contribution
• In this work, the authors introduced Conformer, an architecture that integrates components from CNNs and
Transformers for end-to-end speech recognition.
• They studied the importance of each component, and demonstrated that the inclusion of convolution modules is
critical to the performance of the Conformer model.
• The model exhibits better accuracy with fewer parameters than previous work on the LibriSpeech dataset, and
achieves a new state-of-the-art performance at 1.9%/3.9% for test/testother.
Reference
• Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv
preprint arXiv:2005.03191 (2020).
• Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer
vision and pattern recognition. 2018.
• Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
• Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension."
International Conference on Learning Representations. 2018.
• Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. 2019.
• Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv
preprint arXiv:1910.05895 (2019).
• Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine
learning. 2017.
• Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint
arXiv:1710.05941 (2017).
Thank you!

Editor's Notes
• Swish behavior depending on β: β = 1 gives the Sigmoid-weighted Linear Unit (SiLU), originally proposed for reinforcement learning; β = 0 gives the scaled linear function f(x) = x/2; β → ∞ approaches ReLU.