Conformer, Gulati, Anmol, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." arXiv preprint arXiv:2005.08100 (2020). review by June-Woo Kim
2. Abstract
• This paper achieved the best of both worlds by studying how to combine convolution neural networks and
transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way.
• This paper proposed the convolution-augmented Transformer for speech recognition, named Conformer, which
significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
• On the LibriSpeech benchmark, they achieved state-of-the-art scores on the test/test-other sets:
• 2.1%/4.3% Word Error Rate (WER) without using a language model.
• 1.9%/3.9% WER with an external language model.
3. Background
• End-to-end automatic speech recognition (ASR) systems based on neural networks have seen large improvements
in recent years.
• RNNs have been the de facto choice for ASR, as they can model the temporal dependencies in audio sequences
effectively.
• Recently, the Transformer [1] architecture based on self-attention has enjoyed widespread adoption for modeling sequences
due to its ability to capture long-distance interactions and its high training efficiency.
• Alternatively, CNNs have also been successful for ASR, capturing local context progressively via a local receptive field
layer by layer.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
4. Background
• However, models based on self-attention or convolution each have their own limitations.
• While Transformers are good at modeling long-range global context, they are less capable of extracting fine-grained local
feature patterns.
• CNNs, on the other hand, exploit local information and are used as the de-facto computational block in vision.
• One limitation of using local connectivity is that we need many more layers or parameters to capture global information.
• To combat this issue, contemporary work ContextNet [2] adopts the squeeze-and-excitation module [3] in each
residual block to capture longer context.
• Squeeze-and-excitation module’s goal is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
[2] Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." arXiv preprint arXiv:2005.03191 (2020).
[3] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
5. Background: SENet (Squeeze-and-Excitation Networks)
• The goal of SENet is to improve the representational power of a network by explicitly modelling the
interdependencies between the channels of its convolutional features.
• Contribution of this paper
• It can be attached directly to any network such as VGG, GoogLeNet, ResNet, etc.
• The improvement in model performance is significant compared to the increase in parameters.
• This has the advantage that model complexity and computational burden do not increase significantly.
6. Background: SENet (Squeeze-and-Excitation Networks)
• Squeeze: Global information embedding
• This paper used GAP (Global Average Pooling), one of the most common methodologies for extracting the important
information.
• This paper described that using GAP can compress global spatial information into a channel descriptor.
• Excitation: Adaptive recalibration
• Calculate the channel-wise dependencies.
• In this paper, this is computed simply by applying fully connected layers and nonlinear functions.
• Finally, 𝑠 and each of the 𝐶 feature maps before GAP in the existing network are multiplied.
𝑠 = 𝐹𝑒𝑥(𝑧, 𝑊) = 𝜎(𝑊2 𝛿(𝑊1 𝑧)),
where 𝛿 = ReLU, 𝜎 = Sigmoid, and 𝑊1, 𝑊2 are the weights of the two FC layers.
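The squeeze, excitation, and recalibration steps can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the channel count and the reduction ratio r = 2 are chosen arbitrarily for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature tensor.

    Squeeze: global average pooling compresses each channel to a scalar z_c.
    Excitation: s = sigmoid(W2 @ relu(W1 @ z)) gives per-channel weights.
    Recalibration: each feature map is rescaled by its weight s_c.
    """
    z = u.mean(axis=(1, 2))                      # squeeze: (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))    # excitation: (C,)
    return u * s[:, None, None]                  # channel-wise rescaling

rng = np.random.default_rng(0)
u = rng.standard_normal((4, 8, 8))   # C = 4 channels, 8x8 spatial maps
w1 = rng.standard_normal((2, 4))     # bottleneck FC: (C/r, C) with r = 2
w2 = rng.standard_normal((4, 2))     # expansion FC:  (C, C/r)
v = se_block(u, w1, w2)
```

Because each sigmoid output lies in (0, 1), every channel is scaled down by a single factor, which is what lets the network emphasize informative channels and suppress the rest.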
8. Background
• However, the squeeze-and-excitation module is still limited in capturing dynamic global context, as it only applies a global
averaging over the entire sequence.
• Recent works have shown that combining convolution and self-attention improves over using them individually.
• Papers like [4], [5] have augmented self-attention with relative position based information that maintains equivariance.
[4] Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
[5] Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." International Conference on Learning Representations. 2018.
9. Method
• This paper showed how to organically combine convolutions with self-attention in ASR models.
• The distinctive feature of the model is the use of Conformer blocks in place of Transformer blocks.
• A Conformer block is composed of four modules stacked together: a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module at the end.
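The four-module stack can be sketched as follows. The half-step (0.5x) feed-forward scaling and the final layer norm follow the paper's Macaron-style block; the stub sub-modules here are placeholders for illustration, not the actual layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def conformer_block(x, ffn1, mhsa, conv, ffn2):
    """One Conformer block: two half-step feed-forward modules sandwiching
    the self-attention and convolution modules, each wrapped in a residual
    connection, with a final layer normalization."""
    x = x + 0.5 * ffn1(x)   # first (half-step) feed-forward module
    x = x + mhsa(x)         # multi-headed self-attention module
    x = x + conv(x)         # convolution module
    x = x + 0.5 * ffn2(x)   # second (half-step) feed-forward module
    return layer_norm(x)

# Identity stubs stand in for the real sub-modules, just to show the data
# flow over a (time, dim) sequence.
T, D = 6, 8
x = np.random.default_rng(1).standard_normal((T, D))
identity = lambda h: h
y = conformer_block(x, identity, identity, identity, identity)
assert y.shape == (T, D)
```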
10. Method: Multi-Headed Self-Attention Module
• This paper employed multi-headed self-attention (MHSA) while integrating an important technique from
Transformer-XL, the relative sinusoidal positional encoding scheme.
• The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the
resulting encoder is more robust to variance in utterance length.
• This paper used pre-norm residual units [6, 7] with dropout which helps training and regularizing deeper models.
[6] Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[7] Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).
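The pre-norm residual wrapping of the self-attention module can be sketched as follows. This is a single-head simplification for illustration; relative positional encoding and dropout are omitted, and the weight shapes are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over a (T, D) sequence.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def mhsa_module(x, wq, wk, wv):
    """Pre-norm residual unit: the input is layer-normalized *before* the
    attention sub-layer, then added back through the residual connection."""
    return x + self_attention(layer_norm(x), wq, wk, wv)

rng = np.random.default_rng(4)
T, D = 5, 8
x = rng.standard_normal((T, D))
wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))
y = mhsa_module(x, wq, wk, wv)
```

Note the contrast with post-norm: here normalization sits inside the residual branch, which is what [6, 7] report makes deeper stacks easier to train.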
11. Method: Convolution module
• The convolution module starts with a gating mechanism [8]—a pointwise convolution and a gated linear unit
(GLU).
• This is followed by a single 1-D depthwise convolution layer.
• Batchnorm is deployed just after the convolution to aid training deep models.
• [8] showed this mechanism to be useful for language modeling, as it allows the model to select which words or
features are relevant for predicting the next word.
[8] Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. 2017.
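The pipeline described above (pointwise conv → GLU → depthwise conv → normalization → Swish → pointwise conv) can be sketched in NumPy. This is a minimal sketch of the core data flow, not the actual implementation: a per-channel normalization stands in for batchnorm, and the entry layer norm and dropout are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x):
    # Gated linear unit: split channels into halves a, b; output a * sigmoid(b).
    a, b = np.split(x, 2, axis=0)
    return a * sigmoid(b)

def depthwise_conv1d(x, kernels):
    # x: (C, T); kernels: (C, K), one filter per channel, 'same' padding.
    C, T = x.shape
    K = kernels.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):
        for t in range(T):
            out[c, t] = xp[c, t:t + K] @ kernels[c]
    return out

def conv_module(x, w_in, dw_kernels, w_out):
    """Conformer convolution module over a (C, T) sequence: pointwise conv
    -> GLU -> 1-D depthwise conv -> normalization -> Swish -> pointwise conv."""
    h = w_in @ x                          # pointwise conv, expands C -> 2C
    h = glu(h)                            # gating reduces back to C channels
    h = depthwise_conv1d(h, dw_kernels)   # per-channel temporal filtering
    h = (h - h.mean(1, keepdims=True)) / (h.std(1, keepdims=True) + 1e-5)
    h = h * sigmoid(h)                    # Swish activation
    return w_out @ h                      # final pointwise projection

rng = np.random.default_rng(2)
C, T, K = 4, 10, 3
x = rng.standard_normal((C, T))
y = conv_module(x,
                rng.standard_normal((2 * C, C)),  # pointwise expansion
                rng.standard_normal((C, K)),      # depthwise kernels
                rng.standard_normal((C, C)))      # pointwise projection
```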
12. Method: Feed Forward Module
• The Transformer architecture deploys a feed-forward module after the MHSA layer, composed of two linear
transformations with a nonlinear activation in between.
• A residual connection is added over the feed-forward layers, followed by layer normalization.
13. Method: Feed Forward Module
• However, this paper followed pre-norm residual units, applying layer normalization within the residual unit and
on the input before the first linear layer.
• This paper also applied Swish activation [9] and dropout, which help regularize the network.
[9] Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017).
[9] proposed to leverage automatic search techniques to discover new activation functions, which yielded Swish:
𝑓(𝑥) = 𝑥 ∙ 𝜎(𝛽𝑥),
where 𝜎(𝑧) = (1 + exp(−𝑧))⁻¹ and 𝛽 is a constant or a trainable parameter.
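A minimal implementation of the Swish activation:

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x). With beta = 1 this is also
    known as SiLU; beta may be a fixed constant or a trainable parameter."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

# Swish is smooth and non-monotonic: roughly linear for large positive x,
# zero at the origin, with a small negative dip for negative x.
```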
15. Experiments
• Dataset: Librispeech
• 970 hours of labeled speech and an additional 800M word token text-only corpus for building language model.
• Preprocessing
• 80-channel filterbank features computed from a 25ms window with a stride of 10ms.
• SpecAugment is also used with mask parameter (𝐹 = 27).
• Ten time masks with maximum time-mask ratio (𝑝𝑠=0.05), where the maximum-size of the time mask is set to 𝑝𝑠 times the
length of the utterance.
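The time-masking policy described above (ten masks, each at most 𝑝𝑠 times the utterance length) can be sketched as follows; this is an illustrative sketch, and frequency masking with 𝐹 = 27 would work analogously along the other axis.

```python
import numpy as np

def apply_time_masks(spec, num_masks=10, ps=0.05, rng=None):
    """SpecAugment-style time masking on a (T, F) filterbank spectrogram.

    Each of `num_masks` masks zeroes a random span of consecutive frames
    whose width is drawn uniformly from [0, ps * T], so no single mask
    exceeds ps times the utterance length.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = spec.shape[0]
    out = spec.copy()
    max_width = int(ps * T)
    for _ in range(num_masks):
        w = int(rng.integers(0, max_width + 1))
        t0 = int(rng.integers(0, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    return out

rng = np.random.default_rng(3)
spec = rng.standard_normal((400, 80)) + 5.0   # 400 frames x 80 filterbanks
masked = apply_time_masks(spec, rng=rng)
```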
16. Hyper-parameters for Conformer
• Decoder: a single LSTM layer.
• Applied dropout in each residual unit of the conformer
• 𝑃𝑑𝑟𝑜𝑝 = 0.1
• 𝑙2 regularization with weight 1e−6 is also added to all the trainable weights.
• Adam optimizer with 𝛽1 = 0.9, 𝛽2 = 0.98 and 𝜖 = 10⁻⁹.
• Transformer learning rate schedule with 10k warm-up steps.
• LM
• 3-layer LSTM with width 4096 trained on the LibriSpeech language-model corpus with the LibriSpeech 960h transcripts
added, tokenized with the 1k word-piece model (WPM) built from LibriSpeech 960h.
• LM has word-level perplexity 63.9 on the dev-set transcripts.
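The "Transformer learning rate schedule" above commonly refers to the schedule from the original Transformer paper: linear warm-up followed by inverse-square-root decay. A sketch, where the d_model value is illustrative:

```python
def transformer_lr(step, d_model=512, warmup_steps=10000):
    """Transformer ('Noam') schedule: the rate rises linearly for
    `warmup_steps` steps, peaks at step == warmup_steps, then decays
    proportionally to the inverse square root of the step count."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

With 10k warm-up steps, for example, the rate at step 40k is half the peak rate, since 40000⁻⁰·⁵ = 10000⁻⁰·⁵ / 2.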
18. Ablation Studies
• Conformer Block vs. Transformer Block
• Among all differences, the convolution sub-block is the most important feature, while having a Macaron-style FFN pair is also
more effective than a single FFN with the same number of parameters.
• They also found that using Swish activations led to faster convergence in the Conformer models.
19. Ablation Studies
• Combinations of Convolution and Transformer Modules
• First, replacing the depthwise convolution in the convolution module with a lightweight convolution significantly drops
performance, especially on the dev-other set.
• They also found that splitting the input into parallel branches of a multi-headed self-attention module and a convolution
module, with their outputs concatenated, worsens performance compared to the proposed architecture.
20. Ablation Studies
• Macaron FFN
• Table 5 shows the impact of changing the Conformer block to use a single FFN or full-step residuals.
21. Ablation Studies
• # of Attention Heads
• They performed experiments to study the effect of varying the number of attention heads from 4 to 32 in the large model,
using the same number of heads in all layers.
• They found that increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other set.
22. Ablation Studies
• Convolution Kernel Sizes
• They swept the kernel size in {3, 7, 17, 32, 65} for the large model, using the same kernel size for all layers.
• They found that performance improves with larger kernel sizes up to 17 and 32, but worsens with kernel size 65.
23. Contribution
• In this work, authors introduced Conformer, an architecture that integrates components from CNNs and
Transformers for end-to-end speech recognition.
• They studied the importance of each component, and demonstrated that the inclusion of convolution modules is
critical to the performance of the Conformer model.
• The model exhibits better accuracy with fewer parameters than previous work on the LibriSpeech dataset, and
achieves a new state-of-the-art performance of 1.9%/3.9% WER on test/test-other.
24. Reference
• Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• Han, Wei, et al. "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global
Context." arXiv preprint arXiv:2005.03191 (2020).
• Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer
vision and pattern recognition. 2018.
• Yang, Baosong, et al. "Convolutional self-attention networks." arXiv preprint arXiv:1904.03107 (2019).
• Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension."
International Conference on Learning Representations. 2018.
• Wang, Qiang, et al. "Learning Deep Transformer Models for Machine Translation." Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. 2019.
• Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv
preprint arXiv:1910.05895 (2019).
• Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine
learning. 2017.
• Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint
arXiv:1710.05941 (2017).