Semantic Mask for Transformer
Based End-to-End Speech Recognition
Authors: Chengyi Wang*, Yu Wu*, Yujiao Du‡, Jinyu Li†, Shujie Liu*, Liang Lu†, Shuo Ren*, Guoli Ye†, Sheng Zhao†, Ming Zhou*
* Microsoft Research Asia, Beijing
† Microsoft Speech and Language Group
‡ Beijing University of Posts and Telecommunications
PAPER PRESENTATION
Whenty Ariyanti
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION ENGINEERING, NATIONAL CENTRAL UNIVERSITY
TAIWAN
March 23, 2020
OUTLINE
01 OVERVIEW
02 SEMANTIC MASKING
• Masking Strategy
• Why Semantic Mask Works?
03 MODEL ARCHITECTURE
• CNN Layer
• Transformer Block
• ASR Training and Decoding
04 EXPERIMENTS
• Librispeech 960h
• TedLium2
05 RESULTS
OVERVIEW
01 Attention-based encoder-decoder models have achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
02 This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch (without assuming prior knowledge such as alignments).
03 However, this model is prone to overfitting, especially when the amount of training data is limited.
04 Inspired by SpecAugment and BERT, the authors propose semantic-mask-based regularization for training this kind of end-to-end (E2E) model.
05 The idea is to mask the input features corresponding to a particular output token (e.g., a word or a word-piece).
06 This work studies the transformer-based model for ASR and performs experiments on the Librispeech 960h and TedLium2 datasets.
INDEX TERMS: End-to-End ASR, Transformer, Semantic Mask
BACKGROUND
End-to-End (E2E)
End-to-End (E2E) acoustic models, particularly those built on the attention-based encoder-decoder framework, have achieved competitive recognition accuracy on a wide range of speech datasets.
They learn the mapping from the input acoustic signals to the output transcriptions without decomposing the problem into separate modules such as lexicon modeling, acoustic modeling, and language modeling, as in the conventional hybrid architecture.
E2E Weaknesses:
• It is difficult to tune the strength of each component.
• The model tends to make grammatical errors, indicating that its language modeling power is weak.
• There is a mismatch between the training and evaluation data (due to the small amount of training data).
PROPOSED METHOD
To improve the generalization capacity of the model and strengthen its language modeling power, this study proposes a semantic mask approach (inspired by SpecAugment and BERT).
This method masks out the whole patch of features corresponding to an output token (e.g., a word or a word-piece) during training.
This study focuses on the transformer architecture, which was originally proposed for neural machine translation. Compared with RNNs, the transformer-based encoder can capture long-term correlations in a constant number of sequential operations, instead of the many steps of back-propagation through time (BPTT) required by RNNs.
SEMANTIC MASKING
MASKING STRATEGY
Figure 1. An example of semantic mask
01 APPROACH: Requires alignment information in order to perform the token-wise masking (as shown in Figure 1).
02 TOOLKIT: Uses the Montreal Forced Aligner, trained on the training data, to perform forced alignment between the acoustic signals and the transcription and obtain word-level timing information.
03 TRAINING: Randomly selects a percentage of the tokens and masks the corresponding speech segments in each iteration.
04 PROPOSED WORK: Randomly samples 15% of the tokens and sets each masked piece to the mean value of the whole utterance (see the sketch after this list).
05 MASKING STRATEGY: Also adopts the time warp, frequency masking, and time masking strategies (the idea of SpecAugment).
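Concretely, the token-wise masking can be written down as below. This is a minimal sketch assuming frame-level word spans from the forced aligner; the function name and data layout are illustrative, not the authors' code.

```python
# Hypothetical sketch of the semantic mask: mask ~15% of the word-aligned
# spans and fill them with the utterance mean (layout is an assumption).
import numpy as np

def semantic_mask(feats, alignments, p=0.15, rng=np.random):
    """feats: [T, 83] log-Mel filter bank (+pitch) features of one utterance.
    alignments: list of (start_frame, end_frame) spans, one per token,
    e.g. produced by the Montreal Forced Aligner."""
    masked = feats.copy()
    fill = feats.mean(axis=0)          # mean value of the whole utterance
    for start, end in alignments:
        if rng.random() < p:           # sample ~15% of the tokens
            masked[start:end] = fill   # mask the token's whole feature patch
    return masked
```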
SEMANTIC MASKING
WHY SEMANTIC MASK WORKS?
01 Spectrum augmentation (SpecAugment) is similar to this method: both propose to mask the spectrum for E2E model training, but the intuitions behind the two are different.
02 SpecAugment randomly masks the spectrum in order to add noise to the source input, making the E2E ASR problem harder and preventing over-fitting in large E2E models.
03 With the semantic mask, the E2E model has to predict a token based on other signals: tokens that have already been generated or other unmasked speech features (which alleviates over-fitting).
04 The semantic mask also reduces the hyper-parameter tuning workload of SpecAugment and is more robust when the variance of the input audio length is large.
CNN LAYER
Model Architecture
Figure 2. CNN Layer Architecture
01 Input signals are represented as a sequence of log-Mel filter bank features X = (x_0, ..., x_n), where each x_i is an 83-dim vector.
02 A VGG-like convolution block with layer normalization and max pooling is used (a sketch follows this list).
03 This specific architecture outperforms the Conv2D subsampling method.
04 A 1D-CNN is used in the decoder to extract local features, replacing the position embedding.
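The slide does not give the exact layer configuration, so the following PyTorch sketch shows one plausible VGG-like front end with max pooling and layer normalization; the channel counts and the 4x subsampling rate are assumptions.

```python
# A rough sketch of a VGG-like convolutional front end (hypothetical sizes).
import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    def __init__(self, in_dim=83, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 2x subsampling in time and frequency
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 4x total subsampling
        )
        self.proj = nn.Linear(128 * (in_dim // 4), out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):                           # x: [batch, time, in_dim]
        x = self.conv(x.unsqueeze(1))               # -> [batch, 128, time/4, in_dim/4]
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # flatten channels and freq
        return self.norm(self.proj(x))              # [batch, time/4, out_dim]
```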
TRANSFORMER BLOCK
Model Architecture
01 The transformer module consumes the outputs of the CNN and extracts features with a self-attention mechanism.
02 Suppose Q, K, and V are the inputs of a transformer block; its output is calculated as:
SelfAttention(Q, K, V) = softmax(QKᵀ / √d_k) V
03 Multi-head attention is proposed to enable dealing with multiple attentions:
Multihead(Q, K, V) = [H_1 … H_{d_head}] W_head, where H_i = SelfAttention(Q_i, K_i, V_i)
04 Residual connections, feed-forward layers, and layer normalization are indispensable parts of the Transformer (a sketch of the two attention formulas follows this list).
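The two formulas map directly to code; below is a minimal PyTorch sketch, where the head splitting and the projection matrix W_head are simplified assumptions.

```python
# Minimal sketch of scaled dot-product and multi-head attention.
import math
import torch

def self_attention(q, k, v):
    """SelfAttention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, w_head, n_heads):
    """Multihead(Q, K, V) = [H_1 ... H_h] W_head, H_i = SelfAttention(Q_i, K_i, V_i)."""
    heads = [self_attention(qi, ki, vi)          # one H_i per head
             for qi, ki, vi in zip(q.chunk(n_heads, dim=-1),
                                   k.chunk(n_heads, dim=-1),
                                   v.chunk(n_heads, dim=-1))]
    return torch.cat(heads, dim=-1) @ w_head     # concatenate and project
```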
ASR TRAINING AND DECODING
Model Architecture
01 Both the E2E model decoder and the CTC module predict the distribution of Y given the corresponding source X, denoted as P_s2s(Y|X) and P_ctc(Y|X).
02 The two negative log-likelihoods are weighted and averaged to train the model:
L = −α log P_s2s(Y|X) − (1 − α) log P_ctc(Y|X), where α is set to 0.7.
03 The scores of the E2E model P_s2s, the CTC score P_ctc, and an RNN-based language model P_rnn are combined in the decoding process (the formula appears only as an image on the slide; a hedged sketch follows this list).
04 The beam outputs are rescored based on another right-to-left language model P_r2l(Y) and the sentence length penalty Wordcount(Y), where P_trans_lm denotes the sentence generative probability given by a Transformer language model.
05 This follows the NLP community practice of reranking the outputs of a left-to-right s2s model with a right-to-left language model, since the right-to-left model is more sensitive to errors in the right part of a sentence.
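A hedged sketch of the training objective and of a standard joint CTC/attention decoding score follows. The loss matches the formula above; the decoding formula is shown only as an image on the slide, so the interpolation weights lam and gamma below are assumptions.

```python
# Sketch of the joint CTC/attention loss and an assumed decoding score.
import torch
import torch.nn.functional as F

def joint_loss(s2s_logits, ctc_log_probs, targets, in_lens, tgt_lens, alpha=0.7):
    """L = -alpha * log P_s2s(Y|X) - (1 - alpha) * log P_ctc(Y|X).
    s2s_logits: [batch, tgt_len, vocab]; ctc_log_probs: [T, batch, vocab]."""
    s2s = F.cross_entropy(s2s_logits.transpose(1, 2), targets)   # -log P_s2s
    ctc = F.ctc_loss(ctc_log_probs, targets, in_lens, tgt_lens)  # -log P_ctc
    return alpha * s2s + (1 - alpha) * ctc

def decode_score(log_p_s2s, log_p_ctc, log_p_rnn, lam=0.5, gamma=0.3):
    """Assumed shape of the one-pass decoding score:
    (1 - lam) * log P_s2s + lam * log P_ctc + gamma * log P_rnn."""
    return (1 - lam) * log_p_s2s + lam * log_p_ctc + gamma * log_p_rnn
```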
EXPERIMENTS
LIBRISPEECH 960h
Input signals are represented as a sequence of 80-dim log-Mel filter bank features with 3-dim pitch features.
Base model structure: 12 encoder layers, 6 decoder layers, attention dimension 512 with 8 heads, containing 75M parameters.
The model is trained for 40 epochs on 4 P40 GPUs, which takes 5 days to converge; speed perturbation is applied by changing the audio speed to 0.9, 1.0, and 1.1.
The learning rate decreases proportionally to the inverse square root of the step number after the 25,000th step (a sketch follows this list).
The transformer language model for rescoring is trained on the LibriSpeech language model corpus with the GPT-2 base setting.
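The schedule described above matches the widely used inverse-square-root (Noam-style) schedule; a minimal sketch, where base_lr and the warm-up shape before step 25,000 are assumptions:

```python
# Assumed learning-rate schedule: linear warm-up for 25,000 steps, then
# decay proportional to the inverse square root of the step number.
def learning_rate(step, base_lr=1.0, warmup=25000):
    step = max(step, 1)
    return base_lr * min(step * warmup ** -1.5, step ** -0.5)
```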
EXPERIMENTS
TEDLIUM2
The corpus consists of 207 hours of speech data with 90k transcripts.
The vocabulary size is set to 1,000.
Utterances with more than 3,000 frames or more than 400 characters are discarded.
The acoustic features are 80-dim log-Mel filter bank and 3-dim pitch features, normalized by the mean and standard deviation of the training set.
Table 1. Comparison on the Librispeech ASR benchmark
RESULTS
• All models are shallow-fused with the RNN language model
ANALYSIS
Performance
TEDLIUM2
Table 2. Ablation test of different masking methods. The fourth line is the default setting of SpecAugment, the fifth line uses the word mask to replace the random time mask, and the last line combines both methods on the time axis
Table 3. Experiment results on TEDLIUM2
RESULTS
CONCLUSION
This study elaborates a new architecture for the E2E model, achieving state-of-the-art performance on the Librispeech test set within the scope of E2E models.
It presents a semantic mask method for E2E speech recognition, which is able to train a model to better consider the whole audio context for disambiguation.
THANK YOU For Your Patience!