This document presents a thesis on efficient acoustic model refinement for low resource languages using semi-supervised learning methods. The author proposes iterative semi-supervised learning frameworks that exploit unlabeled data through progressive decoding. A baseline non-iterative procedure is also described, in which the most confident unlabeled utterances are added after a single decoding pass. Experimental results on a Tamil speech corpus show that the iterative frameworks yield lower word error rates than the baseline by refining the acoustic models over multiple iterations. The bin-based iterative framework additionally reduces computation time by avoiding a decode of the entire unlabeled dataset in each iteration.
Efficient Acoustic Model Refinement for Low Resource Languages
1. Efficient Acoustic Model Refinement for Low Resource
Languages using Semi-Supervised Learning Methods
Chellapriyadharshini M (MT2016041)
Guide: Prof. V. Ramasubramanian
June, 2018
Chellapriyadharshini M (IIIT-Bangalore) · Efficient Acoustic Model Refinement for Low Resource Languages using Semi-Supervised Learning · June 2018
2. Outline
1 Overview
Introduction
Motivation
2 Related Work
3 Proposed Framework
Overview
Corpus & Environment
Baseline: Non-Iterative Procedure
Iterative: Progressive Decoding of DU
Iterative: Progressive Decoding of Bins
Combined Procedure: Active Learning + Semi-Supervised Learning
4 Results
Comparison of Results
5 Future Work & Conclusion
Future Work
Conclusion
5. Overview
Introduction
This thesis addresses the problem of efficient acoustic-model
refinement using semi-supervised learning for low resource languages.
Proposed Method
The proposed semi-supervised learning method decodes the large unlabeled
training corpus using the seed model and, through various selection protocols,
retains the decoded utterances of high reliability, as judged by confidence
scores and iterative bootstrapping. The seed model is further improved using active learning.
M. Chellapriyadharshini, Anoop Toffy, SrinivasaRaghavan K. M.,
V. Ramasubramanian, “Semi-supervised and active-learning scenarios:
Efficient acoustic model refinement for a low resource Indian
language”, accepted at INTERSPEECH 2018, Hyderabad, India.
9. Overview
Motivation
Deep Learning Techniques - requirement of very large training corpus
Resource Scarce Languages
1 limited availability of digital spoken language corpus
2 lack of script level representations
3 limited means of labeling the speech corpus
4 limited access to linguistic knowledge, expertise or resources by which
to acquire lexical representations, annotations etc.
5 labeling is expensive - due to the high throughput of the incoming data
- Voice Search
Semi-supervised learning is therefore essential in the ASR context,
as it reduces the need for labeled transcriptions and other scarce
resources.
10. Related Work [1]
Lightly Supervised: as explored in prior work
11. Related Work [2]
Semi-Supervised / Unsupervised: as explored in prior work
12. Related Work [3]
Active Learning: as explored in prior work
Low-Resource Languages: as explored in prior work
13. Related Work [4]
Data Selection Strategies: as explored in prior work
19. Overview
Not Applicable:
Lack of availability of large amounts of Approximate Transcriptions /
Text Corpora
× Lightly-Supervised
× Language Models trained from large text corpora & interpolation
Lack of availability of large amounts of Audio corpus
× Iterative Strategy : Data Doubling based on Confidence
Limitations specific to the Language
× Models trained on close Dialects
× Multi-lingually trained Monolingual systems
Applicable Methods Reused:
Semi-Supervised Self-Training Approach
Confidence Scores based on Aposteriori Probability of acoustic units
What’s Different?
∗ Iterative Strategies - to make the best use of available data
∗ Combined Approach: Active Learning + Semi-Supervised Learning
21. Corpus & Environment
Tamil language read speech data provided by SpeechOcean and
Microsoft for the ‘Low Resource Speech Recognition Challenge for
Indian Languages’ in Interspeech 2018.
Total: 15.07 hours
Lexicon : IIT-Madras Common Label Set Lexicon for Tamil.
Vocabulary : 32540 words
Experiments conducted with the Kaldi ASR toolkit.
Acoustic model training: DNN-HMM framework.
Language model: word-level trigram model built from the
training corpus.
23. Baseline: Non-Iterative Procedure [1]
Semi-Supervised Learning using 25% Seed Data:
24. Baseline: Non-Iterative Procedure [2]
Confidence Score: a measure of the accuracy of the predicted labels.
It is the a posteriori probability of a phone or word hypothesis w, given
a sequence of acoustic feature vectors O_1^T.
Figure: Confidence Level vs. WER
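In symbols, following the slide's notation (with O_1^T the sequence of acoustic feature vectors), the confidence score is the standard posterior of the hypothesis w, obtained via Bayes' rule:

```latex
C(w) \;=\; P(w \mid O_1^T)
     \;=\; \frac{P(O_1^T \mid w)\, P(w)}{\sum_{w'} P(O_1^T \mid w')\, P(w')}
```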
25. Baseline: Non-Iterative Procedure [3]
Semi-Supervised Self-Training:
Seed acoustic model AMseed is built on Dseed.
AMseed is used to predict approximate transcriptions for DU; the
accuracy of the decoding is measured via confidence scores.
Confidence intervals: (.95, 1), (.9, .95), (.85, .9), (.8, .85), (0, .8)
The most confident of the predicted transcriptions are then added to the
training corpus for re-training.
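The binning-and-selection step can be sketched as follows; `decode` and `train` are hypothetical stand-ins for the Kaldi decoding and DNN-HMM training recipes, and the bin boundaries follow the confidence intervals listed above:

```python
# Sketch of the baseline, non-iterative self-training pass (assumed helpers).
BINS = [(0.95, 1.0), (0.90, 0.95), (0.85, 0.90), (0.80, 0.85), (0.0, 0.80)]

def bin_index(conf):
    """Index of the confidence bin (lo, hi] that `conf` falls into."""
    for i, (lo, hi) in enumerate(BINS):
        if lo < conf <= hi:
            return i
    return len(BINS) - 1  # catch-all lowest bin

def self_train_once(am_seed, d_seed, d_u, decode, train, n_bins_to_add=2):
    """One decoding of DU: keep the most confident bins, then re-train."""
    hyps = decode(am_seed, d_u)            # [(utt, transcript, confidence), ...]
    bins = {i: [] for i in range(len(BINS))}
    for utt, text, conf in hyps:
        bins[bin_index(conf)].append((utt, text))
    selected = [p for i in range(n_bins_to_add) for p in bins[i]]
    return train(d_seed + selected)        # seed + confident hypotheses
```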
26. Baseline: Non-Iterative Procedure [4]
Framework
27. Baseline: Non-Iterative Procedure [5]
WER Profile on T
28. Baseline: Non-Iterative Procedure [6]
Distribution of Confidence Bins
30. Iterative: Progressive Decoding of DU [1]
Decode the entire DU repeatedly to derive progressively better
decodings, such that the bins have progressively increasing
populations of utterances.
The reuse of the iteratively refined bins results in progressively more
accurate acoustic models.
The iterative procedure yields a lower WER profile than the
non-iterative procedure.
Starting from the best model produced by the above iterative
procedure, a ‘global’ iteration is carried out.
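A minimal sketch of this progressive decoding of DU, with hypothetical `decode`, `select`, and `train` helpers standing in for the actual Kaldi steps:

```python
# Sketch of iterative semi-supervised training: re-decode ALL of DU each
# iteration with the latest model (assumed helpers, not the real Kaldi calls).
def progressive_decode_du(am, d_seed, d_u, decode, select, train, n_iters=5):
    for _ in range(n_iters):
        hyps = decode(am, d_u)            # decode the entire unlabeled set DU
        confident = select(hyps)          # e.g. keep confidence > 0.9
        am = train(d_seed + confident)    # refined model for the next pass
    return am
```

Note that the full unlabeled set is decoded in every iteration, which is exactly the cost the bin-based variant on the next slides avoids.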
31. Iterative: Progressive Decoding of DU [2]
Framework
32. Iterative: Progressive Decoding of DU [3]
Redistribution of utterances in bins
33. Iterative: Progressive Decoding of DU [4]
WER Profile on T
35. Iterative: Progressive Decoding of Bins [1]
The utterances belonging to each bin obtained after the first decoding
are frozen.
Only the decoded transcriptions of these fixed bin contents
get better, until convergence.
Once a bin Bi converges, the converged acoustic model AMi is then
used as the starting point for the iterations on the next bin
Bi+1.
Advantage: the entire DU need not be decoded each time, which
reduces the computation time many-fold.
The two proposed iterative learning methods produce comparable
results, so the second method can be preferred for its lower
computation time.
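The bin-wise iteration can be sketched as below; again `decode` and `train` are hypothetical stand-ins for the Kaldi steps, and a fixed iteration count stands in for the convergence test:

```python
# Sketch of bin-wise refinement: bin membership is frozen after the first
# decoding, and only the current bin is re-decoded (assumed helpers).
def progressive_decode_bins(am, d_seed, bins, decode, train, n_iters=3):
    training_set = list(d_seed)
    for bin_utts in bins:                     # B1, B2, ... in confidence order
        for _ in range(n_iters):              # stand-in for "until convergence"
            hyps = decode(am, bin_utts)       # only this bin is re-decoded
            am = train(training_set + hyps)
        training_set += decode(am, bin_utts)  # keep Bi's converged transcripts
    return am
```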
36. Iterative: Progressive Decoding of Bins [2]
Framework
37. Iterative: Progressive Decoding of Bins [3]
WER Profile on T
39. Combined Procedure: Active Learning + Semi-Supervised
Learning [1]
Active Learning eases the labeling bottleneck by asking queries in the
form of unlabeled instances to be labeled by an oracle.
Pool Based Active Learning: queries are selected from a large pool of
unlabeled instances.
40. Combined Procedure: Active Learning + Semi-Supervised
Learning [2]
Evaluate the informativeness of the unlabeled samples by some means
- Querying Strategy.
Uncertainty Sampling: selects the sample about which the model is
“least certain” how to label, i.e., whose prediction has the lowest confidence.
This technique is popular in statistical sequence-modeling tasks, as in
the case of speech, because the most likely label sequence (and its
associated likelihood) can be computed efficiently using dynamic
programming.
41. Combined Procedure: Active Learning + Semi-Supervised
Learning [3]
42. Combined Procedure: Active Learning + Semi-Supervised
Learning [4]
Seed Corpus built by Uncertainty Sampling from 2.5% initial seed:
Only enough manual labeling effort is available to transcribe 25% of
the data set.
So instead of labeling randomly selected utterances, we pick and
choose the subset to be labeled, so as to improve the quality
of the initial seed acoustic model.
Initial data split: Dseed : DU : T = 2.5 : 87.5 : 10
AMseed is trained on Dseed and used to decode DU, and the corresponding
confidence scores are computed.
Select the ‘x%’ of DU with the lowest confidence scores, have them
transcribed, add them to the training corpus Dseed, and re-train the
acoustic model.
Repeat these steps until the training corpus Dseed has grown to
25% of the entire data set.
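The seed-growing loop above can be sketched as follows; `decode`, `train`, and the oracle `label` are hypothetical stand-ins, and `decode` is assumed here to return one confidence score per utterance:

```python
# Sketch of seed growth by uncertainty sampling: repeatedly send the LEAST
# confident x% of DU to the oracle for transcription (assumed helpers).
def grow_seed_by_uncertainty(am, d_seed, d_u, decode, train, label,
                             x_percent=2.5, target_fraction=0.25):
    total = len(d_seed) + len(d_u)
    while len(d_seed) / total < target_fraction and d_u:
        hyps = decode(am, d_u)                 # [(utt, confidence), ...]
        hyps.sort(key=lambda p: p[1])          # least confident first
        k = max(1, int(len(d_u) * x_percent / 100))
        chosen = [u for u, _ in hyps[:k]]
        d_seed = d_seed + label(chosen)        # oracle transcribes them
        d_u = [u for u in d_u if u not in chosen]
        am = train(d_seed)                     # re-train on the grown seed
    return am, d_seed, d_u
```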
43. Combined Procedure: Active Learning + Semi-Supervised
Learning [5]
Now Semi-Supervised learning is applied using the Non-Iterative
procedure explained previously.
44. Combined Procedure: Active Learning + Semi-Supervised
Learning [6]
WER Profile on T
46. Comparison of Results [1]
WER on T
47. Comparison of Results [2]
WER on T
Dseed: 50% of the total possible WER reduction is achieved,
after iterative training.
Dseed built by uncertainty sampling: 41.2% of the total possible WER
reduction is achieved, without any iterative training.
49. Future Work
To extend the iterative procedure on the combined active and
semi-supervised framework.
To extend the whole work on the 50 hours data.
To use a different measure of confidence of prediction - select
utterances that provide most benefit to the whole data set.
Explore different language models - varying training corpus size,
in-domain / out-of-domain data, multiple language model
components using many sources and then combine them with varying
weights.
Ensemble methods: for instance, Co-Training (Semi-Supervised) and
Query By Committee (Active Learning) combination.
51. Conclusion
We have addressed the problem of acoustic model training in a low
resource setting, where only a small amount of seed data is assumed to
be available, and have proposed semi-supervised learning and active
learning protocols for refining the seed acoustic model from a larger,
but unlabeled, training corpus.
The proposed semi-supervised learning achieves as much as 50% (with
iteration) and 41% (without iteration) of the best realizable WER
reduction.
53. Questions?
Thank you for your time!