ICCIT_presentation

Isolated Bangla Word Recognition and Speaker Detection by
Semantic Modular Time Delay Neural Network (MTDNN)
Authors:
Md. Yasin Ali Khan, S. M. Mostaq Hossain and Mohammed Moshiul Hoque
Department of Computer Science and Engineering, Chittagong University of Engineering
and Technology, Chittagong-4349, Bangladesh
Presented by:
Md. Yasin Ali Khan

Introduction
• Speech is the most common forms of human
communication.
3 May 2016 2
• Also used for human to machine communication.
Human to human Human to machine

Word recognition: Recognizing an uttered word that
matched with the actual word.
Speaker recognition: Recognizing a particular person based
on his/her voice signal.
Isolated words: Adjacent words having sufficient silence.
3 May 2016 3
Introduction (cont.)

3 May 2016 4
Sp-1
Sp-2
Sp-3
আকাশ, পানি, িদী
পানি MTDNN Module
Modular Time Delay Neural Network (MTDNN): Development
of a time-delay neural network which is scaled by modularity of
large nets and based on smaller subcomponent nets.
Figure 1: Abstract representation of proposed framework.
Introduction (cont.)

3 May 2016 5
• No previous work on speaker detection in Bangla
language
• Experimenting with large vocabulary of Bangla
words and different syllables
Motivation

• Bangla speech based data documentation & information retrieval
3 May 2016 6
Control
module
• Bangla speech based security control
Control
module
Motivation (cont.)

• Designed a theoretical framework for recognizing Bangla
isolated words and detecting corresponding speakers.
• Implemented the framework using Modular Time Delay
Neural Network (MTDNN).
• Evaluated the system by using different Bangla words and
speakers.
3 May 2016 7
Contributions

• Speech Recognition using Artificial Neural Network (Roy et al. , 2002)
- Restricted on 5 words
- ANN Technology
- FFT Feature
• Speech Recognition using 3 Layer Back Propagation Neural Network (Islam
et al. , 2005)
- Restricted on 30 words
- Develops Multi-Layer Back-propagation NN
- FFT Feature
Limitation: High training time
3 May 2016 8
Bangla Language
Related Work

3 May 2016 9
Methodology
• Problem Formulation
𝑣 𝑡 =
𝐸ₒ𝑒ᵅᵗ sin 𝜔𝑡 0 ≤ 𝑡 < 𝑇𝑒
𝐸₁ 𝑒¯ᵝ⁽ᵗ¯ᵀᵉ⁾ − 𝑒¯ᵝ⁽ᵀᵉ¯ᵀ ͨ⁾ 𝑇𝑒 ≤ 𝑡 < 𝑇𝑐
0 𝑇𝑐 ≤ 𝑡 ≤ 𝑇𝑜
Glottal Pulse Model for Voiced Signals as The Liljencrants-
Fant (LF) model [5]

3 May 2016 10
Methodology (cont.)
• Mathematical Description
[ŵ₁, ŵ₂,….ŵɴ] = 𝒎𝒂𝒙
𝒘 𝟏, 𝒘 𝟐 ,…….𝒘 𝑵
𝒇( 𝒘 𝟏, 𝒘 𝟐 , … … . 𝒘 𝑵 𝑿, 𝜦)
[P₁, P₂,….Pɴ] = 𝒎𝒂𝒙
𝒑 𝟏, 𝒑 𝟐 ,…….𝒑 𝑵
𝒇( 𝒑 𝟏, 𝒑 𝟐 , … … . 𝒑 𝑵 𝑿, 𝜦)
- For Speech Recognition
- For Speaker Detection
where f is the conditional probability of a sequence of words W given
a sequence of speech features X and 𝚲 is pre-trained speech model.
where f is the conditional probability of a sequence of speakers P
given a sequence of speaker features X and 𝚲 is pre-trained speaker
model.

3 May 2016 12Figure 2: System Architecture.
Methodology (cont.)

3 May 2016 13
• MTDNN
Framework
Figure 3: MTDNN framework.
Methodology (cont.)

3 May 2016 13
• Environmental Setup
Operating System Lab in CUET
• System Requirements
Operating System : Windows 7 (Core i3, 3.66 GHz, 8GB RAM)
Software : Creative WaveStudio, Matlab
Tools : Close talk microphone
Experiments

3 May 2016 14
• Data Collection
Experiments (cont.)
No. of Participants : 5
No. of Words : 46
Age Difference : 21-23
Uttering times : 3 (For each word)
(5 [participants] ˟ 46 [no. of words] ˟ 3 [no. of times uttered each word])
= 690 samples
• Sample Size

3 May 2016 15
Average file size 81.67 KB
Total no files 690
Average duration
of files (.wav)
0.82 sec
Age average 22
• File Information
for Samples
Experiments (cont.)

3 May 2016 16
Experiments (cont.)
Mono Syllable:
দেশ, বই, দবশ, দশোক, দ োট, কোজ, তোর, রোত, গোছ, দকউ, আজ
Di Syllable:
তততি, পড়েি, হড়ব, টোকো, থোড়কি, আমরো, পোতো, পুতিশ, কথো, উৎসব, যুদ্ধ, তবজয়
Tri Syllable:
ধরড়ের, জোিোড়িোর, আমোড়ের, বতত মোি, ফিোফি, সুতবধো, কততত পড়ের, দসৌন্দযত, পেতযোগ, পততথবীর
Poly Syllable:
স্বোধীিতোর, সহড়যোতগতো
Poly Syllable:
Word, file, open, print, exit, edit, cut, copy, paste, doc1, doc2
• Sample Words List

3 May 2016 17
Epoch 3
Avg. training time 1 minute and 5 seconds
Gradient error 4.69e-13 [Range (1.00 to 1.00e-10)]
Mu error 1.00e-06 [Range (0.00100 to 1.00e+10)]
Validation
checks
2
Performance 0.165 [Range (0.200 to 0.00)]
Accuracy 82.66%
• Result Analysis of
Word Recognition
&
Speaker Detection
Experimental Results

3 May 2016 18
Method No of words Feature Recognition
rate
Roy et al. [1] 5 FFT 82%
Islam et al. [2] 13 FFT 78.85%
Proposed 35 MFCC 82.66%
• Comparison Table with
Previous Work
Experimental Results (cont.)

• Technology: Modular Time Delay Neural Network
(MTDNN)
• 82.66% overall accuracy ( in avg.)
3 May 2016 21
Conclusion

3 May 2016 20
• Future Recommendations
• Experimenting with More Bangla Words
• Variable Age of Participants
Conclusion (cont.)

[1] K. Roy, D. Das, and M. G. Ali, “Development of the speech recognition system using artificial neural
network,” in Proc.5th International Conference on Computer and Information Technology (ICCIT02),
Dhaka, Bangladesh, 2002.
[2] M. R. Islam, A. Sayeed, M. Sohail, M. W. H. Sadid, M. A. Mottalib, “Bangla Speech Recognition
using three layer Back-Propagation Neural Network”, Proc. of NCCPB, Dhaka, 2005.
[3] M. F. Khan and R. C. Debnath, “Comparative Study of Feature ExtractionMethods for Bangla
Phoneme Recognition”, 5th ICCIT-2002, Dhaka, Bangladesh, pp. 257-261.
[4] J. C. Bezdek, R. Ehrlich, W. Full, “FCM: The Fuzzy C-Means Clustering Algorithm”, Computers &
Geosciences Vol. 10, No. 2-3, pp. 191-203, 1984.
[5] L. R. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition”, Prentice-Hall, Englewood
Cliffs, NJ, 1993.
3 May 2016 23
References

ICCIT_presentation

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (17)

Similar to ICCIT_presentation

Similar to ICCIT_presentation (20)

ICCIT_presentation