Exploiting contextual information for improved phoneme recognition

EXPLOITING CONTEXTUAL INFORMATION
FOR IMPROVED PHONEME RECOGNITION
Joel Pinto, B. Yegnanarayana, H. Hermansky, Mathew Magimai.-Doss

presented by
Sebastian T. Hafner

OVERVIEW

• Introduction

• Basic Phoneme Recognizer

• Contextual Information

  • at the feature level

  • at the posterior level

BASICS

TIMIT DATABASE

• read speech

• American English

• 630 speakers

• 8 main dialects

• training set:

  • 3000 utterances

  • 375 speakers

• test set:

  • 1344 utterances

  • 168 speakers

PHONEME RECOGNITION

• frame length: 25 ms

• step size: 10 ms
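The 25 ms / 10 ms framing can be sketched as follows. This is only an illustration: the 16 kHz sample rate, the dummy signal, and the name `frame_signal` are assumptions and do not come from the slides.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, step_ms=10):
    """Split a list of samples into overlapping frames.

    sample_rate=16000 is an assumed value; the slides only give the
    25 ms frame length and the 10 ms step size.
    """
    window = sample_rate * window_ms // 1000   # samples per frame
    step = sample_rate * step_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


signal = [0.0] * 16000          # one second of silence as dummy input
frames = frame_signal(signal)
print(len(frames), len(frames[0]))
```

Because the step is shorter than the frame, neighbouring frames overlap by 15 ms.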

FEATURE EXTRACTION

• 13 PLP coefficients

• delta values

• delta-delta values

• 39 features per frame (13 + 13 + 13)
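How 13 coefficients become 39 features per frame can be sketched as below. The slides do not say how the delta values are computed; the regression formula over a window of two frames on each side, the edge handling, and the function names are assumptions made for illustration.

```python
def deltas(frames, n=2):
    """Regression-based delta features over a +/- n frame window.

    The window size n=2 and the regression formula are assumptions;
    edges are handled by repeating the first and last frame.
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                nxt = frames[min(t + k, last)][d]
                prev = frames[max(t - k, 0)][d]
                acc += k * (nxt - prev)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(plp):
    """Append delta and delta-delta values to each 13-dim PLP frame."""
    d1 = deltas(plp)
    d2 = deltas(d1)
    return [c + a + b for c, a, b in zip(plp, d1, d2)]


# Dummy input: 5 frames of 13 coefficients each (placeholder values).
plp = [[float(t + i) for i in range(13)] for t in range(5)]
features = add_dynamic_features(plp)
print(len(features[0]))  # 39
```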

HIDDEN MARKOV MODEL

[Diagram: MLP → phoneme probability p]

• MLP estimates the posterior probability

MULTILAYER PERCEPTRON (MLP)

• input: 39 features

• hidden layer: 1000 hidden units

• output: 39 phonemes
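A forward pass through a network of this size (39 inputs, 1000 hidden units, 39 outputs) might look like the sketch below. The sigmoid hidden units, the softmax output, and the random weights are assumptions; the slides give only the layer sizes.

```python
import math
import random


def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer with sigmoid units and a softmax output (assumed)."""
    hidden = []
    for row, b in zip(w1, b1):
        a = sum(w * xi for w, xi in zip(row, x)) + b
        hidden.append(1.0 / (1.0 + math.exp(-a)))
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    return softmax(logits)


def random_matrix(rows, cols, rng):
    """Small random weights, used here only as untrained placeholders."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hidden, n_out = 39, 1000, 39   # layer sizes from the slide
w1, b1 = random_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
w2, b2 = random_matrix(n_out, n_hidden, rng), [0.0] * n_out

x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]   # dummy feature vector
posteriors = mlp_forward(x, w1, b1, w2, b2)
print(len(posteriors), round(sum(posteriors), 6))
```

The softmax output is what allows the 39 values to be read as one probability per phoneme.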

POSTERIOR PROBABILITY

$P(q_t = i \mid x_t)$

• $q_t$: phoneme index at time $t$

• $x_t$: feature vector at time $t$

BAYES THEOREM

$$\frac{p(x_t \mid q_t = i)}{p(x_t)} = \frac{P(q_t = i \mid x_t)}{P(q_t = i)}$$

POSTERIOR PROBABILITY

$$P(q_t = i) = P(q_t = j) \quad \forall i, j \in \{1, 2, \dots, 39\}$$

• equal probability for each phoneme, so the scaled likelihood $p(x_t \mid q_t = i) / p(x_t)$ is proportional to the posterior $P(q_t = i \mid x_t)$
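The relation between the posterior and the scaled likelihood can be checked numerically. The sketch below is illustrative only; the posterior values are made up and the function name is hypothetical.

```python
def scaled_likelihoods(posteriors, priors):
    """Return p(x_t | q_t = i) / p(x_t) = P(q_t = i | x_t) / P(q_t = i)."""
    return [post / prior for post, prior in zip(posteriors, priors)]


n_phonemes = 39
uniform_priors = [1.0 / n_phonemes] * n_phonemes

# Dummy posterior vector: most of the mass on phoneme 0.
posteriors = [0.62] + [0.01] * (n_phonemes - 1)

scaled = scaled_likelihoods(posteriors, uniform_priors)
best_by_posterior = max(range(n_phonemes), key=posteriors.__getitem__)
best_by_scaled = max(range(n_phonemes), key=scaled.__getitem__)
print(best_by_posterior == best_by_scaled)
```

With equal priors every posterior is divided by the same constant, so the ranking of the phonemes does not change.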

PHONEME ERROR RATE

• spoken: /k/ /a/ /t/ /e/

• classified: /c/ /a/ /t/

• errors: 2 (/k/ recognized as /c/, /e/ missing)

• error rate: 2 : 4 = 50 %
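The error count in this example can be reproduced with an edit distance between the spoken and the classified phoneme sequence. The sketch below uses the standard Levenshtein distance with unit costs; the slides do not state which scoring tool or costs were used, so that choice is an assumption.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_sym in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_sym in enumerate(hypothesis, start=1):
            cost = 0 if ref_sym == hyp_sym else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(reference, hypothesis):
    """Errors divided by the number of spoken (reference) phonemes."""
    return edit_distance(reference, hypothesis) / len(reference)


spoken = ["k", "a", "t", "e"]
classified = ["c", "a", "t"]
print(phoneme_error_rate(spoken, classified))  # 0.5
```

One substitution plus one deletion gives two errors on four spoken phonemes, which matches the 50 % on the slide.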

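STANDARD APPROACH

• recognition accuracy: 68.12 %

AT THE FEATURE LEVEL

FEATURE LEVEL

[Diagram: a phoneme in three parts]

• first part: influenced by the earlier phoneme

• middle part: the phoneme itself

• last part: influenced by the next phoneme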

3 SINGLE MLPS

• MLP 1: $P(q_t = i \mid x_t, s_t = 1)$

• MLP 2: $P(q_t = i \mid x_t, s_t = 2)$

• MLP 3: $P(q_t = i \mid x_t, s_t = 3)$

• state index $s_t \in \{1, 2, 3\}$

• each MLP with 39 classes

1 LARGE MLP

• MLP: $P(q_t = i, s_t = j \mid x_t)$

• MLP with 117 classes (39 phonemes × 3 states)

LARGE MLP VS. 3 SMALLER

recognition accuracy (%) by labels used for training the MLP

classifier                     uniform    force aligned
one MLP with 117 classes       69.87      71.67
three MLPs, each 39 classes    70.13      69.70
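One way to read the "uniform" labels is that every phoneme segment is cut into three parts of equal length, one per state. The sketch below follows that reading; both the interpretation and the class numbering used for the 117-class MLP are assumptions, not details given on the slides.

```python
def uniform_state_labels(n_frames, n_states=3):
    """Assign each frame of one phoneme segment to a state 1..n_states
    by splitting the segment into parts of (nearly) equal length."""
    return [1 + (t * n_states) // n_frames for t in range(n_frames)]


def class_index(phoneme, state, n_states=3):
    """Map (phoneme 0..38, state 1..3) to one of 117 classes.

    The ordering is hypothetical; any fixed one-to-one mapping works.
    """
    return phoneme * n_states + (state - 1)


print(uniform_state_labels(9))   # [1, 1, 1, 2, 2, 2, 3, 3, 3]
print(class_index(38, 3))        # 116
```

Force-aligned labels would replace `uniform_state_labels` with state boundaries found by aligning the utterance against a trained model.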

AT THE POSTERIOR LEVEL

ESTIMATE POSTERIOR PROBABILITY

[Diagram: state posterior probabilities → MLP → $P(q_t = i \mid Q_t)$]

• $Q_t$: trajectory of state posterior probabilities

PARAMETERS

• 3000 hidden units

• window of 23 frames

• recognition accuracy: 73.4 %
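Building the trajectory Qt from a window of 23 frames can be sketched as below. A symmetric window (11 frames on each side of the current frame), 117 state posteriors per frame, and repeated edge frames are assumptions; the slides give only the window length.

```python
def stack_context(posteriors, context=11):
    """Build Q_t for every frame t by concatenating the posterior vectors
    of frames t-context .. t+context (2 * context + 1 = 23 frames).

    Repeating the first and last frame at the edges is an assumption.
    """
    last = len(posteriors) - 1
    stacked = []
    for t in range(len(posteriors)):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(posteriors[min(max(k, 0), last)])
        stacked.append(window)
    return stacked


n_classes = 117                                   # 39 phonemes x 3 states
frames = [[1.0 / n_classes] * n_classes for _ in range(30)]   # dummy posteriors
q = stack_context(frames)
print(len(q), len(q[0]))
```

Under these assumptions the second MLP would see 23 × 117 = 2691 input values per frame.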

ANALYSIS

INFORMATION ACROSS STATE POSTERIORS

system               accuracy (%)    gain from
1 state (classic)    68.12
1 state (sum)        70.17           better modeling by states
3 state              71.67           better decoding
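The "1 state (sum)" system can be read as summing the three state posteriors of each phoneme into one phoneme posterior. The sketch below shows that marginalization; the contiguous storage of the three states of a phoneme and the dummy values are assumptions.

```python
def phoneme_posteriors(state_posteriors, n_phonemes=39, n_states=3):
    """P(q_t = i | x_t) = sum over j of P(q_t = i, s_t = j | x_t).

    Assumes the states of one phoneme are stored next to each other
    (a hypothetical layout of the 117 outputs).
    """
    return [sum(state_posteriors[i * n_states:(i + 1) * n_states])
            for i in range(n_phonemes)]


# Dummy 117-dim vector: phoneme 0 gets 0.8 in total, phoneme 1 gets 0.2.
state_post = [0.0] * 117
state_post[0], state_post[1], state_post[2] = 0.2, 0.5, 0.1
state_post[3] = 0.2

summed = phoneme_posteriors(state_post)
print(len(summed), round(summed[0], 6), round(summed[1], 6))
```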

EXPERIMENT A

original data    modified data
0.04             0.09
0.64             0.9
0.32             0.01

• remaining values (all but the largest): random, such that the modified data sum to 1

EXPERIMENT B

original data    modified data
0.04             0.25
0.64             0.64
0.32             0.11

• remaining values (all but the largest): random, such that the modified data sum to 1
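Judging from the example values, experiment A sets the largest posterior to 0.9 and experiment B keeps it unchanged, while the other entries are replaced by random values so that the vector still sums to 1. The sketch below implements that reading; the uniform random draw, the rescaling, and the function name are assumptions rather than the procedure actually used.

```python
import random


def randomize_rest(posteriors, top_value=None, rng=random):
    """Keep the winning class, replace all other values by random ones.

    top_value=0.9 mimics experiment A; top_value=None keeps the original
    maximum as in experiment B. The remaining values are drawn uniformly
    and rescaled so that the whole vector sums to 1 (assumed scheme).
    """
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    top = posteriors[best] if top_value is None else top_value
    raw = [rng.random() for _ in posteriors]
    raw[best] = 0.0
    scale = (1.0 - top) / sum(raw)
    modified = [r * scale for r in raw]
    modified[best] = top
    return modified


rng = random.Random(0)
original = [0.04, 0.64, 0.32]
modified_a = randomize_rest(original, top_value=0.9, rng=rng)   # experiment A
modified_b = randomize_rest(original, rng=rng)                  # experiment B
print(modified_a)
print(modified_b)
```

In both cases the winning class stays the same; only the distribution of the remaining probability mass is destroyed.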

RECOGNITION ACCURACY

experiment      1 state MLP    3 state MLP
baseline        68.12          71.55
experiment A    62.77          70.27
experiment B    64.24          70.75

• Information in phoneme posteriors!

SUMMARY

• contextual information in features

• contextual information in probabilities

