Environmentally robust ASR front-end for DNN-based acoustic models
• Do not compare results across different tables!
– Configurations may differ
• Most results shown here can be found in:
Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015.
1. Motivation
2. Corpus
• AMI meeting corpus
3. Baseline systems
• SI and SAT set-ups
4. Assessment of environmental robustness of
DNN acoustic models
5. Front-end techniques
6. Combined effects
Little investigation done so far on front-end processing for DNN acoustic models
• Multi-party interaction
– 4 participants in each meeting
• Multi-channel recordings
– Distant microphones – only first channel used
– Head-set & lapel microphones
• 2 recording set-ups
– 70h scenario-based meetings
– 30h real meetings
• Different rooms
• Multiple sources of distortion
– Reverberation
– Additive noise
– Overlapping speech
• Moving speakers
• Many non-native speakers
• SI: speaker independent
– For online transcription
– DNN-HMM hybrid
• SAT: speaker adaptive training
– For offline transcription
– MLP tandem
• Manual segmentations used
• Overlapping segments ignored
State output distributions modelled with
– GMM or
– DNN

Acoustic likelihood:
p(\mathbf{X}) = \sum_{\mathbf{q}} P(q_0) \prod_{t=1}^{T} P(q_t \mid q_{t-1}) \, p(\mathbf{x}_t \mid q_t)

GMM state output distribution:
p(\mathbf{x}_t \mid j) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})

DNN (hybrid) scaled likelihood:
p(\mathbf{x}_t \mid j) \propto p(j \mid \mathbf{x}_t) / p(j)
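For concreteness, a minimal Python/numpy sketch of the posterior-to-likelihood conversion in the last equation (array shapes and the flooring value are illustrative assumptions, not the settings used in this work):

```python
import numpy as np

def posteriors_to_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert DNN state posteriors p(j|x_t) into scaled log-likelihoods
    log p(x_t|j) = log p(j|x_t) - log p(j), up to a per-frame constant.

    posteriors: (T, J) array of per-frame state posteriors
    priors:     (J,)  array of state priors, e.g. relative state
                frequencies counted from the training alignment
    """
    posteriors = np.maximum(posteriors, floor)  # avoid log(0)
    priors = np.maximum(priors, floor)
    return np.log(posteriors) - np.log(priors)
```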
DNN training:
• Discriminative pre-training
• Cross-entropy fine-tuning
• Trained on a Tesla K20 GPU
• cuBLAS 5.5 used
• Mini-batch size: 800 frames
• Learning rate: “newbob” scheduling
• 10% held-out data for CV
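The “newbob” schedule keeps the learning rate fixed until the held-out (CV) improvement stalls, then halves it after every epoch and stops once improvement stalls again. A minimal sketch, with illustrative threshold and learning-rate values (assumptions, not the settings used here):

```python
class NewbobScheduler:
    """Sketch of 'newbob' learning-rate scheduling over CV accuracy."""

    def __init__(self, initial_lr=0.008, ramp_threshold=0.5, stop_threshold=0.1):
        self.lr = initial_lr
        self.ramp_threshold = ramp_threshold  # CV gain that triggers halving
        self.stop_threshold = stop_threshold  # CV gain that stops training
        self.ramping = False
        self.prev_acc = None

    def update(self, cv_acc):
        """Call once per epoch with held-out (CV) frame accuracy.
        Returns (learning rate for the next epoch, stop flag)."""
        if self.prev_acc is None:             # first epoch: nothing to compare
            self.prev_acc = cv_acc
            return self.lr, False
        improvement = cv_acc - self.prev_acc
        self.prev_acc = cv_acc
        if not self.ramping and improvement < self.ramp_threshold:
            self.ramping = True               # improvement stalled: start halving
        if self.ramping:
            if improvement < self.stop_threshold:
                return self.lr, True          # stalled again: stop training
            self.lr *= 0.5
        return self.lr, False
```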
System | Parameterisation | Dev | Eval | Avg (%WER)
MPE GMM-HMM | HLDA | 54.7 | 55.6 | 55.2
DNN-HMM hybrid | FBANK | 43.5 | 42.6 | 43.1
This work | – | 40.0 | 39.3 | 39.7
Data set | Parameterisation | Dev | Eval | Avg (%WER)
SDM | FBANK | 43.5 | 42.6 | 43.1
IHM | FBANK | 28.2 | 24.6 | 26.4
• 39.2% of the errors are caused by acoustic distortion
• DNN-HMMs are not especially robust to it
DNN training:
• Discriminative pre-training
• Cross-entropy fine-tuning
Alignment | DNN input | Dev | Eval | Avg (%WER)
SDM | IHM | 30.6 | 27.0 | 28.8
IHM | SDM | 41.8 | 40.8 | 41.3
IHM | SDM | 41.7 | 40.6 | 41.2
Using a 648–2,000⁵–4,000 DNN (648 inputs, five 2,000-unit hidden layers, 4,000 outputs):
DNN training is more sensitive to noise than state alignment is.
Front-end techniques:
• Speech enhancement
• Feature transformation
• Multi-stream features
Previous work
– Beamforming yields gains
– No investigation of single-microphone algorithms
• Based on linear, (almost) time-invariant filters
• Applied to complex-valued STFT coefficients
• Filters automatically adjusted using the observations
– WPE for 1-ch dereverberation (NTT’s work)
– BeamformIt for denoising (ICSI’s work)
• 8 microphones used; dedicated to meetings
• Unlikely to produce irregular transitions
y_{t,f} = x_{t,f} - \sum_{k=T_0}^{T_1} g_{k,f} \, x_{t-k,f}
(the current STFT coefficient minus a filtered combination of past observations)
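A minimal single-channel sketch of this kind of filtering (a simplified, per-frequency-bin variant of the WPE iteration; delay, filter order and iteration count are illustrative assumptions):

```python
import numpy as np

def wpe_one_bin(x, delay=3, order=10, iters=3, eps=1e-8):
    """Subtract a delayed linear prediction of late reverberation:
    y_t = x_t - sum_k g_k x_{t-delay-k}, for one STFT frequency bin.

    x: complex STFT coefficients of one bin, shape (T,)
    Returns the dereverberated coefficients y.
    """
    T = len(x)
    # Delayed observation matrix: row t = [x_{t-delay}, ..., x_{t-delay-order+1}]
    A = np.zeros((T, order), dtype=complex)
    for k in range(order):
        d = delay + k
        A[d:, k] = x[:T - d]
    y = x.copy()
    for _ in range(iters):
        # Re-weight by the current desired-signal power estimate, as in WPE
        w = 1.0 / np.maximum(np.abs(y) ** 2, eps)
        Aw = A * w[:, None]
        g = np.linalg.solve(A.conj().T @ Aw + eps * np.eye(order),
                            Aw.conj().T @ x)
        y = x - A @ g                      # subtract the predicted reverberation
    return y
```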
Alignment | Dev: SDM / +Derev / BFIt (8 mics) | Eval: SDM / +Derev / BFIt (8 mics) (%WER)
MPE | 43.8 / 41.8 / 38.6 | 43.0 / 41.3 / 36.6
Hybrid | 43.5 / 41.7 / 38.8 | 43.3 / 41.4 / 36.7
• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well
DNN size (units × layers) | Context frames | Dev: SDM / +Derev | Eval: SDM / +Derev (%WER)
1,000 × 5 | 9 | 43.8 / 41.8 | 43.0 / 41.3
1,500 × 5 | 9 | 43.5 / 42.0 | 42.6 / 41.1
1,500 × 5 | 13 | 42.8 / 41.8 | 42.9 / 41.2
1,500 × 5 | 19 | 43.0 / 41.7 | 42.9 / 41.2
2,000 × 5 | 9 | 43.8 / 41.3 | 42.9 / 40.4
4.7% relative gain from 1-ch dereverberation
Speech enhancement
Feature transformation
Multi-stream features
No positive results reported previously for spectral/feature enhancement with DNN acoustic models
• Applied to magnitude spectra
• Cross terms (often) ignored
• Frame-by-frame modification
– Harmful for DNN?
• Noise estimated using long-term statistics
– IMCRA (used here), minimum statistics, etc
• Deltas from un-enhanced speech
– Essential for obtaining gains
|y_{t,f}|^2 = |x_{t,f}|^2 + |n_{t,f}|^2
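A minimal sketch of the resulting power-domain subtraction (the noise estimate is assumed given; in this work it would come from IMCRA’s long-term statistics):

```python
import numpy as np

def spectral_subtraction(power_spec, noise_power, floor=0.01):
    """Power spectral subtraction based on |y|^2 = |x|^2 + |n|^2
    (cross terms ignored).

    power_spec:  (T, F) noisy power spectra |y_{t,f}|^2
    noise_power: (F,) or (T, F) noise power estimate |n_{t,f}|^2
    """
    clean = power_spec - noise_power
    # Flooring prevents negative power estimates; musical noise remains
    # a known artefact of this frame-by-frame modification.
    return np.maximum(clean, floor * power_spec)
```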
• Applied to FBANK features
• The following mismatch function used
• Frame-by-frame modification
• Noise model estimated with EM
• Deltas from un-enhanced speech
y_t = x_t + h + \log(1 + \exp(n_t - x_t - h))
(y: noisy and x: clean FBANK features, n: noise, h: channel)
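As a sketch, the mismatch function itself is a one-liner (np.log1p is used for numerical stability; variable names are generic):

```python
import numpy as np

def fbank_mismatch(x, n, h):
    """Mismatch function y = x + h + log(1 + exp(n - x - h)) relating
    clean FBANK features x, noise n and channel h to the noisy
    observation y; VTS-style enhancement linearises this function
    around the current noise model estimate."""
    return x + h + np.log1p(np.exp(n - x - h))
```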
Enhancement target (Spectrum / Feature) | Dev | Eval | Avg (%WER)
N / N | 42.0 | 41.1 | 41.6
Y / N | 41.3 | 40.9 | 41.1
N / Y | 41.4 | 40.5 | 41.0
Y / Y | 42.0 | 41.0 | 41.5
• Small but consistent gains
• Different enhancement methods should not be cascaded
Using the multi-stream approach to combine both enhancement targets (last row):
Enhancement target (Spectrum / Feature) | Dev | Eval | Avg (%WER)
N / N | 42.0 | 41.1 | 41.6
Y / N | 41.3 | 40.9 | 41.1
N / Y | 41.4 | 40.5 | 41.0
Y / Y | 42.0 | 41.0 | 41.5
Y / Y (multi-stream) | 41.4 | 40.4 | 40.9
Speech enhancement
Feature transformation
Multi-stream features
• Frame level
– FMPE, RDT, FE-CMLLR
– Seems to be subsumed by DNN
• Speaker (or environment) level
– Global CMLLR, LIN, fDLR, VTLN
– Multiple decoding passes required → SAT
• Utterance level
– Single-pass decoding → SI
• Seems robust against supervision errors
• STC transform used to deal with feature correlations:

y_t^{(s)} = A^{(s)} L x_t + b^{(s)}

where x_t is the stacked feature vector, L the global STC transform, and A^{(s)}, b^{(s)} the speaker-dependent CMLLR transform.
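A minimal sketch of applying such a transform (numpy/scipy; the 24-dimensional stream split is an assumption for illustration):

```python
import numpy as np
from scipy.linalg import block_diag

def apply_cmllr(features, A, b, L=None):
    """Apply a speaker-level CMLLR feature transform y_t = A (L x_t) + b
    to a (T, D) feature matrix, one frame per row.

    A: (D, D) transform, b: (D,) bias, L: optional (D, D) STC transform.
    """
    x = features if L is None else features @ L.T
    return x @ A.T + b

# A 'block diagonal' A constrains statics, deltas and delta-deltas to
# separate blocks, e.g. for 24-dim streams (D = 72):
#   A = block_diag(A_static, A_delta, A_deltadelta)
```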
Form of speaker transform | Dev | Eval | Avg (%WER)
None (SI) | 42.6 | 40.2 | 41.4
Full | 37.4 | 37.4 | 37.4
Block diagonal | 37.3 | 36.6 | 37.0
• ~10% relative gains obtained
• “Block diagonal” outperforms “full”
Data set | Form of speaker transform | Dev | Eval | Avg (%WER)
SDM | None (SI) | 42.6 | 40.2 | 41.4
SDM | Full | 37.4 | 37.4 | 37.4
SDM | Block diagonal | 37.3 | 36.6 | 37.0
IHM | None (SI) | 27.8 | 24.2 | 26.0
IHM | Full | 23.8 | 21.6 | 22.7
Quantised CMLLR (QCMLLR):

y_t = A^{(c(u))} L x_t + b^{(c(u))},   c(u): cluster assigned to utterance u

Clustering performed using:
– utterance-specific iVectors
– K-means (GMM clustering yielded similar performance figures)
iVector model:

m^{(u)} = m^{(0)} + T w^{(u)}

m^{(0)}: UBM mean supervector; T: total-variability matrix; w^{(u)}: utterance iVector, a subspace representation of the deviation from the UBM.
[Figure: utterance supervectors m^{(1)}, m^{(2)}, m^{(3)} around m^{(0)} within the variability subspace]
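A minimal sketch of the clustering step (scikit-learn K-means over utterance iVectors; iVector extraction itself is assumed done upstream):

```python
import numpy as np
from sklearn.cluster import KMeans

def qcmllr_assign(ivectors, n_clusters=32, seed=0):
    """Quantise utterance-specific iVectors w(u) into n_clusters groups;
    each utterance u is then decoded in a single pass with the CMLLR
    transform (A, b) of its cluster c(u).

    ivectors: (U, d) array, one iVector per utterance.
    Returns (cluster labels c(u), fitted K-means model).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(ivectors)
    return labels, km
```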
Data set | #Clusters | Dev | Eval | Avg (%WER)
SDM | No QCMLLR | 41.9 | 40.9 | 41.4
SDM | 64 | 41.0 | 40.4 | 40.7
SDM | 32 | 41.0 | 40.0 | 40.5
SDM | 16 | 41.5 | 40.5 | 41.0
IHM | No QCMLLR | 27.8 | 24.2 | 26.0
IHM | 32 | 26.9 | 23.5 | 25.2
• Using 32 clusters yielded the best performance
• Similar gains on both SDM and IHM
Speech enhancement
Feature transformation
Multi-stream features
• Originally proposed by Aachen for shallow MLP
tandem configurations
• Exploits DNN’s insensitivity to the increase in input
dimensionality
• (Hopefully) complement features masked by noise
• Allows multiple enhancement results to be
combined
• Four types of auxiliary features investigated:
– MFCC (Δ/Δ²)
– PLP
– Gammatone cepstra
• Different frequency warping
• STFT not used
– Intra-frame deltas (δ/δ²)
• Emphasise spectral peaks/dips
Feature set | #features | Dev | Eval | Avg (%WER)
FBANK+Δ+Δ² (baseline) | 72 | 41.9 | 40.9 | 41.4
+PLP | 85 | 40.7 | 40.3 | 40.5
+Gammatone | 88 | 40.8 | 40.0 | 40.4
+MFCC | 85 | 41.1 | 39.7 | 40.4
+MFCC+Δ+Δ² | 111 | 40.6 | 40.2 | 40.4
+δ+δ² (intra-frame) | 120 | 40.9 | 39.8 | 40.4
+MFCC+δ+δ² | 133 | 40.4 | 39.8 | 40.1
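A minimal sketch of multi-stream input construction (plain frame-wise concatenation with numpy; stream names are illustrative):

```python
import numpy as np

def multi_stream_input(fbank, *aux_streams):
    """Concatenate the FBANK baseline with auxiliary frame-synchronous
    feature streams (e.g. MFCC, PLP, gammatone cepstra) along the
    feature axis to form the DNN input.

    Each stream has shape (T, D_i); the result is (T, sum(D_i)).
    """
    streams = (fbank,) + aux_streams
    T = fbank.shape[0]
    assert all(s.shape[0] == T for s in streams), "streams must be frame-aligned"
    return np.concatenate(streams, axis=1)
```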
• Speech enhancement
– Linear filtering
– Spectral/feature enhancement
• Feature transformation
– Quantised CMLLR
– (Global CMLLR for SAT)
• Multi-stream features
Combined effects (SI set-up):
Front-end | Dev | Eval | Avg (%WER)
FBANK baseline | 43.1 | 42.4 | 42.8
+WPE | 41.8 | 40.7 | 41.3
+MFCC+δ+δ² | 40.5 | 40.1 | 40.3
+IMCRA+FE-VTS | 40.0 | 39.3 | 39.7
+QCMLLR | 40.9 | 39.5 | 40.2
• Effects additive except for QCMLLR
• QCMLLR may work if applied to the entire feature set
System | Parameterisation | Dev | Eval | Avg (%WER)
SAT GMM-HMM (MPE trained) | HLDA | 48.8 | 50.2 | 49.5
SAT tandem (MPE trained) | FBANK | 40.7 | 40.9 | 40.8
SI hybrid | FBANK | 43.5 | 42.6 | 43.1
• The SAT tandem outperforms the SAT GMM-HMM
• The SAT tandem also outperforms the SI hybrid
Combined effects (SAT set-up):
Front-end | Dev | Eval | Avg (%WER)
FBANK baseline | 40.1 | 41.3 | 40.7
+WPE | 38.9 | 39.3 | 39.1
+MFCC | 38.5 | 38.5 | 38.5
+IMCRA+FE-VTS | 38.4 | 38.7 | 38.6
+CMLLR | 36.6 | 36.7 | 36.7
+CMLLR | 36.9 | 37.0 | 37.0
+CMLLR | 38.4 | 38.6 | 38.5
• Effects of WPE and CMLLR are additive
• Using auxiliary features yields small gains over CMLLR
features
• Denoising subsumed by CMLLR (as expected)
• Front-end processing approaches yield gains
over state-of-the-art DNN-based AMs
– Linear filtering (WPE, BeamformIt)
– Spectral/feature enhancement (IMCRA, FE-VTS)
– Feature transformation (QCMLLR, CMLLR)
– Multi-stream features
• Possible to combine different classes of
approaches