• Do not compare results across different tables!
– Configurations may differ
• Most results shown here can be found in:
Takuya Yoshioka and Mark J. F. Gales, “Environmentally
robust ASR front-end for deep neural network acoustic
models,” Computer Speech and Language, vol. 31, no. 1, pp.
65-86, May 2015
1. Motivation
2. Corpus
• AMI meeting corpus
3. Baseline systems
• SI and SAT set-ups
4. Assessment of environmental robustness of
DNN acoustic models
5. Front-end techniques
6. Combined effects
Little investigation done
• Multi-party interaction
– 4 participants in each meeting
• Multi-channel recordings
– Distant microphones – only first channel used
– Head-set & lapel microphones
• 2 recording set-ups
– 70h scenario-based meetings
– 30h real meetings
• Different rooms
• Multiple sources of distortion
– Reverberation
– Additive noise
– Overlapping speech
• Moving speakers
• Many non-native speakers
• SI : speaker independent
– For online transcription
– DNN-HMM hybrid
• SAT: speaker adaptive training
– For offline transcription
– MLP tandem
• Manual segmentations used
• Overlapping segments ignored
State output distributions modelled with
– GMM or
– DNN
– HMM likelihood:
  p(X) = Σ_q P(q_0) ∏_{t=1}^T P(q_t | q_{t-1}) p(x_t | q_t)
– GMM state output distribution:
  p(x_t | j) = Σ_{m=1}^M c_{jm} N(x_t; μ_{jm}, Σ_{jm})
– DNN (hybrid) pseudo-likelihood:
  p(x_t | j) = P(j | x_t) p(x_t) / P(j)
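As a sketch of the hybrid pseudo-likelihood above (made-up posteriors and priors; NumPy assumed):

```python
import numpy as np

# Hypothetical DNN posteriors P(j | x_t) for 4 states on one frame,
# and state priors P(j) estimated from the training alignment.
posteriors = np.array([0.70, 0.15, 0.10, 0.05])
priors = np.array([0.40, 0.30, 0.20, 0.10])

# Hybrid decoding uses the scaled pseudo-likelihood
# p(x_t | j) proportional to P(j | x_t) / P(j); the p(x_t) term is
# dropped because it does not depend on the state j.
pseudo_likelihood = posteriors / priors

best_state = int(np.argmax(pseudo_likelihood))
```

Dividing by the priors compensates for the class imbalance the DNN absorbs from the training alignment.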
• Discriminative pre-training
• Cross-entropy fine-tuning
• Trained on a Tesla K20
• cuBLAS 5.5 used
• Mini-batch size: 800 frames
• Learning rate: “newbob” scheduling
• 10% held-out data for CV
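The "newbob" schedule can be sketched roughly as follows; exact trigger conditions and constants vary between recipes, so the initial rate, threshold, and halving rule here are illustrative assumptions:

```python
# Minimal sketch of "newbob" learning-rate scheduling (assumed variant):
# keep the rate fixed until the held-out CV improvement falls below a
# threshold, then halve it after every subsequent epoch.
def newbob_schedule(cv_errors, lr0=0.008, threshold=0.005):
    """Return the learning rate used in each epoch, given the CV error history."""
    lr = lr0
    rates = []
    ramping = False
    for i, err in enumerate(cv_errors):
        rates.append(lr)
        if i > 0:
            improvement = cv_errors[i - 1] - err
            if ramping or improvement < threshold:
                ramping = True  # once halving starts, it continues
                lr *= 0.5
    return rates

# Toy CV error trajectory: big early gains, then a plateau.
rates = newbob_schedule([0.50, 0.45, 0.41, 0.408, 0.407])
```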
System            Parameterisation   %WER (Dev / Eval / Avg)
MPE GMM-HMM       HLDA               54.7 / 55.6 / 55.2
DNN-HMM hybrid    FBANK              43.5 / 42.6 / 43.1
This work                            40.0 / 39.3 / 39.7
Data Set   Parameterisation   %WER (Dev / Eval / Avg)
SDM        FBANK              43.5 / 42.6 / 43.1
IHM        FBANK              28.2 / 24.6 / 26.4
• 39.2% of the errors caused by acoustic distortion
• DNN-HMMs not so robust
• Discriminative pre-training
• Cross-entropy fine-tuning
Alignment   DNN input   %WER (Dev / Eval / Avg)
SDM         IHM         30.6 / 27.0 / 28.8
IHM         SDM         41.8 / 40.8 / 41.3
IHM         SDM         41.7 / 40.6 / 41.2

Using a 648-2,000^5-4,000 DNN (648 inputs, five 2,000-unit hidden layers, 4,000 outputs):
DNN training is more sensitive to noise than state alignment
Speech enhancement
Feature transformation
Multi-stream features
Speech enhancement
Feature transformation
Multi-stream features
Previous work
– Beamforming yields gains
– No investigation on single-microphone algorithms
• Based on linear, (almost) time-invariant filters
• Applied to complex-valued STFT coefficients
• The filters are adjusted automatically from the observations
– WPE for 1ch dereverberation (NTT’s work)
– BeamformIt for denoising (ICSI’s work)
• 8 microphones used, dedicated to meetings
• Unlikely to produce irregular transitions
y_{t,f} = x_{t,f} − Σ_{k=T_0}^{T_1} g_{k,f} x_{t−k,f}
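A minimal sketch of such a per-frequency subtractive filter applied to STFT coefficients (random data and fixed filter taps g here; the real WPE/BeamformIt filters are estimated from the observations, which is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 50, 4                      # frames, frequency bins
X = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# Per-frequency filter taps g[k, f] for lags k = T0 .. T1.
# Starting at a lag > 0 (as in dereverberation) leaves the direct
# path and early reflections untouched.
T0, T1 = 3, 5
g = 0.1 * (rng.standard_normal((T1 - T0 + 1, F))
           + 1j * rng.standard_normal((T1 - T0 + 1, F)))

# y[t, f] = x[t, f] - sum_k g[k, f] * x[t - k, f]
Y = X.copy()
for i, k in enumerate(range(T0, T1 + 1)):
    Y[k:, :] -= g[i, :] * X[:-k, :]
```

Because the filter is (almost) time-invariant and linear, the output changes smoothly from frame to frame, which is why this class of methods is unlikely to produce irregular transitions.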
Alignment   Dev: SDM / +Derev / BFIt (8 mics)   Eval: SDM / +Derev / BFIt (8 mics)
MPE         43.8 / 41.8 / 38.6                  43.0 / 41.3 / 36.6
Hybrid      43.5 / 41.7 / 38.8                  43.3 / 41.4 / 36.7
• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well
DNN size    Context frames   Dev: SDM / +Derev   Eval: SDM / +Derev
1,000 × 5   9                43.8 / 41.8         43.0 / 41.3
1,500 × 5   9                43.5 / 42.0         42.6 / 41.1
1,500 × 5   13               42.8 / 41.8         42.9 / 41.2
1,500 × 5   19               43.0 / 41.7         42.9 / 41.2
2,000 × 5   9                43.8 / 41.3         42.9 / 40.4
4.7% gain from 1ch dereverberation (relative)
Speech enhancement
Feature transformation
Multi-stream features
No positive results reported previously
• Applied to magnitude spectra
• Cross terms (often) ignored
• Frame-by-frame modification
– Harmful for DNN?
• Noise estimated using long-term statistics
– IMCRA (used here), minimum statistics, etc
• Deltas from un-enhanced speech
– Essential for obtaining gains
|y_{t,f}|² = |x_{t,f}|² + |n_{t,f}|²
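A sketch of power-domain spectral subtraction under the mismatch model above (synthetic spectra; the floor factor is a common heuristic, not a value from the slides, and the experiments estimate the noise with IMCRA rather than a fixed level):

```python
import numpy as np

rng = np.random.default_rng(1)
noisy_power = rng.uniform(1.0, 4.0, size=(20, 8))   # |y|^2 per (frame, bin)
noise_power = np.full((20, 8), 0.5)                 # long-term noise estimate

# Power spectral subtraction with flooring; cross terms are ignored,
# consistent with the mismatch model above.
floor = 0.05
clean_power = np.maximum(noisy_power - noise_power, floor * noisy_power)
```

The floor guards against negative power estimates when the noise estimate overshoots, which is one source of the frame-by-frame artefacts that may be harmful for a DNN.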
• Applied to FBANK features
• The following mismatch function used
• Frame-by-frame modification
• Noise model estimated with EM
• Deltas from un-enhanced speech
y_t = x_t + h + log(1 + exp(n_t − x_t − h))
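The mismatch function can be evaluated directly; it is simply a log-domain soft maximum of the clean-plus-channel and noise terms (made-up values below):

```python
import numpy as np

def mismatch(x, n, h):
    """FBANK-domain mismatch function: y = x + h + log(1 + exp(n - x - h))."""
    return x + h + np.log1p(np.exp(n - x - h))

x = np.array([10.0, 5.0, 2.0])    # clean log-mel energies (illustrative)
h = np.array([-0.5, -0.5, -0.5])  # channel term
n = np.array([1.0, 1.0, 1.0])     # noise log-mel energies
y = mismatch(x, n, h)
```

Algebraically y = log(exp(x + h) + exp(n)): in the linear domain the noisy energy is the sum of the filtered clean energy and the noise energy, which is exactly the model EM fits here.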
Enhancement target     %WER (Dev / Eval / Avg)
Spectrum   Feature
N          N           42.0 / 41.1 / 41.6
Y          N           41.3 / 40.9 / 41.1
N          Y           41.4 / 40.5 / 41.0
Y          Y           42.0 / 41.0 / 41.5

• Small, consistent gains from either method alone
• Cascading the two enhancement methods cancels the gains
Using the multi-stream approach:

Enhancement target               %WER (Dev / Eval / Avg)
Spectrum   Feature
N          N                     42.0 / 41.1 / 41.6
Y          N                     41.3 / 40.9 / 41.1
N          Y                     41.4 / 40.5 / 41.0
Y          Y                     42.0 / 41.0 / 41.5
Y          Y (multi-stream)      41.4 / 40.4 / 40.9
Speech enhancement
Feature transformation
Multi-stream features
• Frame level
– FMPE, RDT, FE-CMLLR
– Seems to be subsumed by DNN
• Speaker (or environment) level
– Global CMLLR, LIN, fDLR, VTLN
– Multiple decoding passes required → SAT
• Utterance level
– Single-pass decoding → SI
• Seems robust against supervision errors
• STC transform L used to deal with correlations:

  y_t^(s) = A^(s) L x_t + b^(s)
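Applying a speaker-level affine transform of this form is one matrix-vector operation per frame; in this sketch the global decorrelating transform L is folded into A, and the toy dimensions and values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
x = rng.standard_normal((10, d))   # features for one speaker (toy dimensions)

# Hypothetical speaker transform: a near-identity matrix A and a bias b,
# standing in for an estimated CMLLR transform (with L folded into A).
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b = rng.standard_normal(d)

# y_t = A x_t + b, applied to every frame of that speaker.
y = x @ A.T + b
```

A block-diagonal A constrains statics and deltas to transform independently, which is the variant that performs best in the table below.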
Form of speaker transform   %WER (Dev / Eval / Avg)
None (SI)                   42.6 / 40.2 / 41.4
Full                        37.4 / 37.4 / 37.4
Block diagonal              37.3 / 36.6 / 37.0
• ~10% relative gains obtained
• “Block diagonal” outperforms “full”
Form of speaker transform   %WER (Dev / Eval / Avg)
SDM:
  None (SI)                 42.6 / 40.2 / 41.4
  Full                      37.4 / 37.4 / 37.4
  Block diagonal            37.3 / 36.6 / 37.0
IHM:
  None (SI)                 27.8 / 24.2 / 26.0
  Full                      23.8 / 21.6 / 22.7
y_t = A^(c(u)) L x_t + b^(c(u)),  where c(u) denotes the cluster assigned to utterance u
Clustering performed using:
– utterance-specific iVectors
– K-means (GMM yielded similar performance figures)
m^(u) = m^(0) + T w^(u)

– m^(0): UBM mean supervector
– T: variability subspace matrix
– w^(u): utterance-specific iVector
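A toy sketch of the subspace model and the utterance clustering (dimensions, values, and the 2-cluster setting are illustrative stand-ins; the experiments use 16-64 clusters found with K-means):

```python
import numpy as np

rng = np.random.default_rng(3)
D, R, U = 6, 2, 8            # supervector dim, subspace rank, #utterances (toy)

m0 = rng.standard_normal(D)          # UBM mean supervector m(0)
T_mat = rng.standard_normal((D, R))  # variability subspace matrix T
w = rng.standard_normal((U, R))      # per-utterance iVectors w(u)

# m(u) = m(0) + T w(u): each utterance as a low-rank deviation from the UBM.
m = m0 + w @ T_mat.T

# Cluster utterances on their iVectors: one nearest-centroid K-means step,
# seeding the centroids from the first two iVectors.
centroids = w[:2].copy()
assign = np.argmin(((w[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
```

Clustering in the low-dimensional iVector space, rather than on the supervectors, keeps the per-utterance assignment cheap enough for single-pass decoding.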
Subspace representation of the deviation from
UBM
[Figure: variability subspace around the UBM mean m(0), with example supervectors m(1), m(2), m(3)]
#Clusters     %WER (Dev / Eval / Avg)
SDM:
  No QCMLLR   41.9 / 40.9 / 41.4
  64          41.0 / 40.4 / 40.7
  32          41.0 / 40.0 / 40.5
  16          41.5 / 40.5 / 41.0
IHM:
  No QCMLLR   27.8 / 24.2 / 26.0
  32          26.9 / 23.5 / 25.2
• Using 32 clusters yielded best performance
• Similar gains on both SDM and IHM
Speech enhancement
Feature transformation
Multi-stream features
• Originally proposed by Aachen for shallow MLP
tandem configurations
• Exploits DNN’s insensitivity to the increase in input
dimensionality
• (Hopefully) complement features masked by noise
• Allows multiple enhancement results to be
combined
• Four types of auxiliary features investigated:
– MFCC (Δ/Δ2)
– PLP
– Gammatone cepstra
• Different frequency warping
• STFT not used
– Intra-frame delta
• Emphasises spectral peaks/dips
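Multi-stream input construction is plain frame-level concatenation; the stream dimensionalities below mirror the table's feature counts (72-dim FBANK baseline, 13-dim auxiliary streams), with random values standing in for real features:

```python
import numpy as np

rng = np.random.default_rng(4)
n_frames = 100
fbank = rng.standard_normal((n_frames, 72))  # FBANK + deltas (baseline, 72-dim)
mfcc = rng.standard_normal((n_frames, 13))   # auxiliary MFCC stream
plp = rng.standard_normal((n_frames, 13))    # auxiliary PLP stream

# Multi-stream input: concatenate the streams frame by frame; the DNN
# simply sees one wider input vector per frame.
multi = np.concatenate([fbank, mfcc, plp], axis=1)
```

This relies on the DNN's insensitivity to input dimensionality noted above; a GMM system would need explicit decorrelation or dimensionality reduction first.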
Feature set               #features   %WER (Dev / Eval / Avg)
FBANK+Δ+Δ² (baseline)     72          41.9 / 40.9 / 41.4
+PLP                      85          40.7 / 40.3 / 40.5
+Gammatone                88          40.8 / 40.0 / 40.4
+MFCC                     85          41.1 / 39.7 / 40.4
+MFCC+Δ+Δ²                111         40.6 / 40.2 / 40.4
+intra-frame Δ+Δ²         120         40.9 / 39.8 / 40.4
+MFCC+intra-frame Δ+Δ²    133         40.4 / 39.8 / 40.1
• Speech enhancement
– Linear filtering
– Spectral/feature enhancement
• Feature transformation
– Quantised CMLLR
– (Global CMLLR for SAT)
• Multi-stream features
Front-end                  %WER (Dev / Eval / Avg)
FBANK baseline             43.1 / 42.4 / 42.8
+WPE                       41.8 / 40.7 / 41.3
+MFCC+intra-frame Δ+Δ²     40.5 / 40.1 / 40.3
+IMCRA+FE-VTS              40.0 / 39.3 / 39.7
+QCMLLR                    40.9 / 39.5 / 40.2
• Effects additive except for QCMLLR
• QCMLLR may work if applied to the entire feature set
System                     Parameterisation   %WER (Dev / Eval / Avg)
SAT GMM-HMM, MPE trained   HLDA               48.8 / 50.2 / 49.5
SAT tandem, MPE trained    FBANK              40.7 / 40.9 / 40.8
SI hybrid                  FBANK              43.5 / 42.6 / 43.1
• Outperforms SAT GMM-HMM
• Outperforms SI hybrid
Front-end        %WER (Dev / Eval / Avg)
FBANK baseline   40.1 / 41.3 / 40.7
+WPE             38.9 / 39.3 / 39.1
+MFCC            38.5 / 38.5 / 38.5
+IMCRA+FE-VTS    38.4 / 38.7 / 38.6
+CMLLR           36.6 / 36.7 / 36.7
+CMLLR           36.9 / 37.0 / 37.0
+CMLLR           38.4 / 38.6 / 38.5
• Effects of WPE and CMLLR are additive
• Using auxiliary features yields small gains over CMLLR
features
• Denoising subsumed by CMLLR (as expected)
• Front-end processing approaches yield gains
over state-of-the-art DNN-based AMs
– Linear filtering (WPE, BeamformIt)
– Spectral/feature enhancement (IMCRA, FE-VTS)
– Feature transformation (QCMLLR, CMLLR)
– Multi-stream features
• Possible to combine different classes of
approaches
More Related Content

What's hot

Overview of sampling
Overview of samplingOverview of sampling
Overview of samplingSagar Kumar
 
Slide Handouts with Notes
Slide Handouts with NotesSlide Handouts with Notes
Slide Handouts with NotesLeon Nguyen
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderAkira Tamamori
 
EC8553 Discrete time signal processing
EC8553 Discrete time signal processing EC8553 Discrete time signal processing
EC8553 Discrete time signal processing ssuser2797e4
 
Non-Uniform sampling and reconstruction of multi-band signals
Non-Uniform sampling and reconstruction of multi-band signalsNon-Uniform sampling and reconstruction of multi-band signals
Non-Uniform sampling and reconstruction of multi-band signalsmravendi
 
1 AUDIO SIGNAL PROCESSING
1 AUDIO SIGNAL PROCESSING1 AUDIO SIGNAL PROCESSING
1 AUDIO SIGNAL PROCESSINGmukesh bhardwaj
 
Dsp 2018 foehu - lec 10 - multi-rate digital signal processing
Dsp 2018 foehu - lec 10 - multi-rate digital signal processingDsp 2018 foehu - lec 10 - multi-rate digital signal processing
Dsp 2018 foehu - lec 10 - multi-rate digital signal processingAmr E. Mohamed
 
SAMPLING & RECONSTRUCTION OF DISCRETE TIME SIGNAL
SAMPLING & RECONSTRUCTION  OF DISCRETE TIME SIGNALSAMPLING & RECONSTRUCTION  OF DISCRETE TIME SIGNAL
SAMPLING & RECONSTRUCTION OF DISCRETE TIME SIGNALkaran sati
 
Fft analysis
Fft analysisFft analysis
Fft analysisSatrious
 
Audio Processing
Audio ProcessingAudio Processing
Audio Processinganeetaanu
 
Basics of Digital Filters
Basics of Digital FiltersBasics of Digital Filters
Basics of Digital Filtersop205
 
The Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT)The Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT)Oka Danil
 
Aliasing and Antialiasing filter
Aliasing and Antialiasing filterAliasing and Antialiasing filter
Aliasing and Antialiasing filterSuresh Mohta
 
SPEKER RECOGNITION UNDER LIMITED DATA CODITION
SPEKER RECOGNITION UNDER LIMITED DATA CODITIONSPEKER RECOGNITION UNDER LIMITED DATA CODITION
SPEKER RECOGNITION UNDER LIMITED DATA CODITIONniranjan kumar
 
DSP_2018_FOEHU - Lec 06 - FIR Filter Design
DSP_2018_FOEHU - Lec 06 - FIR Filter DesignDSP_2018_FOEHU - Lec 06 - FIR Filter Design
DSP_2018_FOEHU - Lec 06 - FIR Filter DesignAmr E. Mohamed
 

What's hot (20)

Overview of sampling
Overview of samplingOverview of sampling
Overview of sampling
 
Slide Handouts with Notes
Slide Handouts with NotesSlide Handouts with Notes
Slide Handouts with Notes
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
 
EC8553 Discrete time signal processing
EC8553 Discrete time signal processing EC8553 Discrete time signal processing
EC8553 Discrete time signal processing
 
Multrate dsp
Multrate dspMultrate dsp
Multrate dsp
 
Non-Uniform sampling and reconstruction of multi-band signals
Non-Uniform sampling and reconstruction of multi-band signalsNon-Uniform sampling and reconstruction of multi-band signals
Non-Uniform sampling and reconstruction of multi-band signals
 
1 AUDIO SIGNAL PROCESSING
1 AUDIO SIGNAL PROCESSING1 AUDIO SIGNAL PROCESSING
1 AUDIO SIGNAL PROCESSING
 
Multirate dtsp
Multirate dtspMultirate dtsp
Multirate dtsp
 
Dsp 2018 foehu - lec 10 - multi-rate digital signal processing
Dsp 2018 foehu - lec 10 - multi-rate digital signal processingDsp 2018 foehu - lec 10 - multi-rate digital signal processing
Dsp 2018 foehu - lec 10 - multi-rate digital signal processing
 
SAMPLING & RECONSTRUCTION OF DISCRETE TIME SIGNAL
SAMPLING & RECONSTRUCTION  OF DISCRETE TIME SIGNALSAMPLING & RECONSTRUCTION  OF DISCRETE TIME SIGNAL
SAMPLING & RECONSTRUCTION OF DISCRETE TIME SIGNAL
 
Fft analysis
Fft analysisFft analysis
Fft analysis
 
Audio Processing
Audio ProcessingAudio Processing
Audio Processing
 
Basics of Digital Filters
Basics of Digital FiltersBasics of Digital Filters
Basics of Digital Filters
 
Lecture9
Lecture9Lecture9
Lecture9
 
The Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT)The Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT)
 
Aliasing and Antialiasing filter
Aliasing and Antialiasing filterAliasing and Antialiasing filter
Aliasing and Antialiasing filter
 
Signal Processing
Signal ProcessingSignal Processing
Signal Processing
 
SPEKER RECOGNITION UNDER LIMITED DATA CODITION
SPEKER RECOGNITION UNDER LIMITED DATA CODITIONSPEKER RECOGNITION UNDER LIMITED DATA CODITION
SPEKER RECOGNITION UNDER LIMITED DATA CODITION
 
Digital signal processing part1
Digital signal processing part1Digital signal processing part1
Digital signal processing part1
 
DSP_2018_FOEHU - Lec 06 - FIR Filter Design
DSP_2018_FOEHU - Lec 06 - FIR Filter DesignDSP_2018_FOEHU - Lec 06 - FIR Filter Design
DSP_2018_FOEHU - Lec 06 - FIR Filter Design
 

Similar to Environmentally robust ASR front end for DNN-based acoustic models

Text independent speaker recognition system
Text independent speaker recognition systemText independent speaker recognition system
Text independent speaker recognition systemDeepesh Lekhak
 
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...NUGU developers
 
Digital Signal Processor evolution over the last 30 years
Digital Signal Processor evolution over the last 30 yearsDigital Signal Processor evolution over the last 30 years
Digital Signal Processor evolution over the last 30 yearsFrancois Charlot
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCCHira Shaukat
 
Final presentation
Final presentationFinal presentation
Final presentationRohan Lad
 
Introduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionIntroduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionNAVER Engineering
 
Digital signal processing
Digital signal processingDigital signal processing
Digital signal processingVedavyas PBurli
 
DNN-based frequency-domain permutation solver for multichannel audio source s...
DNN-based frequency-domain permutation solver for multichannel audio source s...DNN-based frequency-domain permutation solver for multichannel audio source s...
DNN-based frequency-domain permutation solver for multichannel audio source s...Kitamura Laboratory
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesisNAVER Engineering
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPCDisha Modi
 
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...niranjan kumar
 
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...Yun-Nung (Vivian) Chen
 
DSP Lesson 1 Slides (1).pdf
DSP Lesson 1 Slides (1).pdfDSP Lesson 1 Slides (1).pdf
DSP Lesson 1 Slides (1).pdfPearlInc1
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition finalArchit Vora
 
A Distributed System for Recognizing Home Automation Commands and Distress Ca...
A Distributed System for Recognizing Home Automation Commands and Distress Ca...A Distributed System for Recognizing Home Automation Commands and Distress Ca...
A Distributed System for Recognizing Home Automation Commands and Distress Ca...a3labdsp
 
Introduction to ELINT Analyses
Introduction to ELINT AnalysesIntroduction to ELINT Analyses
Introduction to ELINT AnalysesJoseph Hennawy
 
SBE Filter Tuning 101 by Jeremy Ruck November 2015
SBE Filter Tuning 101 by Jeremy Ruck November 2015SBE Filter Tuning 101 by Jeremy Ruck November 2015
SBE Filter Tuning 101 by Jeremy Ruck November 2015kmsavage
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisRushin Shah
 

Similar to Environmentally robust ASR front end for DNN-based acoustic models (20)

Text independent speaker recognition system
Text independent speaker recognition systemText independent speaker recognition system
Text independent speaker recognition system
 
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
 
Digital Signal Processor evolution over the last 30 years
Digital Signal Processor evolution over the last 30 yearsDigital Signal Processor evolution over the last 30 years
Digital Signal Processor evolution over the last 30 years
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCC
 
ISSCS2011
ISSCS2011ISSCS2011
ISSCS2011
 
Final presentation
Final presentationFinal presentation
Final presentation
 
Introduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionIntroduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detection
 
Digital signal processing
Digital signal processingDigital signal processing
Digital signal processing
 
DNN-based frequency-domain permutation solver for multichannel audio source s...
DNN-based frequency-domain permutation solver for multichannel audio source s...DNN-based frequency-domain permutation solver for multichannel audio source s...
DNN-based frequency-domain permutation solver for multichannel audio source s...
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesis
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPC
 
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL RE...
 
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Lan...
 
DSP Lesson 1 Slides (1).pdf
DSP Lesson 1 Slides (1).pdfDSP Lesson 1 Slides (1).pdf
DSP Lesson 1 Slides (1).pdf
 
Speaker Segmentation (2006)
Speaker Segmentation (2006)Speaker Segmentation (2006)
Speaker Segmentation (2006)
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
 
A Distributed System for Recognizing Home Automation Commands and Distress Ca...
A Distributed System for Recognizing Home Automation Commands and Distress Ca...A Distributed System for Recognizing Home Automation Commands and Distress Ca...
A Distributed System for Recognizing Home Automation Commands and Distress Ca...
 
Introduction to ELINT Analyses
Introduction to ELINT AnalysesIntroduction to ELINT Analyses
Introduction to ELINT Analyses
 
SBE Filter Tuning 101 by Jeremy Ruck November 2015
SBE Filter Tuning 101 by Jeremy Ruck November 2015SBE Filter Tuning 101 by Jeremy Ruck November 2015
SBE Filter Tuning 101 by Jeremy Ruck November 2015
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech Analysis
 

Recently uploaded

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 

Recently uploaded (20)

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 

Environmentally robust ASR front end for DNN-based acoustic models

  • 1.
  • 2. • Do not compare results across different tables! – Configurations may differ • Most results shown here can be found in: Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65-86, May 2015
  • 3. 1. Motivation 2. Corpus • AMI meeting corpus 3. Baseline systems • SI and SAT set-ups 4. Assessment of environmental robustness of DNN acoustic models 5. Front-end techniques 6. Combined effects
  • 4.
  • 5.
  • 7. • Multi-party interaction – 4 participants in each meeting • Multi-channel recordings – Distant microphones – only first channel used – Head-set & lapel microphones • 2 recording set-ups – 70h scenario-based meetings – 30h real meetings
  • 8. • Different rooms • Multiple sources of distortion – Reverberation – Additive noise – Overlapping speech • Moving speakers • Many non-natives
  • 9. • SI : speaker independent – For online transcription – DNN-HMM hybrid • SAT: speaker adaptive training – For offline transcription – MLP tandem
  • 10.
  • 11. • Manual segmentations used • Overlapping segments ignored
  • 12.
  • 13. State output distributions modelled with – GMM or – DNN ¦ – / Q T t tttt qpqqPqPp q xX 1 10 )|()|()()|( ¦ M m jmjm mjmt Ncjp 1 )()( ),;()|( Σμxx )( )|( )|( jp jp jp t t x x
  • 14. Æ • Discriminative pre-training • Cross entropy fine-tuning • Discriminative pre-training
  • 15. • Trained on Telta K20 • cuBLAS 5.5 used • Mini-batch size: 800 frames • Learning rate: “newbob” scheduling • 10% held-out data for CV
  • 16.
    System            Parameterisation   %WER (Dev / Eval / Avg)
    MPE GMM-HMM       HLDA               54.7 / 55.6 / 55.2
    DNN-HMM hybrid    FBANK              43.5 / 42.6 / 43.1
    This work                            40.0 / 39.3 / 39.7
  • 18.
    Data set   Parameterisation   %WER (Dev / Eval / Avg)
    SDM        FBANK              43.5 / 42.6 / 43.1
    IHM        FBANK              28.2 / 24.6 / 26.4
    • 39.2% of the errors caused by acoustic distortion
    • DNN-HMMs not so robust
  • 20.
    • Discriminative pre-training
    • Cross-entropy fine-tuning
  • 22.
    Alignment   DNN input   %WER (Dev / Eval / Avg)
    SDM         IHM         30.6 / 27.0 / 28.8
    IHM         SDM         41.8 / 40.8 / 41.3
    IHM         SDM         41.7 / 40.6 / 41.2
    (Using 648-2,000 5-4,000 DNN)
    DNN training more sensitive to noise than state alignment
  • 26. Previous work:
    – Beamforming yields gains
    – No investigation on single-microphone algorithms
  • 27.
    • Based on linear, (almost) time-invariant filters
    • Applied to complex-valued STFT coefficients
    • The filters automatically adjusted using observations
      – WPE for 1ch dereverberation (NTT’s work)
      – BeamformIt for denoising (ICSI’s work)
        • 8 microphones used, dedicated to meetings
    • Unlikely to produce irregular transitions
    Filtering: y_{t,f} = x_{t,f} - \sum_{k=T_0}^{T_1} g_{k,f} \, x_{t-k,f}
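The filtering operation above can be sketched as follows: each STFT frequency bin is passed through a delayed FIR filter whose output is subtracted from the observation, removing the predicted late reverberation. The filter taps here are placeholders; WPE estimates them from the observed signal:

```python
import numpy as np

def apply_derev_filter(X, G, T0):
    """Dereverberation by delayed linear prediction.
    X: (frames, freqs) complex STFT coefficients.
    G: (taps, freqs) per-bin filter coefficients (assumed given).
    y[t,f] = x[t,f] - sum_k g[k,f] * x[t - T0 - k, f]."""
    T = X.shape[0]
    K = G.shape[0]
    Y = X.copy()
    for t in range(T):
        for k in range(K):
            tau = t - T0 - k          # delayed past frame
            if tau >= 0:              # skip before signal start
                Y[t] -= G[k] * X[tau]
    return Y

# Toy example: constant spectrum, single tap, prediction delay T0 = 1.
X = np.ones((4, 2), dtype=complex)
G = 0.5 * np.ones((1, 2))
Y = apply_derev_filter(X, G, 1)
```

The prediction delay T0 leaves the early reflections (useful for ASR) intact and subtracts only the late tail, which is what keeps the output free of irregular transitions.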
  • 28.
    Alignment   Dev: SDM / +Derev / BFIt (8 mics)   Eval: SDM / +Derev / BFIt (8 mics)
    MPE         43.8 / 41.8 / 38.6                  43.0 / 41.3 / 36.6
    Hybrid      43.5 / 41.7 / 38.8                  43.3 / 41.4 / 36.7
    • Dereverberation helps even with a single microphone
    • Multi-microphone beamforming works well
  • 29.
    DNN size    Context frames   Dev: SDM / +Derev   Eval: SDM / +Derev
    1,000 x 5   9                43.8 / 41.8         43.0 / 41.3
    1,500 x 5   9                43.5 / 42.0         42.6 / 41.1
                13               42.8 / 41.8         42.9 / 41.2
                19               43.0 / 41.7         42.9 / 41.2
    2,000 x 5   9                43.8 / 41.3         42.9 / 40.4
    4.7% relative gain from 1ch dereverberation
  • 31. No positive results reported previously
  • 32.
    • Applied to magnitude spectra
      – Mismatch model: |y_{t,f}|^2 = |x_{t,f}|^2 + |n_{t,f}|^2
      – Cross terms (often) ignored
    • Frame-by-frame modification
      – Harmful for DNN?
    • Noise estimated using long-term statistics
      – IMCRA (used here), minimum statistics, etc.
    • Deltas from un-enhanced speech
      – Essential for obtaining gains
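Under the additive mismatch model above (cross terms ignored), the enhancement step amounts to subtracting an estimated noise power spectrum from the observed one, with flooring to avoid negative power. This is a plain spectral-subtraction sketch, not the IMCRA estimator itself, and the flooring factor is a made-up illustrative value:

```python
import numpy as np

def enhance_power(noisy_power, noise_power, floor_factor=0.01):
    """Subtract the noise power estimate |n|^2 from the observed
    power |y|^2, flooring the result at a fraction of |y|^2 so the
    enhanced spectrum stays positive."""
    clean = noisy_power - noise_power
    floor = floor_factor * noisy_power
    return np.maximum(clean, floor)
```

The flooring is what produces the frame-by-frame modifications the slide flags as potentially harmful for DNN inputs, which is why the deltas are computed from un-enhanced speech.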
  • 34.
    • Applied to FBANK features
    • The following mismatch function used:
      y_t = x_t + h + log(1 + exp(n_t - x_t - h))
    • Frame-by-frame modification
    • Noise model estimated with EM
    • Deltas from un-enhanced speech
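The mismatch function above relates the noisy log-filterbank features y_t to clean speech x_t, additive noise n_t, and a channel term h, per FBANK bin. A direct transcription, useful for checking its limiting behaviour:

```python
import numpy as np

def mismatch(x, n, h):
    """VTS-style mismatch function in the log-filterbank domain:
    y = x + h + log(1 + exp(n - x - h))."""
    return x + h + np.log1p(np.exp(n - x - h))
```

In the speech-dominant limit (x + h much larger than n) the function reduces to y ≈ x + h, and in the noise-dominant limit to y ≈ n, which is the behaviour the feature-domain enhancement inverts.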
  • 35.
    Enhancement target (Spectrum / Feature)   %WER (Dev / Eval / Avg)
    N / N   42.0 / 41.1 / 41.6
    Y / N   41.3 / 40.9 / 41.1
    N / Y   41.4 / 40.5 / 41.0
    Y / Y   42.0 / 41.0 / 41.5
    • Small consistent gains
    • Different methods should not be cascaded
  • 36.
    Enhancement target (Spectrum / Feature)   %WER (Dev / Eval / Avg)
    N / N   42.0 / 41.1 / 41.6
    Y / N   41.3 / 40.9 / 41.1
    N / Y   41.4 / 40.5 / 41.0
    Y / Y   42.0 / 41.0 / 41.5
    Y / Y (multi-stream)   41.4 / 40.4 / 40.9
  • 38.
    • Frame level
      – FMPE, RDT, FE-CMLLR
      – Seems to be subsumed by DNN
    • Speaker (or environment) level
      – Global CMLLR, LIN, fDLR, VTLN
      – Multiple decoding passes required → SAT
    • Utterance level
      – Single-pass decoding → SI
  • 39.
    • Seems robust against supervision errors
    • STC transform L used to deal with correlations:
      y_t^(s) = A^(s) L x_t + b^(s)
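Applying the speaker transform above is a per-frame affine map: decorrelate with the global STC matrix L, then apply the speaker-specific (A^(s), b^(s)). The matrices in the example are identity placeholders; in practice they are estimated by maximum likelihood on the speaker's adaptation data:

```python
import numpy as np

def apply_cmllr(X, A, b, L):
    """CMLLR feature transform with a global STC decorrelation.
    X: (frames, dim) features. Returns A (L x_t) + b for every frame."""
    return (A @ (L @ X.T)).T + b

# Identity transforms leave the features unchanged (sanity check).
X = np.arange(6, dtype=float).reshape(2, 3)
out = apply_cmllr(X, np.eye(3), np.zeros(3), np.eye(3))
```

Restricting A to a block-diagonal structure over the static/delta/delta-delta blocks reduces the parameter count, which is consistent with the later result that "block diagonal" outperforms "full".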
  • 42.
    Form of speaker transform   %WER (Dev / Eval / Avg)
    None (SI)        42.6 / 40.2 / 41.4
    Full             37.4 / 37.4 / 37.4
    Block diagonal   37.3 / 36.6 / 37.0
    • ~10% relative gains obtained
    • “Block diagonal” outperforms “full”
  • 43.
    Form of speaker transform   %WER (Dev / Eval / Avg)
    None (SI)        42.6 / 40.2 / 41.4
    Full             37.4 / 37.4 / 37.4
    Block diagonal   37.3 / 36.6 / 37.0
    On IHM data set:
    None (SI)        27.8 / 24.2 / 26.0
    Full             23.8 / 21.6 / 22.7
  • 44.
    y_t = A^(c(u)) L x_t + b^(c(u)),   c(u): utterance u → cluster
    Clustering performed using:
    – utterance-specific iVectors
    – K-means (GMM yielded similar performance figures)
  • 45.
    m^(u) = m^(0) + T w^(u)
    Subspace representation of the deviation from the UBM mean m^(0): the columns of T span the variability subspace and w^(u) is the utterance iVector.
    (Figure: UBM mean m^(0) with utterance means m^(1), m^(2), m^(3) lying in the variability subspace)
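The clustering step described on the previous slide can be sketched as follows: utterance-level iVectors are grouped with k-means, and each utterance u is then assigned the transform of its cluster c(u). A tiny hand-rolled k-means keeps the example self-contained, and the iVectors are synthetic:

```python
import numpy as np

def kmeans_assign(ivectors, n_clusters, n_iter=20, seed=0):
    """Assign each utterance iVector to one of n_clusters clusters
    with plain k-means; returns the cluster index c(u) per utterance."""
    rng = np.random.default_rng(seed)
    # Initialise centres from randomly chosen iVectors.
    centres = ivectors[rng.choice(len(ivectors), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance of every iVector to every centre: (utts, clusters).
        d = np.linalg.norm(ivectors[:, None] - centres[None], axis=2)
        assign = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):  # skip empty clusters
                centres[c] = ivectors[assign == c].mean(axis=0)
    return assign

# Two well-separated synthetic "acoustic conditions".
ivs = np.vstack([np.zeros((5, 2)), 10.0 * np.ones((5, 2))])
assign = kmeans_assign(ivs, 2)
```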
  • 49.
    #Clusters    %WER (Dev / Eval / Avg)
    No QCMLLR    41.9 / 40.9 / 41.4
    64           41.0 / 40.4 / 40.7
    32           41.0 / 40.0 / 40.5
    16           41.5 / 40.5 / 41.0
    On IHM data set:
    No QCMLLR    27.8 / 24.2 / 26.0
    32           26.9 / 23.5 / 25.2
  • 50. • Using 32 clusters yielded best performance • Similar gains on both SDM and IHM
  • 52.
    • Originally proposed by Aachen for shallow MLP tandem configurations
    • Exploits DNN’s insensitivity to the increase in input dimensionality
    • (Hopefully) complements features masked by noise
    • Allows multiple enhancement results to be combined
  • 53. Four types of auxiliary features investigated:
    – MFCC (Δ/Δ2)
    – PLP
    – Gammatone cepstra
      • Different frequency warping
      • STFT not used
    – Intra-frame delta ( )
      • Emphasises spectral peaks/dips
  • 54.
    Feature set                #features   %WER (Dev / Eval / Avg)
    FBANK+Δ+Δ2 (baseline)      72          41.9 / 40.9 / 41.4
    +PLP                       85          40.7 / 40.3 / 40.5
    +Gammatone                 88          40.8 / 40.0 / 40.4
    +MFCC                      85          41.1 / 39.7 / 40.4
    +MFCC+Δ+Δ2                 111         40.6 / 40.2 / 40.4
    + + 2                      120         40.9 / 39.8 / 40.4
    +MFCC+ + 2                 133         40.4 / 39.8 / 40.1
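The multi-stream input itself is simply a per-frame concatenation of the FBANK stream with the auxiliary streams, relying on the DNN's tolerance of high-dimensional inputs. The dimensions below are illustrative (72-dimensional FBANK+deltas plus a 39-dimensional MFCC stream):

```python
import numpy as np

def stack_streams(*streams):
    """Concatenate per-frame feature streams along the feature axis.
    Every stream must cover the same number of frames."""
    frames = streams[0].shape[0]
    assert all(s.shape[0] == frames for s in streams)
    return np.concatenate(streams, axis=1)

fbank = np.zeros((100, 72))     # FBANK + deltas (baseline stream)
mfcc = np.zeros((100, 39))      # auxiliary MFCC stream
x = stack_streams(fbank, mfcc)  # DNN input: (frames, 72 + 39)
```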
  • 55.
    • Speech enhancement
      – Linear filtering
      – Spectral/feature enhancement
    • Feature transformation
      – Quantised CMLLR
      – (Global CMLLR for SAT)
    • Multi-stream features
  • 57.
    Front-end        %WER (Dev / Eval / Avg)
    FBANK baseline   43.1 / 42.4 / 42.8
    +WPE             41.8 / 40.7 / 41.3
    +MFCC+ + 2       40.5 / 40.1 / 40.3
    +IMCRA+FE-VTS    40.0 / 39.3 / 39.7
    +QCMLLR          40.9 / 39.5 / 40.2
    • Effects additive except for QCMLLR
    • QCMLLR may work if applied to the entire feature set
  • 59.
    System                     Parameterisation   %WER (Dev / Eval / Avg)
    SAT GMM-HMM, MPE trained   HLDA               48.8 / 50.2 / 49.5
    SAT tandem, MPE trained    FBANK              40.7 / 40.9 / 40.8
    SI hybrid                  FBANK              43.5 / 42.6 / 43.1
    • Outperforms SAT GMM-HMM
    • Outperforms SI hybrid
  • 61.
    Front-end        %WER (Dev / Eval / Avg)
    FBANK baseline   40.1 / 41.3 / 40.7
    +WPE             38.9 / 39.3 / 39.1
    +MFCC            38.5 / 38.5 / 38.5
    +IMCRA+FE-VTS    38.4 / 38.7 / 38.6
    +CMLLR           36.6 / 36.7 / 36.7
    +CMLLR           36.9 / 37.0 / 37.0
    +CMLLR           38.4 / 38.6 / 38.5
    • Effects of WPE and CMLLR are additive
    • Using auxiliary features yields small gains over CMLLR features
    • Denoising subsumed by CMLLR (as expected)
  • 62.
    • Front-end processing approaches yield gains over state-of-the-art DNN-based AMs
      – Linear filtering (WPE, BeamformIt)
      – Spectral/feature enhancement (IMCRA, FE-VTS)
      – Feature transformation (QCMLLR, CMLLR)
      – Multi-stream features
    • Possible to combine different classes of approaches