IEEE Transactions 2018 Topics with Abstracts in Audio, Speech, and Language Processing
For Details, Contact TSYS Academic Projects in Adyar.
Ph: 9841103123, 044-42607879
Website: http://www.tsysglobalsolutions.com/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-
Network-Based Speech Waveform Synthesis
ABSTRACT
This paper presents a mel-cepstrum-based quantization noise shaping method for
improving the quality of synthetic speech generated by neural-network-based speech waveform
synthesis systems. Since mel-cepstral coefficients closely match the characteristics of human
auditory perception, the proposed method effectively masks the white noise introduced by the
quantization typically used in neural-network-based speech waveform synthesis systems. The
paper also describes a computationally efficient implementation of the proposed method using
the structure of the mel-log spectrum approximation filter. Experiments using the WaveNet
generative model, which is a state-of-the-art model for neural-network-based speech waveform
synthesis, showed that speech quality is significantly improved by the proposed method.
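As a rough illustration of the error-feedback structure that underlies quantization noise shaping (a generic sketch, not the paper's mel-cepstrum/MLSA-filter-based implementation; the FIR feedback coefficients `b` are an illustrative assumption):

```python
import numpy as np

def noise_shaped_quantize(x, levels, b):
    """Uniform quantizer with error feedback: past quantization errors are
    filtered by FIR coefficients b and added back to the input, which shapes
    the spectrum of the quantization noise (here b is an illustrative choice,
    not a mel-cepstrum-derived filter)."""
    e = np.zeros(len(b))                       # most recent quantization errors
    out = np.empty_like(x, dtype=float)
    for n in range(len(x)):
        v = x[n] + b @ e                       # feed filtered error back in
        out[n] = np.round(v * levels) / levels # plain uniform quantizer
        e = np.roll(e, 1)
        e[0] = v - out[n]                      # newest quantization error
    return out
```

With `b` all zero this reduces to plain quantization with spectrally flat error; non-zero coefficients redistribute the noise power toward frequency regions where it is better masked.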
A Multi-Objective Learning and Ensembling Approach to High-Performance
Speech Enhancement with Compact Neural Network Architectures
ABSTRACT
In this study, we propose a novel deep neural network (DNN) architecture for speech
enhancement (SE) via a multi-objective learning and ensembling (MOLE) framework to achieve
a compact and low-latency design while maintaining good performance in quality evaluations.
MOLE follows the boosting concept of combining weak models into a strong classifier and
consists of two compact DNNs. The first, called the multi-objective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multi-objective set that includes clean speech features, dynamic noise features,
and ideal ratio mask (IRM). The second, called the multi-objective ensembling DNN (MOE-
DNN), takes the learned features from MOL-DNN as inputs and separately predicts clean LPS
and IRM, clean MFCC and IRM and clean GFCC and IRM using three sets of weak regression
functions. Finally, a post-processing operation can be applied to the estimated clean features by
leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech
corrupted by 15 noise types not seen in model training, the speech enhancement results show that
the MOLE approach, which features a small model size and low run-time latency, can achieve
consistent improvements over both DNN- and long short-term memory (LSTM)-based
techniques in terms of all the objective metrics evaluated in this study for all three cases (the
input contexts contain 1-frame, 4-frame, and 7-frame instances). The 1-frame MOLE-based SE system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame delay, and also achieves better performance than the LSTM-based SE system with a 4-frame, no-delay expansion, while including only 3 previous frames and incurring 170 times less processing latency.
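For reference, the ideal ratio mask named above is commonly defined from the clean speech and noise power at each time-frequency bin (a standard definition; the exact variant used in the paper may differ):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag):
    """Standard IRM: sqrt(S^2 / (S^2 + N^2)) per time-frequency bin,
    computed from magnitude spectrograms of clean speech and noise."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))  # small floor avoids 0/0
```

Multiplying the noisy magnitude spectrogram by the IRM attenuates noise-dominated bins while leaving speech-dominated bins largely intact.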
Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional
Recurrent Neural Networks
ABSTRACT
In recent years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN
with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate
classifiers for confidence estimation in automatic speech recognition. At the same time, we have
recently shown that speaker adaptation of confidence measures using DBLSTM yields
significant improvements over non-adapted confidence measures. In accordance with these two
recent contributions to the state of the art in confidence estimation, this paper presents a
comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM
models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence
classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the
Spanish poliMedia tasks. Secondly, we show new results on speaker-adapted confidence
measures considering a multi-task framework in which RNN-based confidence classifiers trained
with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm
that speaker-adapted confidence measures outperform their non-adapted counterparts. Lastly, we
describe an unsupervised adaptation method of the acoustic DBLSTM model based on
confidence measures which results in better automatic speech recognition performance.
Mispronunciation Detection in Children’s Reading of Sentences
ABSTRACT
This work proposes an approach to automatically parse children’s reading of sentences by
detecting word pronunciations and extra content, and to classify words as correctly or incorrectly
pronounced. This approach can be directly helpful for automatic assessment of reading level or
for automatic reading tutors, where a correct reading must be identified. We propose a first
segmentation stage to locate candidate word pronunciations based on allowing repetitions and
false starts of a word’s syllables. A decoding grammar based solely on syllables allows silence to
appear during a word pronunciation. At a second stage, word candidates are classified as
mispronounced or not. The feature that best classifies mispronunciations is found to be the log-
likelihood ratio between a free phone loop and a word spotting model in the very close vicinity
of the candidate segmentation. Additional features are combined in multi-feature models to
further improve classification, including: normalizations of the log-likelihood ratio, derivations
from phone likelihoods, and Levenshtein distances between the correct pronunciation and
recognized phonemes through two phoneme recognition approaches. Results show that most
extra events were detected (close to 2% word error rate achieved) and that using automatic
segmentation for mispronunciation classification approaches the performance of manual
segmentation. Although the log-likelihood ratio from a spotting approach is already a good
metric to classify word pronunciations, the combination of additional features provides a relative
reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from
35.58% to 29.35% using automatic segmentation, at constant 5% false alarm rate).
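The Levenshtein-distance feature mentioned above can be computed between the canonical and recognized phoneme sequences with standard dynamic programming (a generic sketch; the sequences shown in the test are only illustrative):

```python
def levenshtein(ref, hyp):
    """Edit distance between two phoneme sequences, where insertions,
    deletions, and substitutions each cost 1 (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i          # prev holds d[i-1][0] for this row
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # match / substitution
    return d[-1]
```

Normalizing this distance by the length of the canonical pronunciation gives a length-independent mispronunciation score.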
Analysis of the Reconstruction of Sparse Signals in the DCT Domain Applied
to Audio Signals
ABSTRACT
Sparse signals can be reconstructed from a reduced set of signal samples using
compressive sensing (CS) methods. The discrete cosine transform (DCT) can provide highly
concentrated representations of audio signals. This property makes the DCT a good sparsity domain for audio signals. In this paper, the DCT is studied within the context of sparse audio
signal processing using the CS theory and methods. The DCT coefficients of a sparse signal,
calculated with a reduced set of available samples, can be modeled as random variables. It has
been shown that the statistical properties of these variables are closely related to the unique
reconstruction conditions. The main result of the paper is an exact formula for the mean
square reconstruction error in the case of approximately sparse and nonsparse noisy signals,
reconstructed under the sparsity assumption. Based on the presented analysis, a simple and
computationally efficient reconstruction algorithm is proposed. The presented theoretical
concepts and the efficiency of the reconstruction algorithm are verified numerically, including
examples with synthetic and recorded audio signals with unavailable or corrupted samples.
Random disturbances and disturbances simulating clicks or inpainting in audio signals are
considered. Statistical verification is done on a dataset with experimental signals. Results are
compared with some classical and recent methods used in similar signal and disturbance
scenarios.
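A minimal sketch of this kind of reconstruction, under an assumed exactly K-sparse DCT model (a generic detect-then-least-squares compressive sensing scheme, not the paper's proposed algorithm):

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II analysis matrix C; synthesis is x = C.T @ X."""
    n = np.arange(N)
    k = n.reshape(-1, 1)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

def reconstruct(y, avail, N, K):
    """Reconstruct a length-N signal, assumed K-sparse in the DCT domain,
    from the available samples y taken at indices `avail`."""
    C = dct_matrix(N)
    A = C[:, avail].T                        # x[avail] = A @ X
    X0 = (N / len(avail)) * (A.T @ y)        # initial coefficient estimate
    support = np.argsort(np.abs(X0))[-K:]    # detect K strongest coefficients
    Xs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    X = np.zeros(N)
    X[support] = Xs                          # least-squares refit on support
    return C.T @ X                           # back to the time domain
```

For an exactly sparse signal with enough available samples, the least-squares refit on the detected support recovers the signal essentially exactly; the analysis in the paper concerns what happens when the sparsity assumption only holds approximately.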
Speech Dereverberation with Context-Aware Recurrent Neural Networks
ABSTRACT
In this paper, we propose a model that performs speech dereverberation by estimating the clean spectral magnitude from its reverberant counterpart. Our model is capable of extracting
features that take into account both short and long-term dependencies in the signal through a
convolutional encoder (which extracts features from a short, bounded context of frames) and a
recurrent neural network for extracting long-term information. Our model outperforms a recently
proposed model that uses different context information depending on the reverberation time,
without requiring any sort of additional input, yielding improvements of up to 0.4 on PESQ, 0.3
on STOI, and 1.0 on POLQA relative to reverberant speech. We also show our model is able to
generalize to real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening tests show the
proposed method outperforming benchmark models in reduction of perceived reverberation.
Do we need individual head-related transfer functions for vertical
localization? The case study of a spectral notch distance metric
ABSTRACT
This paper deals with the issue of individualizing the head-related transfer function
(HRTF) rendering process for auditory elevation perception: is it possible to find a
non-individual, personalized HRTF set that allows a listener to achieve localization performance as accurate as with his/her individual HRTFs? We propose a psychoacoustically motivated, anthropometry-based mismatch function between HRTF pairs that exploits the close
relation between the listener’s pinna geometry and localization cues. This is evaluated using an
auditory model that computes a mapping between HRTF spectra and perceived spatial locations.
Results on a large number of subjects in the CIPIC and ARI HRTF databases suggest that there
exists a non-individual HRTF set that allows a listener to achieve vertical localization as accurate as with individual HRTFs. Furthermore, we find the optimal parametrization of
the proposed mismatch function, i.e. the one that best reflects the information given by the
auditory model. Our findings show that the selection procedure yields statistically significant
improvements with respect to dummy-head HRTFs or random HRTF selection, with potentially
high impact from an application point of view.
Interaural Coherence Preservation for Binaural Noise Reduction Using
Partial Noise Estimation and Spectral Postfiltering
ABSTRACT
The objective of binaural speech enhancement algorithms is to reduce the undesired noise
component, while preserving the desired speech source and the binaural cues of all sound
sources. For the scenario of a single desired speech source in a diffuse noise field, an extension
of the binaural multi-channel Wiener filter (MWF), namely the MWF-IC, has been recently
proposed, which aims to preserve the interaural coherence (IC) of the noise component.
However, due to the large complexity of the MWF-IC, in this paper we propose several alternative algorithms with lower computational complexity. First, we consider a quasi-distortionless version of the MWF-IC, denoted as MVDR-IC. Secondly, we propose to preserve the IC of the noise component using the binaural MWF with partial noise estimation (MWF-N) and the binaural minimum variance distortionless response beamformer with partial
noise estimation (MVDR-N), for which closed-form expressions exist. In addition, we show that
for the MVDR-N a closed-form expression can be derived for the tradeoff parameter yielding a
desired magnitude squared coherence (MSC) for the output noise component. Since contrary to
the MWF-IC and the MWF-N the MVDR-IC and the MVDR-N do not take into account the
spectro-temporal properties of the speech and the noise components, we propose to apply a
spectral postfilter to the filter outputs, improving the noise reduction performance. The
performance of all algorithms is compared in several diffuse noise scenarios. The simulation
results show that both the MVDR-IC and the MVDR-N are able to preserve the MSC of the
noise component, while generally the MVDR-IC shows a slightly better noise reduction
performance at a larger complexity. Further simulation results show that applying a spectral
postfilter leads to a very similar performance for all considered algorithms in terms of noise
reduction and speech distortion.
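For reference, the magnitude squared coherence used as the preservation criterion can be estimated from two channels (e.g., left- and right-ear signals) by frame-averaging cross- and auto-spectra (a textbook Welch-style estimate, not the paper's algorithm; the frame length and hop are illustrative):

```python
import numpy as np

def msc(x, y, nfft=256, hop=128):
    """Magnitude squared coherence |Sxy|^2 / (Sxx * Syy) per frequency bin,
    with the spectra averaged over overlapping windowed frames."""
    win = np.hanning(nfft)
    Sxy = Sxx = Syy = 0.0
    for i in range(0, min(len(x), len(y)) - nfft + 1, hop):
        X = np.fft.rfft(win * x[i:i + nfft])
        Y = np.fft.rfft(win * y[i:i + nfft])
        Sxy = Sxy + X * np.conj(Y)       # accumulated cross-spectrum
        Sxx = Sxx + np.abs(X) ** 2       # accumulated auto-spectra
        Syy = Syy + np.abs(Y) ** 2
    return np.abs(Sxy) ** 2 / (Sxx * Syy + 1e-12)
```

Identical channels yield MSC near 1 at every bin; for a diffuse noise field the MSC follows a frequency-dependent curve between 0 and 1, which is exactly what the IC-preserving algorithms try to keep at the output.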
Gating Neural Network for Large Vocabulary Audiovisual Speech
Recognition
ABSTRACT
Audio-based automatic speech recognition (A-ASR) systems are affected by noisy
conditions in real-world applications. Adding visual cues to the ASR system is an appealing
alternative to improve the robustness of the system, replicating the audiovisual perception
process used during human interactions. A common problem observed when using audiovisual
automatic speech recognition (AV-ASR) is the drop in performance when speech is clean. In this
case, visual features may not provide complementary information, introducing variability that
negatively affects the performance of the system. The experimental evaluation in this study
clearly demonstrates this problem when we train an audiovisual state-of-the-art hybrid system
with a deep neural network (DNN) and hidden Markov models (HMMs). This study proposes a
framework that addresses this problem, improving, or at least, maintaining the performance
when visual features are used. The proposed approach is a deep learning solution with a gating
layer that diminishes the effect of noisy or uninformative visual features, keeping only useful
information. The framework is implemented with a subset of the audiovisual CRSS-4ENGLISH-
14 corpus which consists of 61 hours of speech from 105 subjects simultaneously collected with
multiple cameras and microphones. The proposed framework is compared with conventional
HMMs with observation models implemented with either a Gaussian mixture model (GMM) or
DNNs. We also compare the system with a multi-stream hidden Markov model (MS-HMM)
system. The experimental evaluation indicates that the proposed framework outperforms
alternative methods under all configurations, showing the robustness of the gating-based
framework for AV-ASR.
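One common form of such a gate (a hedged sketch; the paper's exact layer, dimensions, and training setup are not specified here) computes a sigmoid from both modalities and scales the visual stream elementwise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_av_fusion(audio_feat, visual_feat, Wg, bg):
    """Gate g in (0, 1) is computed from the concatenated audio and visual
    features; multiplying the visual features by g lets the network suppress
    noisy or uninformative visual input (g -> 0) and pass it through when
    it is helpful (g -> 1). Wg and bg are learned parameters."""
    z = np.concatenate([audio_feat, visual_feat])
    g = sigmoid(Wg @ z + bg)   # one gate value per visual feature dimension
    return np.concatenate([audio_feat, g * visual_feat])
```

The audio stream is passed through unchanged, so in clean conditions a closed gate makes the fused representation degrade gracefully to audio-only features.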
Bias-Compensated Informed Sound Source Localization Using Relative
Transfer Functions
ABSTRACT
In this paper, we consider the problem of estimating the target sound direction of arrival
(DoA) for a hearing aid (HA) system, which can connect to a wireless microphone worn by the
talker of interest. The wireless microphone “informs” the HA system about the noise-free target
speech. To estimate the DoA, we consider a maximum-likelihood approach, and we assume that
a database of DoA-dependent relative transfer functions (RTFs) has been measured in advance
and is available. The proposed DoA estimator is able to take the available noise-free target
speech, ambient noise characteristics, and the shadowing effect of the user’s head on the received
signals into account, and it supports both monaural and binaural microphone array configurations.
Moreover, we analytically analyze the bias of the proposed estimator and introduce a modified estimator that compensates for this bias. We demonstrate that the proposed method
has lower computational complexity and better performance than recent RTF-based estimators.
Furthermore, to decrease the number of parameters required to be wirelessly exchanged between
the HAs in binaural configurations, we propose an information fusion (IF) strategy, which avoids transmitting microphone signals between the HAs. An important benefit of the proposed IF
strategy is that the number of parameters to be exchanged between the HAs is independent of the
number of HA microphones. Finally, we investigate the performance of variants of the proposed
estimator extensively in different noisy and reverberant situations.
CONTACT: TSYS Center for Research and Development
(TSYS Academic Projects)