WaveNet, which auto-regressively predicts the probability distribution of speech samples, has recently provided a new paradigm for speech synthesis tasks.
Since the behavior of a WaveNet-based speech synthesizer varies with its conditioning vectors, it is very important to design the baseline system structure effectively.
In this talk, I first introduce various types of WaveNet vocoders, from the conventional speech-domain approach to the recently proposed source-filter theory-based approaches.
Then, I explain linear prediction (LP)-based WaveNet speech synthesis, i.e., LP-WaveNet, which overcomes a limitation of source-filter theory-based WaveNet vocoders caused by the mismatch between the speech excitation signal and the vocal tract filter.
While presenting experimental setups and results, I would also like to share some know-how for successfully training the network.
Conv-TasNet is a convolutional neural network architecture for speech separation that directly operates on the raw audio waveform in the time domain. It surpasses previous approaches that perform separation in the time-frequency domain using magnitude masking. Conv-TasNet uses a convolutional encoder to transform short audio segments into representations, estimates separation masks for each source, and uses a convolutional decoder to reconstruct the sources. Experiments show Conv-TasNet achieves substantially better speech separation than ideal time-frequency masking baselines as measured by metrics like SI-SNRi and SDRi. The best performing model uses a small filter length, increasing numbers of encoder/decoder filters, and deep convolutional layers.
MFCCs were the standard feature for automatic speech recognition systems using HMM classifiers. MFCCs work by framing an audio signal, calculating the power spectrum of each frame, applying a Mel filterbank to group frequencies, taking the logarithm of the filterbank energies, and computing the DCT to decorrelate the features. The Mel scale relates perceived pitch to actual frequency in a way that matches human hearing. MFCCs were effective for GMM-HMM systems and helped speech recognition performance by representing audio signals in a way aligned with human perception.
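As an illustration of the pipeline just described, here is a minimal numpy/scipy sketch of MFCC extraction; the frame length, hop, filterbank size, and coefficient count are illustrative choices, not values taken from the document.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13):
    # 1) Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Triangular Mel filterbank, evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # 4) Log filterbank energies, 5) DCT to decorrelate.
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```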
These are slides used for an invited tutorial on "end-to-end text-to-speech synthesis", given at the IEICE SP workshop held on 27 Jan 2019.
Part 1: Neural waveform modeling
Presenters: Xin Wang, Yusuke Yasuda (National Institute of Informatics, Japan)
Text-Independent Speaker Recognition System (Deepesh Lekhak)
This document outlines a project to develop a text-independent speaker recognition system. It lists the project members and provides an overview of the presentation sections, which include the system architecture, methodology, results and analysis, and applications. The methodology section describes implementing the system in MATLAB, including voice capturing, pre-processing, MFCC feature extraction, GMM matching, and identification/verification. It also outlines implementing the system on an FPGA, including analog conversion, storage, framing, FFT, mel spectrum, MFCC extraction, and UART transmission to MATLAB for further processing. The results show over 99% recognition accuracy with longer training and test data.
Conformer, Gulati, Anmol, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." arXiv preprint arXiv:2005.08100 (2020). review by June-Woo Kim
Introductory Lecture to Audio Signal Processing (Angelo Salatino)
The document provides an introduction to audio signal processing and related topics. It discusses analog and digital audio signals, the waveform audio file format (WAV) specification including its header structure, and tools for audio processing like FFmpeg and MATLAB. Example code is given to read header metadata and audio samples from a WAV file in C++. While useful for understanding audio formats and processing, the solution contains an error and FFmpeg is noted as a better library for audio tasks.
The document discusses research issues in speech processing. It covers topics like speech production, speech processing tasks, speech measurements, speech signal components, automatic speech recognition, speaker recognition, text-to-speech systems, speech coding, and a proposed speech-assisted translation corrector system. The key challenges in speech processing research are modeling the human auditory system, developing large multilingual speech databases, and generating natural sounding synthetic speech.
This document provides an overview of end-to-end text-to-speech synthesis models including Char2Wav, Tacotron, Tacotron2, and the development of a Japanese Tacotron model. It describes the typical encoder-decoder architecture with attention, improvements made in subsequent models like Tacotron2 using larger models and more regularization, and the implementation of a Japanese Tacotron to model Japanese pitch accents using an accent embedding and self-attention.
This document discusses speech signal processing and speech recognition. It begins by defining speech processing and its relationship to digital signal processing. It then outlines several disciplines related to speech processing including signal processing, physics, pattern recognition, and computer science. The document discusses aspects of speech signals including phonemes, the speech waveform, and spectral envelope. It covers various aspects of speech processing including pre-processing, feature extraction, and recognition. It provides details on techniques for pre-processing, feature extraction including filtering, linear predictive coding, and cepstrum. Finally, it summarizes the main steps in a speech recognition procedure including endpoint detection, framing and windowing, feature extraction, and distortion measure calculations for recognition.
1. Recurrent neural networks can model sequential data like time series by incorporating hidden state that has internal dynamics. This allows the model to store information for long periods of time.
2. Two key types of recurrent networks are linear dynamical systems and hidden Markov models. Long short-term memory networks were developed to address the problem of exploding or vanishing gradients in training traditional recurrent networks.
3. Recurrent networks can learn tasks like binary addition by recognizing patterns in the inputs over time rather than relying on fixed architectures like feedforward networks. They have been successfully applied to handwriting recognition.
Tutorial on neural vocoders at the 2021 Speech Processing Courses in Crete, "Inclusive Neural Speech Synthesis."
Presenters: Xin Wang and Junichi Yamagishi, National Institute of Informatics, Japan
Introduction to Natural Language Processing (NLP) (Alia Hamwi)
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
This document summarizes research on applying speech enhancement techniques including spectral subtraction and Wiener filtering. The goals were to examine and simulate these techniques in Matlab. The techniques were tested on speech degraded by additive noise at different signal-to-noise ratios. Spectral subtraction removes noise by subtracting noise spectrum estimates from the degraded speech spectrum. Wiener filtering suppresses noise by multiplying the speech spectrum by a frequency response. Both techniques performed similarly at low noise, but Wiener filtering performed better at higher noise levels. Future work could include automatic noise detection and adaptation to changing noise.
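For concreteness, here is a minimal sketch of magnitude spectral subtraction along the lines described above, assuming the first fraction of a second of the recording is noise-only; the frame size, noise-estimation window, and spectral floor are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr=8000, noise_dur=0.25, floor=0.002):
    f, t, X = stft(noisy, fs=sr, nperseg=256)
    hop = 256 // 2                                   # default 50% overlap
    n_noise = max(1, int(noise_dur * sr / hop))      # frames assumed noise-only
    noise_mag = np.abs(X[:, :n_noise]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(X) - noise_mag,          # subtract noise estimate
                     floor * np.abs(X))              # floor limits musical noise
    _, enhanced = istft(mag * np.exp(1j * np.angle(X)), fs=sr, nperseg=256)
    return enhanced
```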
RNN & LSTM: Neural Networks for Sequential Data (Yao-Chieh Hu)
Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks can process sequential data like text and time series data. RNNs have memory and can perform the same task for every element in a sequence, but struggle with long-term dependencies. LSTMs address this issue using memory cells and gates that allow them to learn long-term dependencies. LSTMs have four interacting layers - a forget gate, input gate, cell state, and output gate that allow them to store and access information over long periods of time. RNNs and LSTMs are applied to tasks like language modeling, machine translation, speech recognition, and image caption generation.
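A minimal numpy sketch of a single LSTM step, showing the four interacting parts mentioned above; the stacked-parameter layout is one common convention, not necessarily the one used in the summarized slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4H, D), U: (4H, H), b: (4H,) -- the four gates' parameters stacked
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates squashed to (0, 1)
    c = f * c_prev + i * np.tanh(g)                # forget old memory, write new
    h = o * np.tanh(c)                             # output gate exposes cell state
    return h, c
```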
Conditional random fields (CRFs) are probabilistic models for segmenting and labeling sequence data. CRFs address limitations of previous models like hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs). CRFs allow incorporation of arbitrary, overlapping features of the observation sequence and label dependencies. Parameters are estimated to maximize the conditional log-likelihood using iterative scaling or tracking partial feature expectations. Experiments show CRFs outperform HMMs and MEMMs on synthetic and real-world tasks by addressing label bias problems and modeling dependencies beyond the previous label.
Natural Language Processing (NLP) is a field of computer science concerned with interactions between computers and human languages. NLP involves understanding written or spoken language at various levels such as morphology, syntax, semantics, and pragmatics. The goal of NLP is to allow computers to understand, generate, and translate between different human languages.
How pixel CNN and Pixel RNN is used to create WaveNet.
WaveNet is a audio processing Neural Network developed by Google and is the core technology behind Google Duplex
Diffusion Models Beat GANs on Image Synthesis (BeerenSahu)
Diffusion models have recently been shown to produce higher quality images than GANs while also offering better diversity and being easier to scale and train. Specifically, a 2021 paper by OpenAI demonstrated that a diffusion model achieved an FID score of 2.97 on ImageNet 128x128, beating the previous state-of-the-art held by BigGAN. Diffusion models work by gradually adding noise to images in a forward process and then learning to remove noise in a backward denoising process, allowing them to generate diverse, high fidelity images.
This document proposes a speaker-dependent WaveNet vocoder to generate high-quality speech from acoustic features. It uses a WaveNet model conditioned on mel-cepstral coefficients and fundamental frequency to directly model the relationship between acoustic features and speech waveforms. Evaluation shows the proposed method improves sound quality over traditional vocoders, as measured by objective metrics and subjective listening tests. Future work will apply this approach to other tasks and make the model independent of individual speakers.
This document discusses speaker recognition using Mel Frequency Cepstral Coefficients (MFCC). It describes the process of feature extraction using MFCC which involves framing the speech signal, taking the Fourier transform of each frame, warping the frequencies using the mel scale, taking the logs of the powers at each mel frequency, and converting to cepstral coefficients. It then discusses feature matching techniques like vector quantization which clusters reference speaker features to create codebooks for comparison to unknown speakers. The document provides references for further reading on speech and speaker recognition techniques.
This document provides an overview of speech recognition algorithms and models. It describes preprocessing techniques like pre-emphasis and MFCC extraction. Modeling methods covered include GMM, HMM, vector quantization using LBG and k-means, and distance measures for pattern matching. Physiological aspects of speech production and representations in time and frequency domains are also summarized. The document aims to give a comprehensive overview of the multidisciplinary field of speech recognition from DSP techniques to applications.
This document summarizes a research paper on pitch detection of speech synthesis using MATLAB. It discusses using an adaptable filter and peak-valley decision method to determine pitch marks for speech synthesis. Low-pass filtering and autocorrelation are used to detect pitch periods. An adaptive filter is designed to flatten spectral peaks. Peak and valley costs are calculated over each pitch period to determine pitch marks. Dynamic programming is then used to obtain the optimal pitch mark locations for high quality speech synthesis.
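A minimal sketch of the low-pass plus autocorrelation stage of pitch detection described above (the adaptive filtering, peak-valley costs, and dynamic programming of the paper are omitted); the cutoff and pitch-range values are illustrative, and the frame is assumed to span at least a few pitch periods.

```python
import numpy as np
from scipy.signal import butter, lfilter

def detect_pitch(frame, sr=16000, fmin=60.0, fmax=400.0):
    # Low-pass to keep the low harmonics that carry the pitch period.
    b, a = butter(4, 900.0 / (sr / 2))
    x = lfilter(b, a, frame - np.mean(frame))
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # plausible pitch-lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                             # pitch estimate in Hz
```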
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches the state-of-the-art performance of both causal and non-causal methods while working directly on the raw waveform.
Index Terms: speech enhancement, speech denoising, neural networks, raw waveform
The document summarizes Yi-Chiao Wu's research on neural-based speech generation models for voice conversion. It outlines Wu's work developing a baseline voice conversion system using a neural vocoder, improving the robustness by addressing issues like collapsed speech, and making the system more flexible. It also describes Wu's work on quasi-periodic neural vocoders like WaveNet and Parallel WaveGAN that incorporate pitch information to better model speech signals. The document provides an overview of Wu's past and ongoing research applying neural vocoders to tasks like voice conversion and speech enhancement.
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea... (IRJET Journal)
This document presents a modified least mean square (LMS) algorithm to reduce noise in real-time speech signals. The proposed approach modifies the standard LMS algorithm by incorporating a Wiener filter. Experiments are conducted on speech samples from the NOIZEUS database with various types of noise at different signal-to-noise ratios. Objective measures like segmental SNR, log likelihood ratio, Itakura-Saito spectral distance, and cepstrum are used to evaluate the performance of the proposed algorithm compared to the standard LMS algorithm. The results show that the modified LMS algorithm with Wiener filter outperforms the standard LMS algorithm in enhancing the quality of noisy speech signals based on the objective measure values.
This document discusses speech compression using linear predictive coding (LPC). It begins with the objectives of developing low bit-rate speech coders for cellular networks. It then introduces LPC and how it models the human vocal tract. The key aspects of LPC encoding and decoding are described, including analysis, synthesis, and the Levinson-Durbin algorithm. Simulation results on compressing male and female speech are presented, showing compression ratios and signal-to-noise ratios. The document concludes that LPC is well-suited for secure telephone systems by preserving the meaning of speech at low bit rates.
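A minimal numpy sketch of the Levinson-Durbin recursion mentioned above, which solves the LP normal equations recursively from a frame's autocorrelation; this follows the common A(z) = 1 + sum_i a[i] z^-i convention and is not the paper's exact implementation.

```python
import numpy as np

def levinson_durbin(r, order):
    # r: autocorrelation r[0..order]; returns a with A(z) = 1 + sum_i a[i] z^-i
    # (the predictor coefficients of s_n ~= sum_i alpha_i s_{n-i} are alpha_i = -a[i])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + a[1:m] @ r[m - 1:0:-1]) / err   # reflection coefficient
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]          # update lower-order coeffs
        a[m] = k
        err *= 1.0 - k * k                           # prediction-error energy
    return a
```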
This document proposes a noise reduction method for audio signals based on an LMS adaptive filter. It segments the noisy audio signal into frames and uses an NLMS adaptive algorithm to estimate the filter coefficients and minimize the mean square error between the clean signal and filter output. Simulation results show the proposed method significantly reduces noise and improves the signal to noise ratio by adaptively filtering the noisy audio signal in the time domain. Analysis of the output signal variance indicates the noise level is substantially decreased compared to the original noisy signal.
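For illustration, a minimal sketch of a normalized LMS (NLMS) update of the kind described above, assuming a separate noise-reference input; the filter order and step size are illustrative.

```python
import numpy as np

def nlms(desired, reference, order=32, mu=0.5, eps=1e-8):
    # desired: noisy speech; reference: correlated noise-only input
    w = np.zeros(order)
    enhanced = np.zeros(len(desired))
    for n in range(order, len(desired)):
        x = reference[n - order:n][::-1]       # most recent samples first
        y = w @ x                              # adaptive noise estimate
        e = desired[n] - y                     # error signal = enhanced sample
        w += mu * e * x / (x @ x + eps)        # normalized LMS weight update
        enhanced[n] = e
    return enhanced
```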
Deep Learning Based Voice Activity Detection and Speech Enhancement (NAVER Engineering)
The document summarizes speech recognition front-end technologies including voice activity detection (VAD) and speech enhancement. It discusses conventional signal processing based approaches and more recent deep learning based methods. For VAD, it describes adaptive context attention models that can dynamically adjust the context used based on noise type and SNR. For speech enhancement, it proposes a two-step neural network approach consisting of a prior network that makes multiple predictions from noisy features and a post network that combines these using a boosting method to produce enhanced features, allowing end-to-end training without an explicit masking step. The approach aims to better exploit neural network modeling power while reducing computation cost compared to conventional methods or single-step deep learning frameworks.
IJCER (www.ijceronline.com) International Journal of Computational Engineerin... (ijceronline)
This document summarizes research on using linear predictive coding (LPC) and related techniques for speech recognition and compression. Key points discussed include:
1) LPC is used to compress and encode speech signals for transmission by determining a filter to predict samples from past values, minimizing error. Filter coefficients are encoded and decoded.
2) LPC and PARCOR parameters can characterize phonemes and have potential for speech recognition by analyzing short frames of speech. Recognition rates of 65% for vowels and 94% for consonants were achieved.
3) An LPC-based speech coding system was implemented and tested for mobile radio communications, achieving a bit error rate performance suitable for speech transmission.
Performance Evaluation of Conventional and Hybrid Feature Extractions Using M... (IJERA Editor)
Speech feature extraction and likelihood evaluation are considered the main issues in speech recognition systems. Although both techniques have been developed and improved, they remain among the most active areas of research. This paper investigates the performance of the conventional and hybrid speech feature extraction algorithms Mel Frequency Cepstrum Coefficient (MFCC), Linear Prediction Cepstrum Coefficient (LPCC), perceptual linear prediction (PLP), and RASTA-PLP using a multivariate Hidden Markov Model (HMM) classifier. The performance of the speech recognition system is evaluated based on word error rate (WER), given for different data sets of human voice using the isolated-speech TIDIGIT corpus sampled at 8 kHz. The data includes pronunciations of eleven words (zero to nine, plus "oh") recorded from 208 different adult speakers (men and women), with each person uttering each word twice.
Translatotron, Jia, Ye, et al. "Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model." Proc. Interspeech 2019 (2019): 1123-1127. review by June-Woo Kim
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas... (IRJET Journal)
This document presents a speech enhancement method based on spectral subtraction involving the magnitude and phase components. The proposed method aims to improve noisy speech quality by estimating and subtracting noise from the noisy speech signal in the frequency domain. It involves segmenting the noisy speech into overlapping frames, estimating the noise spectrum during non-speech periods, subtracting the noise magnitude from the noisy speech magnitude, and reconstructing the enhanced speech using the inverse FFT and overlap-add processing. The method was tested on different noise types using MATLAB simulations. Results showed the proposed method achieved better noise reduction compared to conventional spectral subtraction while introducing less speech distortion.
COLEA: A MATLAB Tool for Speech Analysis (Rushin Shah)
This document describes COLEA, a MATLAB tool for speech analysis. COLEA allows users to analyze speech signals in both the time and frequency domains. It provides various speech analysis features including waveform display, spectrogram display, formant analysis, and comparison of waveforms using distance measures like SNR, cepstrum, weighted cepstrum, Itakura-Saito, and weighted likelihood ratio. The document explains how to install and use COLEA's various analysis functions, such as linear predictive coding analysis, FFT analysis, and segmentation and filtering of speech signals.
This document summarizes a research paper on speech enhancement using the signal subspace algorithm. It begins with an abstract describing how noise degrades speech quality and intelligibility in communication systems. It then provides background on speech enhancement objectives and commonly used methods like spectral subtraction and signal subspace. The paper describes the signal subspace algorithm and shows its ability to enhance speech signals by suppressing noise. Experimental results on sine waves with added Gaussian noise demonstrate improved peak signal-to-noise ratios when using the signal subspace method compared to the noisy signals. The conclusion is that the algorithm removes noise to a great extent from noisy speech.
WaveNet is a generative model for raw audio developed by Google DeepMind that uses dilated causal convolutions to capture long-term dependencies in audio. It can generate speech that is subjectively natural and can be conditioned on speaker identity. The model was also shown to generate music from specific instruments and genres with varying levels of success depending on the conditioning information provided. Key aspects of the model include causal dilated convolutions to exponentially increase the receptive field, gated activation units, and residual and skip connections.
This document summarizes a speech recognition system (SRS). SRS uses speech identification and verification. Speech identification determines which registered speaker provided an utterance by extracting features like mel-frequency cepstrum coefficients and comparing them. Speech verification accepts or rejects an identity claim by clustering training vectors from an enrollment session into speaker-specific codebooks using vector quantization. Applications of SRS include banking by phone, voice dialing, voice mail, and security control.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R... (IJERA Editor)
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, because speech is more sparse in the LPC domain. The DCT of the speech is taken, and DCT points of the sparse speech are discarded arbitrarily. This is achieved by setting some points in the DCT domain to zero by multiplying with mask functions. From the incomplete points in the DCT domain, the original speech is reconstructed using compressive sensing, with Gradient Projection for Sparse Reconstruction as the tool. The performance of the result is compared subjectively with direct IDCT. The experiments show better performance for compressive sensing than for the direct DCT approach.
2. PROFILE
Min-Jae Hwang
• Affiliation : 7th semester of MS/Ph.D. student
DSP & AI Lab., Yonsei Univ., Seoul, Korea
• Adviser : Prof. Hong-Goo Kang
• E-mail : hmj234@dsp.yonsei.ac.kr
Work experience
• 2017.12 ~ 2017.12 – Intern at NAVER corporation
• 2018.01 ~ 2018.11 – Intern at Microsoft Research Asia
Research area
• 2015.8 ~ 2016.12 - Audio watermarking
• M. Hwang, J. Lee, M. Lee, and H. Kang, "Performance improvement of spread spectrum-based audio watermarking algorithms using a pre-analysis method" (in Korean), in Proc. the 33rd Conference on Speech Communication and Signal Processing of the Acoustical Society of Korea, 2016
• M. Hwang, J. Lee, M. Lee, and H. Kang, “SVD-based adaptive QIM watermarking on stereo audio signals,” in IEEE Transactions on Multimedia, 2017
• 2017.1 ~ Present - Deep learning-based statistical parametric speech synthesis
• M. Hwang, E. Song, K. Byun, and H. Kang, "Modeling-by-generation structure-based noise compensation algorithm for glottal vocoding speech synthesis system," in ICASSP, 2018
• M. Hwang, E. Song, J. Kim, and H. Kang, "A unified framework for the generation of glottal signals in deep learning-based parametric speech synthesis systems," in Interspeech, 2018
• M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: linear prediction-based WaveNet speech synthesis,” in ICASSP, 2019 [Submitted]
3. CONTENTS
Introduction
• Story of WaveNet vocoder
Various types of WaveNet vocoders
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
• Proposed LP-WaveNet
Tuning of WaveNet
• Noise injection
• Waveform generation methods
Experiments
Conclusion & Summary
4. INTRODUCTION
Story of WaveNet vocoder
[Timeline: traditional text-to-speech (~2015) → WaveNet (2016) → WaveNet vocoder (2017) → MDN-WaveNet (2017) → ExcitNet (2018) → LP-WaveNet (2019)]
5. CLASSICAL TTS SYSTEMS
Text-to-speech
Unit-selection speech synthesis [1]
• Concatenate tiny segments of real speech signal
• Pros) High quality
• Cons) Low controllability, large database
Statistical parametric speech synthesis (SPSS) [2]
• Parameterize speech signal using vocoder technique
• Pros) High controllability, small database
• Cons) Low quality
6. GENERATIVE MODEL-BASED SPEECH SYNTHESIS
WaveNet [3]
• First generative model for raw audio waveform
• Predict the probability distribution of waveform sample auto-regressively
• Generate high quality of audio / speech signal
• Impact on the task of TTS, voice conversion, music synthesis, etc.
WaveNet in TTS task
$$p(\mathbf{x}) = \prod_{n=1}^{N} p(x_n \mid \mathbf{x}_{<n})$$
• Utilize linguistic features and the F0 contour as conditional information
• Present higher quality than conventional TTS systems
[Mean opinion scores] [WaveNet speech synthesis]
7. WAVENET VOCODER-BASED SPEECH SYNTHESIS
Utilize WaveNet as parametric vocoder [4]
• Use acoustic features as conditional information
Advantages
• Higher quality synthesized speech than the conventional vocoders
• Don’t require hand-engineered processing pipeline
• Higher controllability than the case of linguistic features
• Controlling acoustic features
• Higher training efficiency than the case of linguistic features
• Linguistic feature: 25~35 hour database
• Acoustic feature: 1 hour database
[WaveNet vocoder-based parametric speech synthesis]
[Audio samples: conventional vocoder, WaveNet vocoder, recorded]
8. VARIOUS TYPES OF WAVENET VOCODERS
1. SoftMax-WaveNet
2. MDN-WaveNet
3. ExcitNet
4. Proposed LP-WaveNet
[Timeline: traditional text-to-speech (~2015) → WaveNet (2016) → WaveNet vocoder (2017) → MDN-WaveNet (2017) → ExcitNet (2018) → LP-WaveNet (2019)]
9. BASIC OF WAVENET
Auto-regressive generative model
• Predict probability distribution of speech samples
• Use past waveform samples as a condition information of WaveNet
Problem: Long-term dependency nature of speech signal
• Speech samples are highly correlated at high sampling rates, e.g. 16,000 Hz
• E.g. 1) At least 160 (= 16000/100) speech samples are needed to correctly represent one period of a 100 Hz voice
• E.g. 2) On average, about 6,000 speech samples are needed to represent a single word¹
Solution
• Put the speech samples through a dilated causal convolution layer
• Stack the dilated causal convolution layers
• Results in an exponentially growing receptive field
$$p(\mathbf{x}) = \prod_{n=1}^{N} p(x_n \mid \mathbf{x}_{n-R:n-1})$$
[Dilated causal convolution]
¹ Statistics based on the average speaking rate of a set of TED talk speakers: http://sixminutes.dlugan.com/speaking-rate/
An effective method for embedding a long receptive field is required
10. BASIC OF WAVENET
WaveNet without dilation
[WaveNet without dilated convolution]
Receptive field: #layers − 1
11. BASIC OF WAVENET
WaveNet with dilation
[WaveNet with dilated convolution]
Receptive field: $2^{\#\mathrm{layers}} - 1$
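A minimal PyTorch sketch (the framework is an assumption; the slides do not name one) of the dilation pattern above: kernel-size-2 causal convolutions with dilations 1, 2, 4, ..., so each added layer roughly doubles the receptive field. Channel width and layer count are illustrative.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, n_layers=10):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers))

    def forward(self, x):                        # x: (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]               # left-pad keeps the conv causal
            x = conv(nn.functional.pad(x, (pad, 0)))
        return x

# Receptive field (current sample plus past context) = sum of dilations + 1:
print(sum(2 ** i for i in range(10)) + 1)        # 1024 = 2**10 samples
```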
12. BASIC OF WAVENET
Multiple stack of residual blocks
1. Dilated causal convolution
• Exponentially increase the receptive field
2. Gated activation
• Impose non-linearity on the model
• Enable conditional WaveNet
3. Residual / skip connections
• Speed up the convergence
• Enable deep layered model training
Gated activation:
$$\mathbf{z} = \tanh(W_f * \mathbf{x} + V_f * \mathbf{h}) \odot \mathrm{sigmoid}(W_g * \mathbf{x} + V_g * \mathbf{h})$$
Dilated causal convolution:
$$y[d, n] = \sum_{k=0}^{M-1} h[k] \cdot x[n - d \cdot k]$$
where $\mathbf{x}$ is the input of the activation, $\mathbf{h}$ is the conditional vector, and $\mathbf{z}$ is the output of the activation.
[Basic WaveNet architecture]
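A minimal PyTorch sketch of one conditional gated residual block implementing the equations above; the channel sizes and the 1x1 conditioning projections are illustrative, and the conditional vector h is assumed to be upsampled to the waveform's time resolution.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    def __init__(self, channels=64, cond_channels=80, dilation=1):
        super().__init__()
        self.filt = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.cond_f = nn.Conv1d(cond_channels, channels, 1)  # V_f * h
        self.cond_g = nn.Conv1d(cond_channels, channels, 1)  # V_g * h
        self.res = nn.Conv1d(channels, channels, 1)          # residual 1x1 conv
        self.skip = nn.Conv1d(channels, channels, 1)         # skip 1x1 conv

    def forward(self, x, h):
        # x: (batch, channels, T); h: condition upsampled to (batch, cond_channels, T)
        pad = self.filt.dilation[0]
        xp = nn.functional.pad(x, (pad, 0))                  # left-pad: causal
        z = torch.tanh(self.filt(xp) + self.cond_f(h)) * \
            torch.sigmoid(self.gate(xp) + self.cond_g(h))
        return x + self.res(z), self.skip(z)                 # residual out, skip out
```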
13. SOFTMAX-WAVENET
Use WaveNet as a multinomial logistic regression model [4]
• 𝜇-law companding for evenly distributed speech samples:
  y = sign(x) · ln(1 + 𝜇|x|) / ln(1 + 𝜇), 𝜇 = 255
• 8-bit one-hot encoding: p = OneHot_{8bit}(y)
• 256 symbols
Estimate each symbol using WaveNet
• Predict the sample by a SoftMax distribution:
  z^q = WaveNet(x_{<n}, h), q_n = exp(z_n^q) / ∑_i exp(z_i^q)
Optimize the network with the cross-entropy (CE) loss
• Minimize the probabilistic distance between p_n and q_n:
  L = −∑_n p_n log q_n
[Distribution of speech samples] [SoftMax-WaveNet]
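A small sketch of the 𝜇-law companding and 8-bit quantization step described above, assuming waveform samples normalized to [−1, 1]:

```python
import numpy as np

MU = 255  # 8-bit mu-law

def mu_law_encode(x, mu=MU):
    """Compand to [-1, 1], then quantize to 256 symbols (0..255)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(sym, mu=MU):
    """Invert the quantization and the companding."""
    y = 2 * (sym.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 5)
x_rec = mu_law_decode(mu_law_encode(x))
# The round-trip error stays within roughly one (amplitude-dependent)
# quantization step, which is what makes 8-bit modeling feasible.
```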
14. SOFTMAX-WAVENET
Limitation caused by 8-bit quantization
• Noisy synthetic speech due to an insufficient number of quantization bits
Intuitive solution
• Expand the SoftMax dimension to 65,536, corresponding to 16-bit quantization
  → High computational cost & difficult to train
Mixture density network (MDN)-based solution [5]
• Train the WaveNet to predict the parameters of a pre-defined speech distribution
15. MDN-WAVENET
Define the distribution of the waveform sample in a parameterized form [5]
• Discretized mixture of logistics (MoL) distribution
• Discretized logistic mixture with 16-bit quantization (Δ = 1/2^15):
  p(x_n | x_{<n}, h) = ∑_{i=1}^{N} π_i [σ((x_n + Δ/2 − μ_i)/s_i) − σ((x_n − Δ/2 − μ_i)/s_i)]
  (σ: logistic function)
Estimate the mixture parameters by WaveNet:
  [z^π, z^μ, z^s] = WaveNet(x_{<n}, h)
  π = softmax(z^π), for a unity-summed mixture gain
  μ = z^μ
  s = exp(z^s), for a positive mixture scale
Optimize the network with the negative log-likelihood (NLL) loss:
  L = −∑_n log p(x_n | x_{<n}, h)
[MDN-WaveNet]
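The following sketch evaluates the discretized MoL likelihood for a single sample, as a way to check the formula above; the mixture parameters are illustrative toy values:

```python
import numpy as np

DELTA = 1.0 / 2 ** 15   # 16-bit quantization step on [-1, 1]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discretized_mol_prob(x, pi, mu, s, delta=DELTA):
    """p(x) = sum_i pi_i * [sigma((x + d/2 - mu_i)/s_i) - sigma((x - d/2 - mu_i)/s_i)]"""
    upper = sigmoid((x + delta / 2 - mu) / s)
    lower = sigmoid((x - delta / 2 - mu) / s)
    return np.sum(pi * (upper - lower))

# Toy 2-component mixture and the resulting NLL of one observation.
pi = np.array([0.7, 0.3])
mu = np.array([0.0, 0.2])
s = np.array([0.05, 0.1])
nll = -np.log(discretized_mol_prob(0.1, pi, mu, s))
```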
16. MDN-WAVENET
Higher quality than SoftMax-WaveNet
• Enables modeling the speech signal with 16 bits
• Overcomes the difficulty of spectrum modeling
Difficulty of WaveNet training
• Due to the quantization depth increased from 8 bits to 16 bits
Solution based on the human speech production model [6]
• Model the vocal source signal, whose physical behavior is much simpler than the speech signal
[Spectrogram comparison] [Loss comparison; audio samples: SoftMax-WaveNet / MDN-WaveNet / recorded]
17. SPEECH PRODUCTION MODEL
Concept
• Model speech as the output of a vocal source signal filtered by the vocal tract:
  S(z) = [G(z) · V(z) · R(z)] · E(z)
  Speech = [vocal fold × vocal tract × lip radiation] × excitation
Methodology: linear prediction (LP) approach
• Define the speech signal as a linear combination of past speech samples:
  s_n = ∑_{i=1}^{P} α_i s_{n−i} + e_n
  S(z) = H(z) · E(z), where H(z) = 1 / (1 − ∑_{i=1}^{P} α_i z^{−i})
• Spectrum part = LP coefficients
• Excitation part = residual signal of the LP analysis
[Speech production model] [Spectral deconvolution by LP analysis]
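A minimal sketch of LP analysis and synthesis filtering with NumPy/SciPy, using the autocorrelation method to estimate the coefficients; the order and single-frame handling are simplifying assumptions (real vocoders update α frame by frame):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(frame, order):
    """Autocorrelation method: solve the normal equations R a = r."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

def lp_analysis(s, alpha):
    """Residual (excitation): e_n = s_n - sum_i alpha_i * s_{n-i}."""
    return lfilter(np.concatenate([[1.0], -alpha]), [1.0], s)

def lp_synthesis(e, alpha):
    """Inverse operation: s_n = sum_i alpha_i * s_{n-i} + e_n."""
    return lfilter([1.0], np.concatenate([[1.0], -alpha]), e)

s = np.random.randn(400)              # one toy frame
alpha = lp_coefficients(s, order=16)
e = lp_analysis(s, alpha)             # excitation signal
s_rec = lp_synthesis(e, alpha)        # reconstructs s up to float error
```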
18. EXCITNET
Model the excitation signal by WaveNet instead of the speech signal [6]
Training stage
• Extract the excitation signal by linear prediction (LP) analysis:
  e_n = s_n − ∑_{i=1}^{P} α_i s_{n−i}
• The filter coefficients are periodically updated, matched to the frame rate of the acoustic features
• Train WaveNet to model the excitation signal
Synthesis stage
• Generate the excitation signal with WaveNet
• Synthesize the speech signal by LP synthesis filtering:
  s_n = ∑_{i=1}^{P} α_i s_{n−i} + e_n
[ExcitNet; audio samples: recorded / excitation / LP synthesis]
19. EXCITNET
High-quality synthesized speech even though 8-bit quantization is used
• Overcomes the difficulty of spectrum modeling by using an external spectral shaping filter
Limitation
• Independent modeling of the excitation and spectrum parts results in unpredictable prediction error:
  S(z) = 1 / (1 − ∑_{i=1}^{P} α_i z^{−i}) · E(z), where both {α_i} and E(z) contain errors
Objective
• Represent the LP synthesis process during WaveNet training / generation
[Audio samples: recorded / SoftMax-WaveNet / ExcitNet]
20. LP-WAVENET
Motivation from the assumptions of the WaveNet vocoder
1. The previous speech samples, x_{<n}, are given
2. The LP coefficients, {α_i}, are given
→ Their linear combination, x̂_n = ∑_{i=1}^{P} α_i x_{n−i}, is also given
Probabilistic analysis
• Since x_n = e_n + x̂_n, the random variables X_n and E_n differ only by the constant x̂_n:
  X_n | (x_{<n}, h) = E_n | (x_{<n}, h) + x̂_n
Assume the speech sample follows a discretized MoL distribution
• By the shifting property of the random variable, only the mean parameters differ between
  p(x_n | x_{<n}, h) and p(e_n | x_{<n}, h):
  μ_i^x = μ_i^e + x̂_n, s_i^x = s_i^e, π_i^x = π_i^e
[Difference of distributions between speech and excitation]
21. LP-WAVENET
Utilize the causality of WaveNet and the linearity of the LP synthesis process [7]
1. Mixture parameter prediction:
  [z^π, z^μ, z^s] = WaveNet(x_{<n}, h)
2. Mixture parameter refinement:
  π = softmax(z^π), μ = z^μ + x̂_n, s = exp(z^s)
3. Discretized MoL loss calculation:
  p(x_n | x_{<n}, h) = ∑_{i=1}^{N} π_i [σ((x_n + Δ/2 − μ_i)/s_i) − σ((x_n − Δ/2 − μ_i)/s_i)]
  E = −∑_n log p(x_n | x_{<n}, h)
[LP-WaveNet; audio samples: recorded / ExcitNet / LP-WaveNet]
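A sketch of the refinement step that distinguishes LP-WaveNet from plain MDN-WaveNet: the raw mean outputs are shifted by the LP prediction x̂_n before the MoL loss is evaluated. The network outputs and LP values below are placeholder toy numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def refine_parameters(z_pi, z_mu, z_s, x_hat):
    """LP-WaveNet refinement: shift the mixture means by the LP
    prediction x_hat = sum_i alpha_i * x_{n-i}; gains and scales
    are handled exactly as in MDN-WaveNet."""
    pi = softmax(z_pi)        # unity-summed mixture gains
    mu = z_mu + x_hat         # excitation means -> speech means
    s = np.exp(z_s)           # positive mixture scales
    return pi, mu, s

# Placeholder raw network outputs for a 2-component mixture.
z_pi, z_mu, z_s = np.zeros(2), np.array([-0.01, 0.02]), np.array([-3.0, -2.5])
alpha = np.array([1.2, -0.4])        # toy LP coefficients
x_prev = np.array([0.11, 0.09])      # x_{n-1}, x_{n-2}
x_hat = alpha @ x_prev
pi, mu, s = refine_parameters(z_pi, z_mu, z_s, x_hat)
```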
22. TUNING OF WAVENET
1. Solution to the waveform divergence problem
2. Waveform generation methods
23. WAVEFORM DIVERGENCE PROBLEM
Waveform divergence problem
• The waveform diverges due to error accumulated during WaveNet's autoregressive generation:
  as x_n ~ p(x_n | x_{<n}) is sampled step by step, the errors stack as err_0, err_0 + err_1, ..., ∑_i err_i
Cause: overfitting to the silence region
• The silence region has a unique solution: x_curr = WaveNet(x_prev, h) becomes 0 = WaveNet(0, h_sil)
• More likely to occur when silence occupies a larger portion of the training set
• Generation is sensitive to tiny errors in the silence region
Solution: noise injection (see the sketch after this slide)
• Inject a negligible amount of noise into the training input:
  x̂ = x + ε·n, where ε = 2/2^16
• Allows only a 1-bit error
• Increases robustness to prediction errors in the silence region
[Error stacking during waveform generation]
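A minimal sketch of the noise-injection trick, assuming 16-bit-normalized waveforms in [−1, 1]; the uniform noise distribution is an assumption, since the slide only fixes the magnitude ε to one quantization step:

```python
import numpy as np

EPS = 2.0 / 2 ** 16   # one 16-bit quantization step on [-1, 1]

def inject_noise(x, rng):
    """Perturb training inputs by at most ~1 LSB so the network
    never sees a perfectly constant silence region."""
    n = rng.uniform(-1.0, 1.0, size=x.shape)   # assumed noise distribution
    return x + EPS * n

rng = np.random.default_rng(0)
x = np.zeros(16000)             # a silent training segment
x_noisy = inject_noise(x, rng)  # silence with ~1-bit perturbation
```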
24. WAVEFORM GENERATION METHODS
Random sampling
  x̂_rand(n) ~ p(x(n) | x(k<n), h), i.e., x̂_rand(n) ~ ∑_{i=1}^{I} π_i · DistLogistic(μ_i, s_i)
Argmax sampling
  x̂_max(n) = argmax_{x(n)} p(x(n) | x(k<n), h)
Greedy sampling [8]
  x̂_greedy(n) = vuv · x̂_max(n) + (1 − vuv) · x̂_rand(n)
Mode sampling [7]
  x̂_mode(n) = vuv · x̂_rand,nar(n) + (1 − vuv) · x̂_rand(n),
  where x̂_rand,nar(n) ~ ∑_{i=1}^{I} π_i · DistLogistic(μ_i, s_i / c)
• Sharper distribution in the voiced region
[Random & argmax sampling] [Scale parameter control in mode sampling]
[Audio samples: recorded / random sampling / argmax sampling / greedy sampling / mode sampling]
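A sketch contrasting the four sampling strategies on a MoL distribution; `vuv` is the per-sample voiced/unvoiced flag, and the narrowing constant `c` is a hypothetical hyper-parameter value:

```python
import numpy as np

def sample_mol(pi, mu, s, rng):
    """Random sampling: pick a component, then draw from its logistic
    via the inverse CDF mu + s * ln(u / (1 - u))."""
    i = rng.choice(len(pi), p=pi)
    u = rng.uniform(1e-5, 1 - 1e-5)
    return mu[i] + s[i] * np.log(u / (1 - u))

def sample_argmax(pi, mu, s):
    """Approximate argmax: mean of the most probable component."""
    return mu[np.argmax(pi)]

def sample_greedy(pi, mu, s, vuv, rng):
    """Greedy [8]: argmax in voiced frames, random in unvoiced frames."""
    return sample_argmax(pi, mu, s) if vuv else sample_mol(pi, mu, s, rng)

def sample_mode(pi, mu, s, vuv, rng, c=2.0):
    """Mode [7]: narrow the scales by c in voiced frames for a sharper
    distribution, plain random sampling otherwise."""
    return sample_mol(pi, mu, s / c, rng) if vuv else sample_mol(pi, mu, s, rng)
```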
28. OBJECTIVE EVALUATION
Performance measurements
• VUV: voicing error rate (%)
• F0 RMSE: F0 root mean square error (Hz) in the voiced region
• LSD: log-spectral distance of the AR spectrum (dB)
• F-LSD: LSD of the synthesized speech in the frequency domain (dB)
Results
[Results table: A/S = analysis / synthesis; SPSS = synthesis from predicted acoustic features]
29. SUBJECTIVE EVALUATION
Mean opinion score (MOS) test
• 20 randomly selected synthesized utterances from the test set
• 12 native Korean listeners
• Includes a STRAIGHT (STR)-based speech synthesis system as a baseline [9]
Results
[Results table: A/S = analysis / synthesis; SPSS = synthesis from predicted acoustic features]
31. SUMMARY & CONCLUSION
Investigated various types of WaveNet vocoders
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
Proposed LP-WaveNet
• Explicitly represents the linear prediction structure of the speech waveform within the WaveNet framework
Introduced tuning methods for WaveNet training / generation
• Noise injection
• Sample generation methods
Performed objective and subjective experiments
• LP-WaveNet achieves faster convergence than conventional WaveNet vocoders while keeping similar speech quality
32. REFERENCES
[1] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. ICASSP, 1996.
[2] H. Zen, et al., "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013.
[3] A. van den Oord, et al., "WaveNet: A generative model for raw audio," in CoRR, 2016.
[4] A. Tamamori, et al., "Speaker-dependent WaveNet vocoder," in Proc. INTERSPEECH, 2017.
[5] A. van den Oord, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in CoRR, 2017.
[6] E. Song, K. Byun, and H.-G. Kang, "A neural excitation vocoder for statistical parametric speech synthesis systems," in arXiv, 2018.
[7] M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, "LP-WaveNet: Linear prediction-based WaveNet speech synthesis," submitted to Proc. ICASSP, 2019.
[8] X. Wang, et al., "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018.
[9] H. Kawahara, "STRAIGHT, exploration of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds," Acoustical Science and Technology, 2006.