This document provides an overview of end-to-end text-to-speech synthesis models, including Char2Wav, Tacotron, and Tacotron2, and the development of a Japanese Tacotron model. It describes the typical encoder-decoder architecture with attention, the improvements made in subsequent models such as Tacotron2 (a larger model and more regularization), and the implementation of a Japanese Tacotron that models Japanese pitch accents using an accent embedding and self-attention.
These are slides for an invited tutorial on "end-to-end text-to-speech synthesis", given at the IEICE SP workshop held on 27 Jan 2019.
Part 2: Tacotron and related end-to-end systems
Presenter: Yusuke Yasuda (National Institute of Informatics, Japan)
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tacotron and related end-to-end systems
1. Tutorial on end-to-end text-to-speech synthesis
Part 2 – Tacotron and related end-to-end systems
安田裕介 (Yusuke YASUDA)
National Institute of Informatics
3. About me
- Name: 安田裕介 (Yusuke Yasuda)
- Status:
  - PhD student at SOKENDAI / NII
  - Programmer at work
  - Former geologist (bachelor's and master's degrees)
- GitHub account: TanUkkii007
4. Table of contents
1. Text-to-speech architecture
2. End-to-end model
3. Example architectures
a. Char2Wav
b. Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation
5. TTS architecture: traditional pipeline
- Typical pipeline architecture for statistical parametric speech synthesis
- Consists of task-specific models:
  - linguistic model
  - alignment (duration) model
  - acoustic model
  - vocoder
6. TTS architecture: End-to-end model
- The end-to-end model directly converts text to waveform
- The end-to-end model does not require intermediate feature extraction
- Pipeline models accumulate errors across predicted features
- The end-to-end model's internal blocks are jointly optimized
7. End-to-end model: Encoder-Decoder with Attention
- Building blocks:
  - Encoder
  - Attention mechanism
  - Decoder
  - Neural vocoder
Conventional end-to-end models may not include a waveform generator, but some recent fully end-to-end models contain a neural waveform generator, e.g. ClariNet [1].
[1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
8. End-to-end model: Decoding with attention mechanism
Time step 1:
- Assign alignment probabilities to the encoded inputs
- Alignment scores can be derived from the attention layer's output and the encoded inputs
9. End-to-end model: Decoding with attention mechanism
Time step 1:
- Calculate the context vector
- The context vector is the sum of the encoded inputs weighted by the alignment probabilities
10. End-to-end model: Decoding with attention mechanism
Time step 1:
- The decoder layer predicts an output from the context vector
11. End-to-end model: Decoding with attention mechanism
Time step 2:
- The previous output is fed back to the next time step
- Assign alignment probabilities to the encoded inputs
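To make the walkthrough above concrete, here is a minimal numpy sketch of one decoding step. The dot-product scoring and all names (`encoder_outputs`, `decode_fn`) are illustrative assumptions, not code from these systems; the scoring functions actually used appear on later slides.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_decoder_step(encoder_outputs, query, prev_output, decode_fn):
    """One decoding step: score -> align -> context -> predict.

    encoder_outputs: (T_in, D) encoded input sequence
    query:           (D,) decoder state used for scoring
    prev_output:     previous prediction, fed back at this step
    decode_fn:       maps (context, prev_output) to the next output
    """
    # 1. Alignment scores for the encoded inputs (dot product for brevity)
    scores = encoder_outputs @ query            # (T_in,)
    # 2. Alignment probabilities
    alignment = softmax(scores)                 # sums to 1 over input positions
    # 3. Context vector: encoded inputs weighted by the alignment
    context = alignment @ encoder_outputs       # (D,)
    # 4. The decoder predicts an output from the context (and the feedback)
    output = decode_fn(context, prev_output)
    return output, alignment
```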
15. Example: Char2Wav
Char2Wav is one of the earliest works on end-to-end TTS. Its simple sequence-to-sequence architecture with attention provides a good proof of concept for end-to-end TTS as a starting point. Char2Wav also uses advanced blocks such as a neural vocoder and a multi-speaker embedding.
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
https://openreview.net/forum?id=B1VWyySKx
17. Char2Wav: Source and target choice
- Input: character or phoneme sequence
- Output: vocoder parameters
- Objective: mean squared error or negative GMM log-likelihood loss
18. Char2Wav: GMM attention
- Proposed by Graves (2013) [3]
- Location-based attention (sketched below):
  - Alignment is based on input location
  - Alignment does not depend on input content
[3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
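A minimal numpy sketch of Graves-style GMM attention [3]; parameter names are illustrative, and the unnormalized mixture parameters are assumed to come from the attention layer at every decoder step:

```python
import numpy as np

def gmm_attention(params, prev_kappa, num_inputs):
    """Location-based GMM attention (after Graves, 2013).

    params:     (K, 3) unnormalized (alpha_hat, beta_hat, kappa_hat)
    prev_kappa: (K,)   component positions from the previous step
    """
    alpha = np.exp(params[:, 0])               # mixture weights
    beta = np.exp(params[:, 1])                # inverse widths
    kappa = prev_kappa + np.exp(params[:, 2])  # positions can only increase
    u = np.arange(num_inputs)                  # input locations 0..T_in-1
    # Weights depend only on input *location*, never on input content
    w = alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)
    return w.sum(axis=0), kappa                # alignment, updated positions
```

Because κ can only increase, the alignment moves monotonically forward, which is exactly the robustness property highlighted on the next slide.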
19. Char2Wav: GMM attention
- Monotonic progress: the mean value μ always increases as time progresses
- Robust to long-sequence prediction
(Figure: visualization of GMM attention alignment from Tacotron predicting vocoder parameters, r=2, at epoch 0 and epoch 130.)
20. Char2Wav: Limitations
- Target features: Char2Wav uses vocoder parameters as the target. Vocoder parameters are a challenging target for a sequence-to-sequence system; e.g., vocoder parameters for 10 s of speech require 2,000 iterations to predict.
- Architecture: as the Tacotron paper demonstrated, a vanilla sequence-to-sequence architecture gives poor alignment.
These limitations were solved by Tacotron.
21. Example: Tacotron
Tacotron is the most influential method among modern architectures because of several cutting-edge techniques.
Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
https://arxiv.org/abs/1703.10135
22. Tacotron: Architecture
- CBHG encoder
- Encoder & decoder pre-net
- Reduction factor
- Post-net
- Additive attention
Figure is from Wang et al. (2017)
23. Tacotron: Source and target choice
- Input: character sequence
- Output: mel and linear spectrograms (50 ms frame length, 12.5 ms frame shift)
  - 2.5× shorter total sequence than with vocoder parameters (12.5 ms vs. the 5 ms shift implied on slide 20: 12.5 / 5 = 2.5)
- Objective: L1 loss
24. Tacotron: Decoder pre-net
- The most important architecture advancement over vanilla seq2seq
- 2 FFN layers + ReLU + dropout (sketched below)
- Crucial for alignment learning
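As a sketch, the pre-net is just two narrow feed-forward layers with ReLU and dropout; the weights here are assumed given, and note that the Tacotron2 paper even keeps pre-net dropout on at inference time:

```python
import numpy as np

def prenet(x, layers, dropout_rate=0.5, rng=np.random):
    """Decoder pre-net: for each (w, b) in `layers`, FFN + ReLU + dropout."""
    for w, b in layers:
        x = np.maximum(0.0, x @ w + b)               # FFN + ReLU
        keep = rng.random(x.shape) >= dropout_rate   # dropout mask
        x = np.where(keep, x / (1.0 - dropout_rate), 0.0)
    return x
```

This noisy information bottleneck on the fed-back frame is what the slide credits as crucial for alignment learning.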
25. Tacotron: Reduction factor
- Predicts multiple frames at a time (see the sketch below)
- Reduction factor r: the number of frames to predict at one step
- Large r →
  - Small number of iterations
  - Easy alignment learning
  - Less training time
  - Less inference time
  - Poor quality
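A sketch of what the reduction factor means for the training targets (the function name is illustrative): with r frames emitted per step, a T-frame spectrogram needs only ceil(T / r) decoder iterations.

```python
import numpy as np

def group_frames(spectrogram, r):
    """Group a (T, n_mels) target into (ceil(T / r), r * n_mels) chunks,
    so the decoder predicts r frames at each step."""
    t, d = spectrogram.shape
    pad = (-t) % r                                    # pad T to a multiple of r
    padded = np.pad(spectrogram, ((0, pad), (0, 0)))  # zero-pad the tail
    return padded.reshape(-1, r * d)
```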
26. Tacotron: Additive attention
- Proposed by Bahdanau et al. (2014) [5]
- Content-based attention (sketched below):
  - The distance between source and target is learned by an FFN
  - No structural constraint
(Figure: alignment at epochs 0, 1, 3, 7, 10, 50, and 100.)
[5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
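A numpy sketch of the additive-attention scoring of [5] (weight shapes are assumptions: W is (Dq, A), V is (Dk, A), v is (A,)):

```python
import numpy as np

def additive_attention(query, keys, W, V, v):
    """e_i = v^T tanh(W q + V h_i); softmax over input positions.

    The source-target compatibility is learned by a small FFN, and
    nothing constrains the alignment structure: it may jump anywhere,
    which is why early epochs look scattered in the figure above.
    """
    e = np.tanh(query @ W + keys @ V) @ v   # (T_in,) scores
    e = e - e.max()                         # numerical stability
    return np.exp(e) / np.exp(e).sum()      # alignment probabilities
```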
27. Tacotron: Limitations
- Training time is too long: > 2M steps (mini-batches) to converge
- Waveform generation: Griffin-Lim is handy but hurts waveform quality
- Stop condition: Tacotron predicts a fixed-length spectrogram, which is inefficient at both training and inference time
28. Example: Tacotron2
Tacotron2 is a surprising method that achieved human-level quality of synthesized speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for high-quality waveform generation.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783
https://arxiv.org/abs/1712.05884
29. Tacotron2: Architecture
- CNN layers instead of CBHG at the encoder and post-net
- Stop token prediction (sketched below)
- Post-net improves the predicted mel spectrogram
- Location-sensitive attention
- No reduction factor
Figure is from Shen et al. (2018)
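Stop-token prediction is essentially a scalar sigmoid on the decoder output; a sketch (the projection weights and threshold are illustrative assumptions):

```python
import numpy as np

def should_stop(decoder_output, w_stop, b_stop, threshold=0.5):
    """Predict whether generation should end at this step, instead of
    decoding a fixed-length spectrogram as in the original Tacotron."""
    p_stop = 1.0 / (1.0 + np.exp(-(decoder_output @ w_stop + b_stop)))
    return p_stop > threshold
```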
30. Tacotron2: Trends of architecture changes
1. Larger model size:
- character embedding: 256 → 512
- encoder: 256 → 512
- decoder pre-net: (256, 128) → (256, 256)
- decoder: 256 → 1024
- attention: 256 → 128
- post-net: 256 → 512
2. More regularization:
- The encoder applies dropout at every layer; the original Tacotron applies dropout only in the encoder pre-net
- Zoneout regularization [7] for the encoder bidirectional LSTM and the decoder LSTM (sketched below)
- The post-net applies dropout at every layer
- L2 regularization
The CBHG module is replaced by CNN layers at the encoder and post-net, probably due to limited regularization applicability. A large model needs heavy regularization to prevent overfitting.
[7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
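Zoneout [7] is easy to state in code; a minimal sketch for one recurrent step:

```python
import numpy as np

def zoneout(h_prev, h_new, rate=0.1, training=True, rng=np.random):
    """Instead of zeroing units like dropout, zoneout randomly *preserves*
    each unit's previous hidden state with probability `rate`."""
    if not training:
        # At test time, use the expectation of the stochastic update
        return rate * h_prev + (1.0 - rate) * h_new
    keep_old = rng.random(h_new.shape) < rate
    return np.where(keep_old, h_prev, h_new)
```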
31. Tacotron2: Source and target choice
- Input: character sequence
- Output: mel spectrogram
- Objective: mean squared error
32. Tacotron2: Waveform synthesis with WaveNet
- Upsampling rate: 200
- Mel vs. linear: no difference
- Human-level quality was achieved by WaveNet trained with ground-truth-aligned predicted spectrograms
33. Tacotron2: Location sensitive attention
- Proposed by Chorowski et al. (2015) [8]
- Utilizes both input content and input location (sketched below)
[8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
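A sketch of location-sensitive scoring [8]: it is additive attention plus features obtained by convolving the previous alignment. The filter bank `F` (shape (n_filters, filter_len)) and the extra projection `U` (shape (n_filters, A)) are shape assumptions for illustration:

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_alignment, W, V, U, F, v):
    """e_i = v^T tanh(W q + V h_i + U f_i), where f = F * prev_alignment.

    Scoring now uses both input content (keys) and input location
    (features of the previous alignment), stabilizing the alignment.
    """
    # Location features: 1-D convolutions over the previous alignment
    loc = np.stack([np.convolve(prev_alignment, f, mode="same") for f in F],
                   axis=1)                        # (T_in, n_filters)
    e = np.tanh(query @ W + keys @ V + loc @ U) @ v
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()
```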
34. Tacotron2: Limitations
- WaveNet training in Tacotron2:
● WaveNet is trained with ground-truth-aligned **predicted** mel spectrograms
● Tacotron2's errors are corrected by WaveNet
● However, this is unproductive: one WaveNet must be trained for each different Tacotron2
- WaveNet training under normal conditions:
● WaveNet is trained with ground-truth spectrograms
● However, its MOS score is still inferior to human speech
35. Our work: Tacotron for the Japanese language
We extended Tacotron to the Japanese language:
● To model Japanese pitch accents, we introduced an accentual-type embedding as well as a phoneme embedding
● To handle long-term dependencies in accents, we use additional components
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018)
https://arxiv.org/abs/1810.11960
37. Japanese Tacotron: Source and target choice
- Source: phoneme and accentual-type sequences (see the embedding sketch below)
- Target:
  - mel spectrogram
  - vocoder parameters
- Objective:
  - L1 loss (mel spectrogram, mel-generalized cepstrum)
  - softmax cross-entropy loss (discretized F0)
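One plausible way to embed the two input streams (a sketch only; the paper's exact wiring of phoneme and accentual-type embeddings may differ, and all table names and shapes here are illustrative):

```python
import numpy as np

def embed_inputs(phoneme_ids, accent_ids, phoneme_table, accent_table):
    """Look up a phoneme embedding and an accentual-type embedding for
    each input position and concatenate them, so the encoder sees both
    segmental identity and pitch-accent information."""
    phoneme_vecs = phoneme_table[phoneme_ids]  # (T_in, D_phoneme)
    accent_vecs = accent_table[accent_ids]     # (T_in, D_accent)
    return np.concatenate([phoneme_vecs, accent_vecs], axis=-1)
```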
38. Japanese Tacotron: Forward attention
- Proposed by Zhang et al. (2018) [10]
- Enforces left-to-right progress (sketched below)
- Fast alignment learning
(Figure: alignment at epochs 0, 1, 3, 7, 10, and 50.)
[10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
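A sketch of the forward-attention recursion of [10] (the variant without the transition agent): at each step the probability mass can only stay at the same input position or move one position to the right.

```python
import numpy as np

def forward_attention(content_alignment, prev_forward):
    """content_alignment: (T_in,) alignment from a content-based attention
    prev_forward:         (T_in,) forward-attention weights at step t-1
    """
    shifted = np.roll(prev_forward, 1)   # mass arriving from position i-1
    shifted[0] = 0.0                     # nothing enters from the left edge
    fwd = (prev_forward + shifted) * content_alignment
    return fwd / fwd.sum()               # renormalize to probabilities
```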
39. Japanese Tacotron: Limitations
- Speech quality is still worse than that of a traditional pipeline system
- Input feature limitations
- Configuration is based on Tacotron, not Tacotron2
41. Japanese Tacotron: Implementation
URL: https://github.com/nii-yamagishilab/self-attention-tacotron
Framework: TensorFlow, Python
Supported datasets: LJSpeech, VCTK (more coming in the future)
Features:
- Tacotron, Tacotron2, and Japanese Tacotron models
- Combination choices for encoder, decoder, and attention mechanism
- Forced alignment
- Mel spectrogram or vocoder parameters as the target
- Compatible with our WaveNet implementation
44. Bibliography
- [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
- [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
- [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
- [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
- [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
45. Bibliography
- [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783
- [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
- [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
- [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018)
46. Bibliography
- [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
- [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR 2018
- [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. ICLR 2018
- [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality TTS with Transformer. CoRR abs/1809.08895 (2018)
47. Acknowledgement
This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan, and by MEXT KAKENHI Grant Numbers 16H06302, 17H04687, 18H04120, 18H04112, and 18KT0051, Japan.