APSIPA ASC 2021
Ding Ma, Wen-Chin Huang, Tomoki Toda: Investigation of text-to-speech-based synthetic parallel data for sequence-to-sequence non-parallel voice conversion, Dec. 2021
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
1. Investigation of Text-to-speech based Synthetic Parallel Data for Sequence-to-sequence Non-parallel Voice Conversion
Ding Ma, Wen-Chin Huang and Tomoki Toda
Graduate School of Informatics, Nagoya University, Nagoya, Japan
Paper ID: #1606 Presenter: Ding Ma
2. Introduction
• Voice conversion (VC)
• A methodology that aims to convert the speaker identity of speech from a source speaker to a target speaker while preserving the linguistic content.
• VC is expected to play a significant role in augmented human communication.
[Diagram: source speech → VC → target speech]
3. Introduction
• Sequence-to-sequence (seq2seq) modeling
• A seq2seq model takes a sequence of items and outputs another sequence of items; such models have emerged from the development of deep neural networks (DNNs).
• Can automatically determine the output phoneme durations.
• Captures long-term dependencies: prosody (F0 & duration), intonation, …
• Requires a large parallel speech corpus from the source and target speakers for training.
[Diagram: source speech → encoder → attention → decoder → target speech]
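As a concrete illustration of the attention module in the diagram above, here is a minimal NumPy sketch of scaled dot-product attention, which lets the decoder align output frames with encoder states of a different length. This is illustrative only; the actual VTN uses multi-head attention inside a full Transformer.

```python
import numpy as np

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention: each decoder step (query) forms a
    weighted sum over encoder states (values), with weights derived from
    query-key similarity. This alignment is what allows a seq2seq VC model
    to map source and target frame sequences of different lengths."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                  # (T_dec, T_enc)
    # Row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights                          # context, alignment

# Toy example: 5 "encoder frames", 3 "decoder steps", feature dim 4
rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 4))
dec = rng.standard_normal((3, 4))
context, align = dot_product_attention(dec, enc, enc)
print(context.shape, align.shape)  # (3, 4) (3, 5)
```

Each row of `align` is a probability distribution over encoder frames, which is why seq2seq VC can determine output durations automatically instead of relying on an external frame aligner.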
4. Background
• Voice Conversion Challenge 2020 (VCC2020)
• Biennial event to compare the performance of different VC systems.
• 2 tasks: intra-lingual and semiparallel case in Task 1; cross-lingual case in Task 2.
• Parallel: same utterances
• Nonparallel: different utterances
• Semiparallel: parallel + nonparallel situation (can be regarded as a relaxation of the nonparallel case)
• Limited dataset: only 90 utterances in Task 1 / 70 utterances in Task 2
5. Background
• VTN: Voice Transformer Network, a sequence-to-sequence voice conversion model using a Transformer with text-to-speech (TTS) pretraining [1].
• ➕ Reduces the required training data from 1 hr to 5 min (thanks to the pretraining technique).
• ➖ Still needs parallel training data.
• How to tackle the issue of a semiparallel dataset?
• 「Synthetic speech method」
• We extend the VTN model by training TTS models to generate synthetic parallel data (SPD). (Semiparallel → Parallel)
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, "Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining," Proc. Interspeech, pp. 4676-4680, 2020.
6. Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
[Figure: (a) TTS training process using the semiparallel dataset; (b) SPD generation process using source synthetic data, target synthetic data, and external SPD.]
7. Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
• Four types of parallel data are available for training the VC model in total:
1. <source natural, target natural>
2. <SPD with source synthetic, target natural>
3. <source natural, SPD with target synthetic>
4. <external SPD with source synthetic, external SPD with target synthetic>
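The assembly of these four pair types from a semiparallel dataset can be sketched in code. This is a hypothetical illustration, not the paper's implementation; `tts_source` and `tts_target` are stand-ins for the trained speaker-dependent TTS models, and the string tags merely mark which side of each pair is natural or synthetic.

```python
# Stand-ins for trained speaker-dependent TTS models (assumed names).
def tts_source(text):
    return f"syn_src({text})"

def tts_target(text):
    return f"syn_tgt({text})"

def build_training_pairs(parallel_texts, src_only_texts, tgt_only_texts,
                         external_texts):
    pairs = []
    # 1. <source natural, target natural>: the truly parallel portion
    for t in parallel_texts:
        pairs.append((f"nat_src({t})", f"nat_tgt({t})"))
    # 2. <SPD with source synthetic, target natural>: synthesize the missing
    #    source side for target-only utterances
    for t in tgt_only_texts:
        pairs.append((tts_source(t), f"nat_tgt({t})"))
    # 3. <source natural, SPD with target synthetic>: the mirror case
    for t in src_only_texts:
        pairs.append((f"nat_src({t})", tts_target(t)))
    # 4. <external SPD, external SPD>: both sides synthesized from extra text
    for t in external_texts:
        pairs.append((tts_source(t), tts_target(t)))
    return pairs

pairs = build_training_pairs(["a"], ["b"], ["c"], ["d", "e"])
print(len(pairs))  # 5
```

The point of the sketch is that every utterance in a semiparallel dataset, plus any external text, ends up usable as a parallel training pair once the missing side is synthesized.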
8. 「Synthetic Speech Method」
• There are still uncertainties about the effects and usage of SPD in seq2seq VC models. In this paper we address the following three questions:
• Q1: What are the feasibility and properties of using SPD?
• Q1-1: How does the quality of the data affect VC performance?
• Q1-2: Which kind of training pair is better?
• source + target natural / source synthetic only / target synthetic only / natural + synthetic (mixed situation)
• Q2: How can this method benefit from a semiparallel setting?
• Fix the original training data, and vary the semiparallel ratio (0 / 25% / 50% / 75% / 100%)
• Q3: What are the influences of using external text data?
• Fix the original training data, and increase the external data (1k / 2k / 5k)
9. Datasets and Configuration
• Initial dataset: CMU ARCTIC database (containing 1132 parallel utterances recorded by English speakers at 16 kHz)
• Female: clb, slt
• Male: bdl, rms
• Development set and evaluation set: 100 utterances each
• External dataset: M-AILABS database
• English corpus: 15369 utterances, 30 hours long
• Implementation:
• TTS models: pretrained Transformer-TTS architecture
• VC model: VTN (Transformer-based seq2seq VC model) [1]
• Vocoder: Parallel WaveGAN (PWG) neural vocoder [2]
• Objective evaluation: Transformer-based ASR engine trained on the LibriSpeech database [3]
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, "Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining," Proc. Interspeech, pp. 4676-4680, 2020.
[2] R. Yamamoto, E. Song, and J. M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," Proc. ICASSP, pp. 6199-6203, 2020.
[3] L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," Proc. ICASSP, pp. 5884-5888, 2018.
10. Experiment and Evaluation
• Q1: What are the feasibility and properties of using SPD?
Five kinds of training pairs:
1. <source natural, target natural>
2. <source natural, target synthetic>
3. <source synthetic, target natural>
4. <source synthetic, target synthetic>
5. <source synthetic and source natural, target natural and target synthetic>
11. Experiment and Evaluation
• MCD: mel-cepstral distortion / CER: character error rate / WER: word error rate
• The objective evaluation results of Q1
Table I: Comparison results with different training pairs and data sizes. TTS-450, TTS-400, TTS-200, and TTS-80 denote the data size used for TTS fine-tuning (which also reflects TTS performance), SPD generation, and VC training.
• The TTS performance is critical in terms of its impact on the VC results.
• The training pair of source synthetic - target natural generally performs better than the other pairs using SPD.
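The MCD metric used in Table I can be sketched with the standard formula below. This is an illustrative sketch only: a real evaluation first aligns converted and reference frames (e.g., with dynamic time warping), which is omitted here, and the 0th (energy) cepstral coefficient is assumed to be excluded.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """Average MCD in dB between two time-aligned mel-cepstrum sequences
    of shape (frames, dims), with the 0th coefficient already removed."""
    diff = mcep_ref - mcep_conv
    # Per-frame term: sqrt(2 * sum_d (mc_d^ref - mc_d^conv)^2)
    dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Scale to dB and average over frames
    return (10.0 / np.log(10.0)) * np.mean(dist)

# Toy check: identical sequences give 0 dB; a perturbed copy gives > 0 dB
rng = np.random.default_rng(1)
ref = rng.standard_normal((100, 24))
mcd_same = mel_cepstral_distortion(ref, ref)
mcd_noisy = mel_cepstral_distortion(ref, ref + 0.1)
print(round(mcd_same, 3), mcd_noisy > 0)  # 0.0 True
```

Lower MCD means the converted mel-cepstra are closer to the target speaker's natural ones, which is why it is paired with CER/WER (intelligibility) in the tables.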
12. Experiment and Evaluation
• Q2: How can this method benefit from a semiparallel setting?
• Training procedure with different semiparallel settings (e.g., datasize = 400).
• The parallel ratio (PR) represents the proportion of the natural parallel corpus, so as to reflect the semiparallel setting of each group.
• The respective TTS models of the source and target speakers are trained with a constant data size but a different semiparallel setting for each group.
• Two-part experiment: training dataset I retains all SPD, as shown in (a); training dataset II removes the natural-synthetic part of the semiparallel cases from training, as shown in (b).
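The PR-controlled data split described above can be sketched as follows. This is a hypothetical illustration of how a fixed-size training set is divided into a natural-parallel part and a nonparallel remainder; the paper's actual utterance selection may differ.

```python
def split_semiparallel(utterance_ids, parallel_ratio):
    """Split a fixed-size training set by the parallel ratio (PR):
    the first part is kept as natural parallel data, the remainder is
    treated as nonparallel and later completed with SPD."""
    n_parallel = int(len(utterance_ids) * parallel_ratio)
    return utterance_ids[:n_parallel], utterance_ids[n_parallel:]

ids = list(range(400))  # e.g., datasize = 400 as on the slide
splits = {}
for pr in (0.0, 0.25, 0.5, 0.75, 1.0):
    par, nonpar = split_semiparallel(ids, pr)
    splits[pr] = (len(par), len(nonpar))
    print(pr, len(par), len(nonpar))
```

At PR = 1.0 the setting reduces to fully parallel training, and at PR = 0.0 every training pair must involve SPD, which is the regime where TTS quality matters most.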
13. Experiment and Evaluation
• The objective evaluation results of Q2
Table II: Experimental results under different semiparallel settings.
15. Experiment and Evaluation
• The objective evaluation results of Q3
Table III: Experimental results of adding external data with different data sizes. TTS-400 and TTS-200 denote the data size used for TTS fine-tuning.
16. Experiment and Evaluation
• The subjective evaluation (MOS) results of Q1, Q2, and Q3 under specific datasets.
Table IV: Results of subjective evaluation using the test set under data sizes of 450 and 80, with 95% confidence intervals, for Q1.
Table V: Results of subjective evaluation using the test sets, with 95% confidence intervals, for Q2.
Table VI: Results of subjective evaluation using the test sets under data sizes of 400 and 200, with 95% confidence intervals, for Q3.
• The overall results are consistent with the findings of the objective evaluations.
17. Conclusions
• SPD is feasible for seq2seq nonparallel VC. The VC results using SPD are determined by the performance of the TTS models and the VC training data size. In addition, the VC results are also affected by which side of the pair the SPD is applied to.
• When the dataset is semiparallel, we should try to ensure the PR is large enough. If the original data size is large, introducing SPD on either the target-speaker or the source-speaker side can achieve good VC results; thus, making full use of all types of SPD to secure a sufficient amount of data maximizes the benefit. On the contrary, when the original data size is small, well-performing TTS models are difficult to obtain, and training pairs with a negative impact, such as source natural - target synthetic, should be avoided.
• SPD generated from external text data can, as data augmentation, improve parallel seq2seq VC performance to a certain extent (e.g., natural-natural).
18. Future work
• Use more speakers and a larger amount of data to further investigate the beneficial trends that seq2seq nonparallel VC can obtain from SPD.
• In terms of methodology, future research can introduce VC models that can be trained directly on nonparallel data and compare their performance with the SPD-based approach to seq2seq VC, so as to further clarify the role of SPD.