33. References
[Hsu+ 2016] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, H.-M. Wang. Voice conversion from non-parallel
corpora using variational auto-encoder. Proc. APSIPA ASC, 6 pages, 2016.
[Huang+ 2020] W.-C. Huang, T. Hayashi, S. Watanabe, T. Toda. The sequence-to-sequence baseline for the
Voice Conversion Challenge 2020: cascading ASR and TTS. Proc. Joint workshop for the Blizzard Challenge
and Voice Conversion Challenge 2020, pp. 160‒164, 2020.
[Huang+ 2021] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, T. Toda. Pretraining techniques for
sequence-to-sequence voice conversion. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp.
745‒755, 2021.
[Huang+ 2022] W.-C. Huang, S.-W. Yang, T. Hayashi, T. Toda. A comparative study of self-supervised
speech representation based voice conversion. IEEE Journal of Selected Topics in Signal Processing, 2022
(https://arxiv.org/abs/2207.04356).
[Itakura+ 1968] F. Itakura, S. Saito. Analysis synthesis telephony based upon the maximum likelihood
method. Proc. ICA, C-5-5, pp. C17‒20, 1968.
[Kobayashi+ 2016] K. Kobayashi, S. Takamichi, S. Nakamura, T. Toda. The NU-NAIST voice conversion
system for the Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1667‒1671, 2016.
[Kobayashi+ 2018] K. Kobayashi, T. Toda. sprocket: open-source voice conversion software. Proc. Odyssey,
pp. 203‒210, 2018.
[Kong+ 2020] J. Kong, J. Kim, J. Bae. HiFi-GAN: generative adversarial networks for efficient and high
fidelity speech synthesis. Proc. NeurIPS, pp. 17022‒17033, 2020.
[Liu+ 2018] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, L.-R. Dai. WaveNet vocoder with limited training data
for voice conversion. Proc. INTERSPEECH, pp. 1983‒1987, 2018.
[Liu+ 2020] L.-J. Liu, Y.-N. Chen, J.-X. Zhang, Y. Jiang, Y.-J. Hu, Z.-H. Ling, L.-R. Dai. Non-parallel voice
conversion with autoregressive conversion model and duration adjustment. Proc. Joint workshop for the
Blizzard Challenge and Voice Conversion Challenge 2020, pp. 126‒130, 2020.
[Morise+ 2016] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis
system for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877‒1884, 2016.
[Stylianou+ 1998] Y. Stylianou, O. Cappé, E. Moulines. Continuous probabilistic transform for voice
conversion. IEEE Trans. Speech & Audio Process., Vol. 6, No. 2, pp. 131‒142, 1998.
[Sun+ 2016] L. Sun, K. Li, H. Wang, S. Kang, H.M. Meng. Phonetic posteriorgrams for many-to-one voice
conversion without parallel data training. Proc. IEEE ICME, 6 pages, 2016.
[Tobing+ 2019] P.L. Tobing, Y. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with
cyclic variational autoencoder. Proc. INTERSPEECH, pp. 674‒678, 2019.
[Tobing+ 2020] P.L. Tobing, Y. Wu, T. Toda. Baseline system of Voice Conversion Challenge 2020 with
cyclic variational autoencoder and parallel WaveGAN. Proc. Joint workshop for the Blizzard Challenge and
Voice Conversion Challenge 2020, pp. 155‒159, 2020.
[Toda+ 2007] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222‒2235,
2007.
[Toda+ 2016] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[van den Oord+ 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[Wang+ 2020] X. Wang, S. Takaki, J. Yamagishi. Neural source-filter waveform models for statistical
parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2020.
[Wu+ 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a
non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural
network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
[Wu+ 2021b] Y.-C. Wu, T. Hayashi, P.L. Tobing, K. Kobayashi, T. Toda. Quasi-periodic WaveNet: an
autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network.
IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
[Yamagishi+ 2019] J. Yamagishi, C. Veaux, K. MacDonald. CSTR VCTK corpus: English multi-speaker
corpus for CSTR voice cloning toolkit. University of Edinburgh, CSTR, 2019 (https://doi.org/10.7488/ds/2645).
[Yamamoto+ 2020] R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: a fast waveform generation model
based on generative adversarial networks with multi-resolution spectrogram. Proc. ICASSP, pp. 6199‒6203,
2020.
[Yoneyama+ 2022] R. Yoneyama, Y.-C. Wu, T. Toda. Unified source-filter GAN with harmonic-plus-noise
source excitation generation. Proc. INTERSPEECH, 2022 (https://arxiv.org/abs/2205.06053).
[Zhang+ 2020] J.-X. Zhang, L.-J. Liu, Y.-N. Chen, Y.-J. Hu, Y. Jiang, Z.-H. Ling, L.-R. Dai. Voice conversion by
cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. Proc. Joint
workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 121‒125, 2020.