The past, present and future of singing synthesis

The Past, Present and Future of Singing Voice Modeling
Kanru Hua (華侃如)
June 19, 2016

Motivation
“You are making too many assumptions, this thing won’t work on real
speech signal.”
— Jont B. Allen
● What’s wrong with current and past research in this area?
● What’s our next step?

What’s in a Speech/Singing Synthesizer
Text / Music Score → Parameter Generator → Vocoder → Speech Audio
● Parameter Generator: generate pitch, duration and spectrum… from input
● Vocoder: generate waveform from parameters

Part 1
History of Speech
Analysis/Synthesis
(http://clas.mq.edu.au/speech/synthesis/history_synthesis/)

History of Math & Acoustics
Timeline, 1600 to 2000:
● Newton: Law of Forces/Motions, Foundation of Calculus
● Bernoulli, Euler, d'Alembert: Wave Equation, Complex Number
● Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside: Fourier/Laplace Transform, Analog Circuits & Electromagnetism
● Filtering Theory, Digital Systems, Sampling Theory, ...
(http://www2.ling.su.se/staff/hartmut/kemplne.htm)

History of Math & Acoustics
[Figure: the same 1600 to 2000 timeline, with an added diagram labeled "Frequency Response"]
(http://www2.ling.su.se/staff/hartmut/kemplne.htm)

Source-Filter Model
Lung → Vocal Folds → Vocal Tract → Lip
Signal Generator (Source) → Filter 1 → Filter 2
Signal Generator → Filter 0 → Filter 1 → Filter 2
[Figure: source shown over time (t), filters shown over frequency (f)]
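
The "Signal Generator → Filter 1 → Filter 2" chain can be sketched in a few lines. This is only an illustration under assumptions that are not on the slide: the source is a bare impulse train, Filter 1 is a single two-pole resonance standing in for the vocal tract, and Filter 2 is a first difference standing in for lip radiation. All function names and parameter values are hypothetical.

```python
import numpy as np

def impulse_train(f0, fs, duration):
    """Signal generator (source): one unit impulse per pitch period."""
    n = int(fs * duration)
    x = np.zeros(n)
    period = int(round(fs / f0))
    x[::period] = 1.0
    return x

def resonator(x, freq, bandwidth, fs):
    """Filter 1 (assumed): a two-pole resonance standing in for one formant."""
    r = np.exp(-np.pi * bandwidth / fs)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def lip_radiation(x):
    """Filter 2 (assumed): first difference as a rough lip radiation model."""
    return np.diff(x, prepend=0.0)

fs = 16000
source = impulse_train(f0=100.0, fs=fs, duration=0.5)
speech = lip_radiation(resonator(source, freq=700.0, bandwidth=100.0, fs=fs))
```

Because the stages are linear and applied one after another, changing `f0` in the source leaves both filters untouched, which is exactly the separation the diagram expresses.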

20th Century, the Dawn of Speech Processing
Cooley and Tukey (1965): Fast Fourier Transform
Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis
[Diagram: Input → Analysis → Pitch (source), Cepstrum (vocal tract filter) → Synthesis → Output; also labeled: Spectrum]
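
The split into pitch (source) and cepstrum (vocal tract filter) can be illustrated with a real cepstrum. The sketch below is not Oppenheim's system; it is a minimal example assuming a Hann-windowed frame, a pitch search range of 60 to 400 Hz, and a test signal made of an impulse train passed through a short decaying filter. All names are hypothetical.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log magnitude spectrum of a windowed frame."""
    spectrum = np.fft.fft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_mag).real

def split_cepstrum(cep, fs, min_f0=60.0, max_f0=400.0):
    """Return (pitch estimate in Hz, smoothed log spectral envelope)."""
    lo, hi = int(fs / max_f0), int(fs / min_f0)
    # High-quefrency peak: the pitch period in samples.
    peak = lo + int(np.argmax(cep[lo:hi]))
    f0 = fs / peak
    # Low quefrencies (below the pitch search range) describe the envelope.
    lifter = np.zeros(len(cep))
    lifter[:lo] = 1.0
    lifter[-lo + 1:] = 1.0  # mirror half, since the real cepstrum is symmetric
    envelope = np.fft.fft(cep * lifter).real
    return f0, envelope

fs = 16000
n = 2048
excitation = np.zeros(n)
excitation[::160] = 1.0                 # 100 Hz impulse train
h = 0.9 ** np.arange(64)                # assumed low-pass "tract" response
frame = np.convolve(excitation, h)[:n]

f0, envelope = split_cepstrum(real_cepstrum(frame), fs)
```

Taking the logarithm turns the product of source and filter spectra into a sum, so the two parts can be separated by keeping different quefrency ranges.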

Family Tree of Speech A/S Algorithms
[Figure: family tree linking the following nodes]
Algorithms and models:
● Source-Filter Model
● Linear Prediction
● Homomorphic Filtering (Oppenheim, 1969)
● Phase Vocoder (Flanagan et al., 1966)
● LSP/LSF (Itakura, 1975)
● CELP (Atal & Schroeder, 1983)
● MGC/MLSA (Imai, et al., 1983)
● PSOLA (?, 1985)
● Sinusoidal Model (McAulay & Quatieri, 1986)
● SMS (Serra, 1989)
● Harmonic+Noise (Stylianou, 1993)
● STRAIGHT (Kawahara, 1998)
● Sine+Noise+Transient (Levin & Smith, 1998)
● NBVPM (Bonada, 2004)
● TANDEM-STRAIGHT (Kawahara & Morise, 2007)
● WBVPM (Bonada, 2008)
● Quasi-Harmonic Model (Pantazis, et al., 2008)
● WORLD1 (Morise, 2009)
● WORLD2 (Morise, 2013)
● Vocaine (Agiomyrgiannakis, 2015)
Products and implementations:
● Autotune
● Sinsy
● Melodyne & NiaoNiao & tn_fnds
● Vocaloid
● Vocaloid 2+
● RUCE (Rocaloid 4)
● Rocaloid 3
● CeVIO
● Chiptune

Quasi-static Assumption
Algorithms affected:
● Homomorphic Filtering
● PSOLA
● Linear Prediction & CELP & MLSA
● Sinusoidal Model
● Harmonic+Noise Model
● SMS & NBVPM
● WORLD & STRAIGHT (slightly)
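
As a rough illustration of why this assumption matters, the sketch below analyzes two single frames with the same windowed FFT: one with constant pitch, one whose pitch glides within the frame. It assumes the usual reading of "quasi-static" (parameters treated as constant inside each analysis frame); the signals, frame length and the `spectral_spread` measure are hypothetical choices, not taken from the slides.

```python
import numpy as np

def spectral_spread(frame, fs):
    """Energy-weighted standard deviation of frequency (Hz) in one windowed frame."""
    mag2 = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    centroid = np.sum(freqs * mag2) / np.sum(mag2)
    return np.sqrt(np.sum((freqs - centroid) ** 2 * mag2) / np.sum(mag2))

fs = 16000
n = 1024
t = np.arange(n) / fs

# Frame that satisfies the assumption: a steady 200 Hz tone.
steady = np.sin(2 * np.pi * 200.0 * t)

# Frame that violates it: pitch rises from 200 Hz to 400 Hz inside the frame.
# The phase is the integral of the instantaneous frequency.
sweep_rate = 200.0 / t[-1]
glide = np.sin(2 * np.pi * (200.0 * t + 0.5 * sweep_rate * t ** 2))

steady_spread = spectral_spread(steady, fs)
glide_spread = spectral_spread(glide, fs)
```

The gliding frame yields a much wider spectral peak than the steady one, so a frame-based method that expects one sharp peak per harmonic sees a blurred picture.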

Mis-represented Aperiodic Component
Popular belief:
1. Speech = periodic signal + aperiodic signal (breathing noise)
2. Aperiodic signal is filtered white noise
[Figure labels: Periodic; Aperiodic (Friction)]
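
The two-part popular belief above can be written down directly, which makes it easier to see what it leaves out. The sketch below implements the belief that this slide criticizes, not a recommended model: a sum of harmonics plus white noise through a fixed filter. The harmonic amplitudes, the two-tap filter and all names are hypothetical.

```python
import numpy as np

def harmonic_part(f0, amplitudes, fs, n):
    """Periodic component: a sum of sinusoids at integer multiples of f0."""
    t = np.arange(n) / fs
    k = np.arange(1, len(amplitudes) + 1)[:, None]
    amps = np.asarray(amplitudes)[:, None]
    return np.sum(amps * np.sin(2 * np.pi * k * f0 * t), axis=0)

def filtered_noise(fir, n, rng):
    """Aperiodic component under the popular belief: white noise through a fixed filter."""
    white = rng.standard_normal(n + len(fir) - 1)
    return np.convolve(white, fir, mode="valid")

fs = 16000
n = 4000
rng = np.random.default_rng(0)

periodic = harmonic_part(200.0, [1.0, 0.5, 0.25], fs, n)
# Two-tap high-pass FIR as an assumed stand-in for a breath-noise envelope.
aperiodic = 0.05 * filtered_noise(np.array([1.0, -0.9]), n, rng)
speech = periodic + aperiodic
```

Note that the noise here has the same statistics at every instant, independent of where it falls within the pitch period.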

Mis-represented Aperiodic Component
Algorithms affected:
● (Quasi-)Harmonic+Noise Model
● SMS & Sines+Noise+Transients Model
● WORLD & (TANDEM-)STRAIGHT
● Algorithms that do not model aperiodic component
○ Phase vocoder, CELP, MLSA, ...

Over-simplified Source-Filter Model
[Diagram: Oscillator → Tract Filter → Lip Filter, and Oscillator → Tract Filter, with a "Source Filter" block]
Assumption: the source filter is independent of pitch
Equivalent assumption:
“When my pitch is higher by 12 semitones, my vocal folds still
oscillate at the same speed.”
Affected algorithms: all of those listed on page 11
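
One way to read the "equivalent assumption" is in terms of a single glottal pulse. The sketch below is an illustration under strong simplifications that are not from the slides: a Hann window stands in for one pulse, and the pitch-dependent alternative simply shrinks the pulse in proportion to the period. All names are hypothetical.

```python
import numpy as np

def pulse_train(f0, pulse, fs, n):
    """Place one copy of `pulse` at the start of every pitch period (overlap-add)."""
    out = np.zeros(n + len(pulse))
    period = int(round(fs / f0))
    for start in range(0, n, period):
        out[start:start + len(pulse)] += pulse
    return out[:n]

def hann_pulse(length):
    """Hypothetical stand-in for the shape of one glottal pulse."""
    return np.hanning(length)

fs = 16000
base_f0 = 100.0
shifted_f0 = base_f0 * 2 ** (12 / 12)   # 12 semitones up = one octave

base_pulse = hann_pulse(80)             # 5 ms pulse in a 10 ms period
base_ratio = len(base_pulse) / (fs / base_f0)

# Pitch-independent source: the same 5 ms pulse is reused at the higher pitch.
fixed = pulse_train(shifted_f0, base_pulse, fs, 1600)
fixed_ratio = len(base_pulse) / (fs / shifted_f0)

# Pitch-dependent source (assumed): the pulse shrinks with the period.
scaled_pulse = hann_pulse(40)
scaled = pulse_train(shifted_f0, scaled_pulse, fs, 1600)
scaled_ratio = len(scaled_pulse) / (fs / shifted_f0)
```

With the fixed pulse, the fraction of each period occupied by the pulse doubles when the pitch doubles; with the scaled pulse it stays the same as at the original pitch.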

Part 3
Future: How to Fix &
the Low Level Speech Model

“Neoclassical” Approaches to Speech Modeling
[Diagram: Input → Inverse → Source, Tract, Lip; source shown over time (t), tract and lip shown over frequency (f)]
● Linear Prediction
● ARX (Wen, et al., 1995)
● ARX-LF (Vincent, et al., 2005)
● LF Model (Liljencrants, Fant and Lin, 1985)
● OVE Synthesizer (Fant, 1953)
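
Plain linear prediction is the simplest of the inverse approaches listed here, so it makes a compact example. The sketch below is not ARX or ARX-LF and is not the author's method; it assumes the autocorrelation method, a synthetic two-pole "tract", and white-noise excitation so the result is easy to check. All names are hypothetical.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction; returns [1, a1, ..., ap]."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def inverse_filter(x, a):
    """Apply the FIR filter A(z) to remove the estimated resonances."""
    return np.convolve(x, a)[:len(x)]

# Synthetic test signal: white-noise excitation through a known two-pole filter.
fs = 16000
rng = np.random.default_rng(0)
excitation = rng.standard_normal(8000)
radius, freq = 0.95, 1000.0
a_true = np.array([1.0, -2 * radius * np.cos(2 * np.pi * freq / fs), radius ** 2])

x = np.zeros(len(excitation))
for n in range(len(x)):
    x[n] = excitation[n]
    if n >= 1:
        x[n] -= a_true[1] * x[n - 1]
    if n >= 2:
        x[n] -= a_true[2] * x[n - 2]

a_est = lpc(x, order=2)
residual = inverse_filter(x, a_est)   # estimate of the excitation
```

For real speech the excitation is a glottal flow rather than white noise, which is one reason the later entries in the list add an explicit source model.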

“Neoclassical” Approaches to Speech Modeling
Degottex (2013): similar idea, but in the frequency domain
Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input.

The Low Level Speech Model (new version)
Level 0 (Signal Level): Input Signal → Pitch, Harmonic Model, Noise Model → Output Signal
● Noise Model: Spectrum, plus Channel 1 Energy, Channel 2 Energy, Channel 3 Energy, ..., each described by a Harmonic Model
Level 1 (Acoustic Level): Glottal/Source Information (LF Model), Vocal Tract Filter, Lip Filter
An acoustically meaningful speech model

Inverse Analysis of Speech
[Figure: Original waveform; Glottal Flow (Source Signal)]

Pitch Shifting powered by LLSM
Original
50% Pitch
200% Pitch
Instants of vocal fold closure were revealed

References
● Oppenheim, A. V. "Speech Analysis-Synthesis System Based on Homomorphic Filtering." Journal of the Acoustical Society of America 45.2 (1969).
● Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.
● Dunn, H. K. "The calculation of vowel resonances, and an electrical vocal tract." Journal of the Acoustical Society of America 22 (1950): 740-753.
● Pantazis, Yannis, and Yannis Stylianou. "Improving the modeling of the noise part in the harmonic plus noise model of speech." IEEE International Conference on Acoustics, Speech and Signal Processing (2008).
