Things Not Described (and Guesses)
• Kernel size of the dilated filters: 2
• Number of layers (residual blocks): 4×10 to 6×10
• Number of channels in hidden layers: hundreds? 256?
• Any other activation function in a residual block? Maybe not
• Batch normalization: no reason not to use it
• Sampling frequency: ‘at least 16 kHz’
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?
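To make the first two guesses concrete, here is a minimal sketch of a causal dilated convolution with kernel size 2, plus a dilation schedule for stacks of 10 layers. The doubling pattern (1, 2, 4, ..., 512 per stack) follows the WaveNet paper's description. The stack count, the scalar weights, and the function names `causal_dilated_conv` and `dilation_schedule` are assumptions for illustration only, not the authors' implementation.

```python
def causal_dilated_conv(x, w_prev, w_curr, dilation):
    """Kernel-size-2 causal convolution on a 1-D signal.

    y[t] = w_prev * x[t - dilation] + w_curr * x[t]
    Samples before the start of the signal are treated as zeros.
    Scalar weights stand in for the real multi-channel filters (assumption).
    """
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out.append(w_prev * past + w_curr * x[t])
    return out


def dilation_schedule(layers_per_stack=10, stacks=4):
    """Dilations 1, 2, 4, ..., 2**(layers_per_stack - 1), repeated once per stack.

    stacks=4 is the lower end of the guess above (4x10 layers), not a known value.
    """
    return [2 ** i for _ in range(stacks) for i in range(layers_per_stack)]


if __name__ == "__main__":
    print(causal_dilated_conv([1.0, 2.0, 3.0, 4.0], 1.0, 1.0, dilation=2))
    print(len(dilation_schedule()), max(dilation_schedule()))
```

Each output sample depends only on the current input and one input `dilation` steps in the past, which is what keeps the layer causal.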
• Single-speaker speech dataset
• North American English dataset: 24.6 hours
• Mandarin Chinese dataset: 34.8 hours
• Receptive field: 240 ms
• Ad hoc architecture, as shown in the figure
[Figure: linguistic features h_i; fundamental frequency F0(t) and duration(t); linguistic features h(t)]
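The stated 240 ms receptive field can be compared against the guessed layer counts with a little arithmetic. The sketch below assumes kernel size 2, dilations that double within each stack of 10 layers, a 16 kHz sampling rate, and no extra layers before the dilated stacks. All of these are guesses, and the function names are hypothetical.

```python
def receptive_field_samples(kernel_size=2, layers_per_stack=10, stacks=4):
    """Receptive field (in samples) of stacked dilated convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d samples.
    Dilations are assumed to be 1, 2, 4, ..., 2**(layers_per_stack - 1) per stack.
    """
    rf = 1
    for _ in range(stacks):
        for i in range(layers_per_stack):
            rf += (kernel_size - 1) * 2 ** i
    return rf


def receptive_field_ms(samples, sample_rate_hz=16000):
    """Convert a receptive field in samples to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


if __name__ == "__main__":
    for stacks in (4, 5, 6):
        n = receptive_field_samples(stacks=stacks)
        print(stacks, "stacks:", n, "samples,", receptive_field_ms(n), "ms")
```

Under these assumptions, four stacks cover 4093 samples (about 256 ms) and six stacks cover 6139 samples (about 384 ms). At 16 kHz, 240 ms is 3840 samples, so the four-stack guess lands closest, though the actual configuration remains unknown.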
TTS: Mean Opinion Score
• TIMIT dataset (possibly ~4 hours)
• Add a pooling layer after the dilated convolutions, with 160× downsampling (does it mean the 7th layer?)
• Then a few non-causal convolutions.
• Loss to predict the next sample (same as ordinary WaveNet)
• And a loss to classify the frame
• 18.8 PER (phone error rate), the best score among models trained on raw audio.
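A rough sketch of what 160× downsampling by pooling could look like, assuming mean pooling over non-overlapping windows and 16 kHz audio. The function names are hypothetical, and a plain list of floats stands in for one channel of activations.

```python
def mean_pool(activations, factor=160):
    """Average non-overlapping windows of `factor` samples.

    A trailing partial window is dropped (an assumption; padding is another option).
    """
    n_frames = len(activations) // factor
    return [
        sum(activations[i * factor:(i + 1) * factor]) / factor
        for i in range(n_frames)
    ]


def frame_duration_ms(factor=160, sample_rate_hz=16000):
    """Duration of one pooled frame in milliseconds."""
    return 1000.0 * factor / sample_rate_hz


if __name__ == "__main__":
    one_second = [0.0] * 16000          # one second of dummy activations
    print(len(mean_pool(one_second)))   # number of frames per second
    print(frame_duration_ms())          # frame length in ms
```

At 16 kHz, a factor of 160 corresponds to 10 ms frames, i.e. 100 frames per second. Since 160 is not a power of two, the factor may simply reflect the frame length rather than a particular dilation layer.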