Grant Reaber, “WaveNet and WaveNet 2: Generating high-quality audio with neural nets”
1. Neural Speech Synthesis with
WaveNet and WaveNet 2
Grant Reaber
Head of Research, Respeecher
gr@respeecher.com
2. Why WaveNet?
WaveNet, and similar models like SampleRNN, are the only machine
learning models that directly generate audio waveforms (PCM).
The quality they produce is unmatched (the best-quality text-to-speech of any
system)
These models are an essential piece of any truly state-of-the-art system that
needs to generate sound (though impressive results are also possible with
other techniques)
These systems can be used to produce various kinds of audio, but our focus
will be on speech
3. Autoregressive Models
WaveNet (and SampleRNN) learn the distribution of each audio sample
conditional on all that have come before.
In symbols, the joint probability of a waveform x1, …, xT is modeled as
p(x1, …, xT) = ∏_{t=1}^{T} p(xt | x1, …, xt−1)
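To make this concrete, here is a minimal NumPy sketch (not from the talk) of the factorized log-likelihood as it is computed under teacher forcing; the `probs` array is a hypothetical stand-in for the network's per-timestep predictions.

```python
import numpy as np

# Assume a (hypothetical) model has already produced, for every timestep t,
# a categorical distribution over 256 quantized amplitude values, conditioned
# on the true samples x_1..x_{t-1} (teacher forcing):
#   probs[t, k] = p(x_t = k | x_1, ..., x_{t-1})
T = 16000
probs = np.full((T, 256), 1.0 / 256)      # placeholder predictions
x = np.random.randint(0, 256, size=T)     # quantized waveform

# The joint log-likelihood is the sum of the per-sample conditional log-probs.
log_likelihood = np.sum(np.log(probs[np.arange(T), x]))
```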
4. Training and Generation
In training, we learn to predict each sample in a piece of audio given all that
have come before.
Because WaveNet is convolutional, we can do this in parallel for a sequence
of samples. (With RNNs, you could not do this and would use teacher
forcing.)
We generate audio sample by sample. Suppose we have generated the first
n samples. Then we draw from the predicted distribution for the (n+1)th
sample conditional on these n samples. Now we can compute a predicted
distribution for the (n+2)th sample conditional on these n+1 samples.
We can’t do this in parallel, and because of this the original WaveNet
required minutes to generate a second of audio and was impractical for
most applications.
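For illustration (not DeepMind's code), here is a sketch of that sample-by-sample loop; `predict_next_distribution` is a hypothetical stand-in for a trained WaveNet.

```python
import numpy as np

def generate(predict_next_distribution, n_samples, rng=np.random.default_rng()):
    """Sample-by-sample generation (the slow part of the original WaveNet).

    `predict_next_distribution` is a hypothetical stand-in for a trained model:
    given all samples generated so far, it returns a length-256 probability
    vector over the next quantized sample.
    """
    samples = []
    for _ in range(n_samples):
        p = predict_next_distribution(samples)   # conditioned on the n samples so far
        samples.append(rng.choice(256, p=p))     # draw the (n+1)th sample
    return np.array(samples)
```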
Sampling from the distribution works much better than using beam search
to find a high likelihood sequence as is commonly done in machine
translation.
5. Conditional Autoregressive Models
Running generation by itself produces a kind of babbling.
For most applications, for instance text-to-speech, we want to control the
generated audio.
This can be done by training a conditional model, where we condition on
linguistic features derived from the input text. The conditions can then be
supplied at generation time to produce the speech we want.
We can also supply “global” (not changing in time) conditions for things like
speaker identity, to produce speech from many different speakers with one
model.
6. Modeling a Sample
WaveNet uses a 256-bin softmax to represent audio (8-bit sample depth,
using “mu-law encoding” to have smaller bins near zero).
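For illustration, a sketch of the standard mu-law companding formula with mu = 255; the exact quantization details DeepMind used are an assumption.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map waveform samples in [-1, 1] to 256 bins (8-bit), with finer
    resolution near zero (mu-law companding)."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)   # bins 0..255

def mu_law_decode(bins, mu=255):
    """Invert the companding: bin index back to an amplitude in [-1, 1]."""
    compressed = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```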
Training is slow at first, and it doesn’t scale to higher sample depth.
So WaveNet 2 uses a discretized mixture of logistic distributions instead (10
components according to Tacotron 2 paper).
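And a sketch of how one might draw a single sample from such a mixture at one timestep; the clamping and numerical details are assumptions, not taken from the paper.

```python
import numpy as np

def sample_mixture_of_logistics(weights, means, log_scales,
                                rng=np.random.default_rng()):
    """Draw one audio sample from a mixture of logistic distributions.

    weights, means, log_scales: arrays of shape (n_components,) predicted by
    the network for a single timestep (10 components in the Tacotron 2 setup).
    """
    k = rng.choice(len(weights), p=weights)          # pick a mixture component
    u = rng.uniform(1e-5, 1 - 1e-5)                  # uniform noise
    x = means[k] + np.exp(log_scales[k]) * (np.log(u) - np.log1p(-u))  # inverse logistic CDF
    return np.clip(x, -1.0, 1.0)                     # keep in valid amplitude range
```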
WaveNet 2 is functionally identical to WaveNet except for this change,
modeling 24kHz audio instead of 16kHz, and one hyperparameter change to
increase the receptive field, which we will mention later.
7. Why WaveNet 2?
Although WaveNet 2 does make some very minor tweaks to the
architecture, which we have just discussed, by far its main contribution is a
technique to speed up generation by about 3000x (20x realtime generation).
The technique is interesting: first train a regular WaveNet, then use it to
train a model that produces audio in parallel rather than sample by sample.
The second, “distilled” model produces output that is just as high quality as
the original model’s output.
8. Modeling Speech (and other audio)
Log mel-scale magnitude spectrograms seem to compactly represent all the
information necessary to represent speech (cf. Tacotron 2)
80 channels × 80 Hz (vs. 1 channel × 16–44 kHz for PCM)
Computed algorithmically from PCM
Lossy, especially because phase information is discarded
Can be inverted with Griffin-Lim
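As an illustration, a sketch of computing such a spectrogram with librosa; the FFT size and hop length are assumptions chosen to give an 80 Hz frame rate at 16 kHz.

```python
import numpy as np
import librosa

# Load audio and compute an 80-channel log mel spectrogram at ~80 frames/sec.
# (hop_length = 200 at 16 kHz gives an 80 Hz frame rate; n_fft is an assumption.)
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=200, n_mels=80)
log_mel = np.log(mel + 1e-6)          # shape: (80, n_frames)
```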
But WaveNet does a better job of inverting them. It is used in Tacotron 2,
currently the best text-to-speech system.
Note that when WaveNet is inverting a generated spectrogram, as with
Tacotron 2, it can actually learn to correct errors in that spectrogram. (The
loss function does not enforce that the spectrogram inversion is correct,
only that the whole transformation is.)
9. Convolutions
Convolutions are parallelizable in training, and they respect the structure of
the sequence.
But they have two problems.
For autoregression, it is critical not to allow the model to see input from
the future because it won’t be available at generation time (and it would be
trivial for the model to predict the next sample if it could see it!)
The receptive field grows slowly with ordinary convolutions: after n
convolutions of width w, it is only on the order of nw. If, say, the audio is
16kHz, and we want 60ms of context, we need a receptive field of about
1000 samples.
10. Dilated Causal Convolutions
The looking-into-the-future problem is fixed by just shifting the windows of
the convolutions so they only see the past. (The fancy term for this is
“causal” convolution.)
The small receptive field problem is fixed by using discontinuous windows.
If we give every nth sample as input to the convolutional kernel, we say that
the convolution has a dilation factor of n (so ordinary convolutions have a
dilation factor of 1).
In WaveNet, successive convolutional layers have exponentially growing
dilation factors: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.
So with 10 layers and window size 2, we can get a receptive field of 1024
samples.
11. Convolution Hyperparameters
WaveNet used 30 layers of convolutions, three stacks of 10 convolutions
with dilation going from 1 to 512.
WaveNet 2 uses the same hyperparameters except that it uses a window
size of 3 instead of 2. (This is the “one hyperparameter change” mentioned
earlier.)
(DeepMind doesn’t release code and is not entirely explicit about their
hyperparameters, so this isn’t certain.)
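For illustration, a small helper (not from the paper) that computes the receptive field of such stacks; it reproduces the 1024-sample figure above and shows the effect of using three stacks and of window size 3.

```python
def receptive_field(n_stacks, layers_per_stack, kernel_size):
    """Receptive field (in samples) of stacked dilated causal convolutions,
    with dilations 1, 2, 4, ... within each stack."""
    dilations = [2 ** i for i in range(layers_per_stack)] * n_stacks
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(1, 10, 2))   # 1024: one stack of 10 layers, window size 2
print(receptive_field(3, 10, 2))   # 3070: the original 30-layer WaveNet
print(receptive_field(3, 10, 3))   # 6139: the same stacks with window size 3
```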
12. The Rest of the Architecture
(Actually, there should be separate convs
for the residual and skip connections and for
the tanh and sigmoid gates.)
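Since the architecture figure doesn't come through in this transcript, here is a minimal NumPy sketch of one such layer as described: a dilated causal convolution feeding tanh and sigmoid gates, with separate 1×1 convolutions for the residual and skip paths. The shapes, the kernel size of 2, and the weight layout are assumptions.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution with kernel size 2.
    x: (channels_in, T), w: (channels_out, channels_in, 2). Output: (channels_out, T)."""
    past = np.pad(x, ((0, 0), (dilation, 0)))[:, :x.shape[1]]   # x shifted right by `dilation`
    return np.einsum('oi,it->ot', w[:, :, 0], past) + np.einsum('oi,it->ot', w[:, :, 1], x)

def residual_block(x, w_filter, w_gate, w_res, w_skip, dilation):
    """One WaveNet layer: gated activation, then separate 1x1 convs for the
    residual and skip paths (as the parenthetical above notes)."""
    z = np.tanh(causal_dilated_conv(x, w_filter, dilation)) * \
        (1 / (1 + np.exp(-causal_dilated_conv(x, w_gate, dilation))))
    skip = np.einsum('oi,it->ot', w_skip, z)          # 1x1 conv to skip channels
    residual = x + np.einsum('oi,it->ot', w_res, z)   # 1x1 conv + residual connection
    return residual, skip
```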
13. Best guess hyperparameters
The paper is scant on details about hyperparameters, but here are best
guesses based on a talk by an insider (see https://github.com/ibab/
tensorflow-wavenet/issues/227).
256 skip channels
512 dilation channels
512 residual channels
14. Conditioning
Global and local conditions are projected to the number of dilation
channels and added to the outputs of the filter and gate convolutions of the
gated activation units (before activation).
Separate projections for filter and gate, global and local, and for each layer
(so 30 layer WaveNet with both types of conditioning uses 120 projection
matrices).
The local condition is often at a lower time frequency than the WaveNet output.
If so, it is upsampled to the WaveNet frequency using transposed convolutions.
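A sketch of how the conditioning might be wired in, following the description above; the nearest-neighbour upsampling is a simplification standing in for the learned transposed convolutions, and all shapes are assumptions.

```python
import numpy as np

def upsample_local_condition(c, factor):
    """Upsample conditioning frames (channels, n_frames) to the audio rate.
    Nearest-neighbour repetition as a stand-in for learned transposed convs."""
    return np.repeat(c, factor, axis=1)

def conditioned_gated_activation(x_filter, x_gate, h_local, h_global,
                                 v_f_local, v_g_local, v_f_global, v_g_global):
    """Add projected local/global conditions to the filter and gate
    pre-activations of one layer (separate projections per layer and path)."""
    f = x_filter + v_f_local @ h_local + (v_f_global @ h_global)[:, None]
    g = x_gate + v_g_local @ h_local + (v_g_global @ h_global)[:, None]
    return np.tanh(f) * (1 / (1 + np.exp(-g)))
```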
15. WaveNet 2
The rest of the talk will be about how WaveNet 2 is able to generate
samples 3000x faster than the original WaveNet.
The basic idea is to first train a WaveNet and then use it to train a student
network.
The student network takes noise z1,…,zT, drawn from a standard logistic
distribution, as input. For each timestep t it outputs parameters st and μt,
computed only from z1,…,zt−1, of a logistic distribution from which xt will
be drawn. The draw is controlled by zt: xt = zt·st + μt.
1 / (1 + e^−(x−μ)/s)  (the CDF of a logistic distribution with location μ and scale s)
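A sketch of the parallel sampling this describes; `predict_params` is a hypothetical stand-in for the student network.

```python
import numpy as np

def sample_student(predict_params, T, rng=np.random.default_rng()):
    """One pass of the parallel student sampler.

    `predict_params` is a hypothetical stand-in for the student WaveNet: given
    the whole noise sequence it returns per-timestep (mu, s), each computed
    causally, i.e. only from z_1..z_{t-1}.  All timesteps run in parallel.
    """
    u = rng.uniform(1e-5, 1 - 1e-5, size=T)
    z = np.log(u) - np.log1p(-u)          # standard logistic noise z_1..z_T
    mu, s = predict_params(z)             # shapes (T,), (T,)
    x = z * s + mu                        # x_t = z_t * s_t + mu_t, all at once
    return x, mu, s
```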
16. The Primary Training Objective
The student is trained to minimize the KL divergence from it to the teacher
WaveNet, DKL(PS||PT) = H(PS, PT) - H(PS).
It is trying to generate samples that the teacher WaveNet considers likely,
but at the same time it is trying to maximize its own entropy (so it will not
collapse to a mode of the teacher).
For reasons of time, I will skip the procedure for estimating cross-entropy.
H(PS) can be simply estimated. Since the entropy of a logistic distribution
with scale parameter s is ln s + 2,
H(PS) = E_{z ∼ L(0,1)} [ Σ_{t=1}^{T} ln st ] + 2T
17. Power Loss
It turns out that minimizing the KL divergence alone produces a student that
just produces low-volume audio not resembling speech (perhaps resembling
whispering). This may be because whispering has a lot of entropy compared
to speech.
It is necessary to use an additional loss term that ensures that the power in
different frequency bands of the generated speech is on average about the
same as in human speech.
The loss is ‖φ(g(z, c)) − φ(y)‖², where φ(x) = |STFT(x)|², g(z, c) is the
student’s output for noise z and condition c, and y is real speech.
It is really only average power that matters, not having the right power over
time: averaging over time before computing the difference in the loss
didn’t make any noticeable difference to the results.
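A sketch of such a power loss using librosa's STFT; the frame parameters and the exact way the difference is averaged are assumptions.

```python
import numpy as np
import librosa

def power_loss(generated, reference, n_fft=1024, hop_length=256):
    """Penalize a mismatch in squared-magnitude STFT power between the
    student's output and real speech: phi(x) = |STFT(x)|^2."""
    phi_g = np.abs(librosa.stft(generated, n_fft=n_fft, hop_length=hop_length)) ** 2
    phi_y = np.abs(librosa.stft(reference, n_fft=n_fft, hop_length=hop_length)) ** 2
    return np.mean((phi_g - phi_y) ** 2)   # squared distance over bins and frames
```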
18. Other Losses
With KL-Loss and power loss, the student is already pretty good, but for
the best results, two other losses helped.
They used a perceptual loss similar to the style loss used in style transfer
but with features extracted by a WaveNet-like classifier trained to predict
phones from raw audio.
And they used a contrastive loss that penalizes the student for producing
samples that the teacher assigns high likelihood regardless of the
conditioning.
19. Multiple Flows
Another thing that was not strictly necessary but improved the final result
was to chain several student models together with the output of one fed to
the input of the next.
The final distribution is still logistic, with a location and scale obtained by
composing the per-flow transforms (the scales multiply together, and each
flow’s location gets rescaled by all later flows’ scales).
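A sketch of composing chained affine flows, assuming the per-flow parameters have already been computed; it shows why the end result is still a single logistic.

```python
import numpy as np

def compose_flows(mus, scales):
    """Compose N chained affine flows x_i = s_i * x_{i-1} + mu_i.

    mus, scales: lists of per-timestep arrays, one pair per flow.  The result
    is again a single logistic: x_N = s_total * z + mu_total.
    """
    s_total = np.ones_like(scales[0])
    mu_total = np.zeros_like(mus[0])
    for mu, s in zip(mus, scales):        # apply flows in order
        mu_total = s * mu_total + mu      # later flows rescale earlier locations
        s_total = s * s_total
    return mu_total, s_total
```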
20. Architecture of the Student
The student network is a WaveNet except that it doesn’t have skip
connections. (They don’t give the reason for this change.)
21. Questions?
Thanks for attending!
I hope we have time for a few questions
If you are interested in developing models like WaveNet and WaveNet 2,
Respeecher is hiring! Talk to me or our CEO Alex Serdiuk at the
conference, or send us a message on our web site, respeecher.com.