WaveNet is a generative model for raw audio that uses dilated causal convolutions to process long audio sequences efficiently. It has achieved strong results in multi-speaker speech generation, text-to-speech, music generation, and speech recognition by modeling the raw waveform directly. The author discusses ideas for improving WaveNet further, such as RNN convolution kernels with multiple scales or RNNs nested within RNNs, to better capture relationships at different timescales.
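
To make the phrase "dilated causal convolutions" concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the WaveNet implementation: the function names `causal_dilated_conv` and `receptive_field` are hypothetical, the weights are fixed rather than learned, and the gated activations, residual and skip connections, and output distribution of the real model are left out. It shows only two properties: each output sample depends on the current and earlier input samples, and stacking layers with doubling dilations makes the receptive field grow exponentially with depth.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with a dilation factor.

    Computes y[t] = sum_k weights[k] * x[t - k * dilation], treating
    samples before the start of the sequence as zero, so y[t] never
    depends on any input later than t.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample
    after stacking one layer per entry in `dilations`."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed configuration for illustration: kernel size 2 and
# dilations doubling at every layer (1, 2, 4, ..., 512).
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024

# Push a unit impulse through a small three-layer stack (dilations 1, 2, 4).
signal = [1.0] + [0.0] * 15
for d in [1, 2, 4]:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
print(signal)  # eight 1.0 values followed by eight 0.0 values
```

With kernel size 2, ten layers cover 1024 samples, whereas ten undilated layers would cover only 11. This is why dilation lets the model reach long contexts with relatively few layers.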