Amplitude Spectrogram Prediction
from Mel-Frequency Cepstrum Coefficients and
Loudness Using Deep Neural Networks
Shoya Kawaguchi* and Daichi Kitamura*
*National Institute of Technology, Kagawa College, Japan
2023 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing
Session: 1AM2-2: Sound and Speech Processing
Time: Wed., 1st Mar., 12:40-13:00 (UTC -10)
2
Background
• Timbre conversion of musical instrument sounds
– Differentiable Digital Signal Processing (DDSP) [Engel+, 2020]
– Generation of musical instrument sounds using variational auto-encoder
(VAE) [Luo+, 2019]
• Conversion of timbre and generation of musical instrument
sound using VAE [Kingma+, 2013]
– Timbre intermediate between piano and guitar
– New musical instrument sound
[Figure: existing music is converted into new music through the proposed timbre conversion.]
3
VAE
• A type of unsupervised learning
• Computes probability distributions from latent variables and provides an interpretable latent space
[Figure: example latent space.]
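The sampling step that makes VAE training possible can be sketched in a few lines. This is a minimal NumPy illustration of the reparameterization trick and the Gaussian KL term, not the authors' implementation; the 8-dimensional latent size and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I); this keeps sampling differentiable
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, summed over latent dimensions
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

mu = np.zeros(8)       # encoder mean for one input (toy values)
log_var = np.zeros(8)  # encoder log-variance
z = reparameterize(mu, log_var)
print(z.shape)                     # (8,)
print(kl_divergence(mu, log_var))  # 0.0 when q matches the prior exactly
```

Decoding such a sampled z is what yields the "intermediate" timbres between clusters in the latent space.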
4
• Proposed timbre conversion system
– extracts “Pitch”, “Timbre”, and “Volume” from the musical instrument sound
– trains “Timbre” using a VAE
• Problem
– “Timbre” is a dimensionality-reduced feature
– Decoding from “Timbre” back to a spectrogram cannot be realized through linear operations or in an analytical manner
[Figure: overview of the proposed timbre conversion system. The original wave is transformed by the STFT into the original amplitude spectrogram (frequency vs. time); the encoder extracts Pitch, Timbre (coefficient vs. time), and Volume (vs. time); the VAE trains the extracted Timbre and outputs a generated Timbre; the DNN decoder (the scope of this presentation) predicts the generated amplitude spectrogram, which is converted to the generated wave via phase recovery and the inverse STFT.]
5
• We employ a DNN as the decoder
– The DNN decoder predicts the amplitude spectrogram of the synthesized sound from the input “Pitch”, “Timbre”, and “Volume”
Subject of this presentation
Experimentally investigate a suitable DNN architecture for the DNN decoder
[Figure: training of the DNN decoder. The encoder extracts Pitch, Timbre (MFCC, coefficient vs. time), and Volume (loudness vs. time); the pitch (C3 … D♭3 … B5) selects the corresponding pitch-specific DNN decoder, and the loss between the original and predicted amplitude spectrograms (frequency vs. time) is minimized. Timbre: mel-frequency cepstrum coefficient (MFCC).]
6
Input features
• Volume: loudness
– A time-frame-wise sound-volume feature
• Timbre: mel-frequency cepstrum coefficients (MFCC)
– A legacy timbre feature
[Figure: the amplitude spectrogram (frequency in kHz vs. time in s, with frequency bin and time frame indices) yields the loudness (volume vs. time); the loudness-normalized amplitude spectrogram is passed through a mel-filter bank and the DCT to obtain the MFCC (coefficient vs. time).]
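The feature pipeline above can be sketched with NumPy/SciPy. This is one plausible realization, not the authors' code: the helper names (`mel_filter_bank`, `loudness_and_mfcc`) are hypothetical, and the frame-wise RMS used for loudness is an assumption; the mel-filter parameters (64 filters, 0–8 kHz) follow the experimental conditions later in the deck.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=64, n_fft=1024, sr=16000, fmin=0.0, fmax=8000.0):
    # Triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def loudness_and_mfcc(Y, n_mfcc=64):
    # Y: amplitude spectrogram, shape (freq_bins, time_frames)
    loud = np.sqrt(np.mean(Y**2, axis=0))   # frame-wise RMS as loudness (assumed)
    Y_norm = Y / np.maximum(loud, 1e-8)     # loudness-normalized spectrogram
    mel = np.log(mel_filter_bank() @ Y_norm + 1e-8)  # mel-band log energies
    mfcc = dct(mel, type=2, axis=0, norm='ortho')[:n_mfcc]
    return loud, mfcc

Y = np.abs(np.random.default_rng(1).standard_normal((513, 126)))
loud, mfcc = loudness_and_mfcc(Y)
print(loud.shape, mfcc.shape)  # (126,) (64, 126)
```

Normalizing by loudness before the DCT, as on the slide, keeps the MFCC a volume-independent timbre descriptor.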
7
Input features
• Pitch: sound note number
– Multiple DNN decoders are trained, one for each pitch
• Note-specific DNN decoders are prepared
• The pitch estimated by the encoder is used to select the corresponding DNN decoder
[Figure: the encoder estimates the pitch (here D♭3) from the original amplitude spectrogram; the corresponding note-specific DNN decoder is selected from the bank (C3 … B5), and it predicts the amplitude spectrogram from the MFCC (coefficient vs. time) and loudness (volume vs. time).]
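The note-indexed selection can be sketched as a simple lookup. This is a hypothetical illustration: the decoder objects are placeholders, and the mapping of C3…B5 to note numbers 48…83 assumes the MIDI convention C4 = 60, which the slides do not state.

```python
# Hypothetical bank of note-specific decoders, one per note from C3 to B5.
decoders = {note: f"decoder_{note}" for note in range(48, 84)}  # placeholder "models"

def select_decoder(estimated_note):
    # The pitch estimated by the encoder indexes the matching DNN decoder
    if estimated_note not in decoders:
        raise KeyError(f"no decoder trained for note {estimated_note}")
    return decoders[estimated_note]

print(select_decoder(49))  # D♭3 under this convention
```

Because the note number is discrete, selection replaces any attempt to condition a single decoder on a continuous pitch input.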
8
Multilayer perceptron
• Multilayer perceptron (MLP)
[Figure: the input data are vectorized into 8190 dimensions, passed through three hidden layers of 512, 512, and 1024 units, and the 64638-dimensional output is reshaped into the predicted amplitude spectrogram (frequency vs. time).]
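The MLP decoder's forward pass might look like the following sketch. The slide sizes (8190 → 512 → 512 → 1024 → 64638, presumably (64 MFCCs + 1 loudness) × 126 frames in and 513 bins × 126 frames out) are replaced by toy sizes here to keep the example lightweight; ReLU is mentioned only in the speaker notes, and the non-negative output layer is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Slide sizes: 8190 -> 512 -> 512 -> 1024 -> 64638. Toy sizes used below.
sizes = [130, 32, 32, 64, 60]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_decode(x):
    # x: vectorized MFCC + loudness features for one sound
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)          # three hidden layers
    y = relu(h @ weights[-1] + biases[-1])  # non-negative amplitudes (assumed)
    return y.reshape(6, 10)          # reshape vector to a (toy) spectrogram

x = rng.standard_normal(130)
print(mlp_decode(x).shape)  # (6, 10)
```

The vectorize/reshape steps at the two ends are what let a plain fully connected network map a feature matrix to a spectrogram.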
9
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: two chains of recurrent units process the input data forward and backward along the time axis; the outputs of the two directions are combined by entry-wise products to form the predicted amplitude spectrogram.]
10
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: each time frame of the input data is fed to one forward and one backward recurrent unit; at the final layer, the output vectors of the two directions are multiplied entry-wise to construct the predicted amplitude spectrogram.]
11
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: four BiRNN layers, each unit being a GRU or an LSTM in both directions; the forward and backward outputs are combined by entry-wise products into the predicted amplitude spectrogram.]
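The bidirectional pass and the entry-wise combination can be sketched with plain tanh cells standing in for GRU/LSTM (a deliberate simplification; the real decoder uses gated units and four layers). All function names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(X, Wx, Wh, b, reverse=False):
    # X: (T, d_in); a plain tanh RNN cell stands in for GRU/LSTM here
    T, d_hidden = X.shape[0], Wh.shape[0]
    H = np.zeros((T, d_hidden))
    h = np.zeros(d_hidden)
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(X[t] @ Wx + h @ Wh + b)
        H[t] = h
    return H

def birnn_layer(X, d_hidden=16):
    # One bidirectional layer: run the sequence forward and backward,
    # then combine the two directions by an entry-wise product (as in the figure)
    d_in = X.shape[1]
    Wx_f = rng.standard_normal((d_in, d_hidden)) * 0.1
    Wh_f = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    Wx_b = rng.standard_normal((d_in, d_hidden)) * 0.1
    Wh_b = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    b = np.zeros(d_hidden)
    Hf = rnn_pass(X, Wx_f, Wh_f, b)                # forward in time
    Hb = rnn_pass(X, Wx_b, Wh_b, b, reverse=True)  # backward in time
    return Hf * Hb

X = rng.standard_normal((126, 65))  # 126 time frames, 64 MFCCs + loudness
print(birnn_layer(X).shape)  # (126, 16)
```

Because each output frame sees both the past (forward chain) and the future (backward chain), the network can exploit temporal dependency in both directions.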
12
• Dataset of musical instruments: NSynth [Engel+, 2017]
– 305,979 four-second signals of various musical instrument sounds with a 16 kHz sampling frequency
– split into 289,205 (95%) training, 12,678 (4%) validation, and 4,096 (1%) test data
• Other conditions
Condition                              Value
Window and shift lengths in STFT       64 ms / 32 ms
Window function in STFT                Hann window
Number of epochs                       10000
Maximum frequency of mel filter bank   8.00 kHz
Minimum frequency of mel filter bank   0.00 kHz
Number of mel filters                  64
Loss function                          Mean squared error
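The STFT settings above imply the following frame geometry. The frame count assumes centered framing, which is one common convention and is not stated on the slide; the variable names are illustrative.

```python
sr = 16000                       # NSynth sampling frequency
n_samples = 4 * sr               # four-second clips
win = int(0.064 * sr)            # 64 ms window -> 1024 samples
hop = int(0.032 * sr)            # 32 ms shift -> 512 samples
n_bins = win // 2 + 1            # 513 frequency bins per frame
n_frames = 1 + n_samples // hop  # 126 frames, assuming centered framing
print(win, hop, n_bins, n_frames)  # 1024 512 513 126
```

These numbers are consistent with the MLP slide's dimensions: 513 × 126 = 64638 output values and (64 + 1) × 126 = 8190 input values.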
13
Results
• The best case (flute)
[Figure: original and predicted amplitude spectrograms (Original, MLP, BiGRU, BiLSTM).]
14
Results
• Typical case (keyboard)
[Figure: original and predicted amplitude spectrograms (Original, MLP, BiGRU, BiLSTM).]
15
Evaluation criteria
• Amplitude relative squared error (ARSE)
– Relative error between the original and predicted amplitude spectrograms
• MFCC relative squared error (MRSE)
– Relative error between the MFCCs of the original and predicted amplitude spectrograms
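A direct reading of these two criteria is a squared error normalized by the reference's energy. The exact normalization in the paper's equations is not reproduced on the slide, so treat this NumPy sketch as an assumption-labeled illustration.

```python
import numpy as np

def arse(Y, Y_hat):
    # relative squared error between original and predicted amplitude spectrograms
    return np.sum((Y - Y_hat)**2) / np.sum(Y**2)

def mrse(M, M_hat):
    # same relative squared error, computed on the two spectrograms' MFCCs
    return np.sum((M - M_hat)**2) / np.sum(M**2)

Y = np.ones((513, 126))
print(arse(Y, Y))        # 0.0 for a perfect prediction
print(arse(Y, 0.9 * Y))  # 1% relative squared error
```

Lower is better for both scores, matching the "Poor/Good" axes on the next slide.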
16
Average of ARSE and MRSE
[Figure: bar charts of the average ARSE (top) and MRSE (bottom) for each instrument (Bass, Brass, Flute, Guitar, Keyboard, Mallet, Organ, Reed, String, Vocal), with the poor/good directions marked; the best case (flute) and the worst case (mallet) are highlighted.]
17
Conclusion
• Motivation
– Solve the problem in the proposed timbre conversion system, which enables timbre conversion and musical instrument sound generation
– Construct a DNN decoder to predict the amplitude spectrogram from the MFCC
• Results
– Using BiLSTM as the DNN decoder, the amplitude spectrogram could be predicted with high accuracy
• Future work
– Train the MFCC with the VAE
– Construct the whole proposed timbre conversion system
Thank you for your attention.


Editor's Notes

  • #2 【0:00】 (Thank you for your introduction.) Hi everyone, I'm Shoya Kawaguchi from the National Institute of Technology, Japan. I am going to talk about this title.
  • #3 【0:10】 In recent years, deep neural networks have been developed for timbre conversion and sound generation. As an example, differentiable digital signal processing (DDSP) utilizes DNNs to synthesize acoustic signals. Variational autoencoders (VAEs) are also utilized to analyze and generate musical instrument sounds. We believe that we can contribute to the development of music culture through these kinds of techniques, namely timbre conversion and instrumental sound generation. Therefore, we aim to propose an accurate timbre conversion system. We expect that the proposed system can generate new instrument sounds, for example, an intermediate sound of piano and guitar. 【1:10】
  • #4 【1:10】 Let me explain the VAE. A VAE is a DNN that has an interpretable low-dimensional space called the latent space. I explain the latent space using this figure. When we train the VAE using images of hand-written numbers, the latent space is obtained like this figure. We can see some clusters, like "7" or "9", in this latent space. Furthermore, we can see new images that look like intermediate numbers between "7" and "9". We are planning to apply this VAE not to hand-written numbers but to the timbres of musical instruments. Therefore, we expect that clusters of piano and guitar timbres are obtained in the latent space. Also, a new timbre that unifies multiple musical instruments can be generated, like an intermediate timbre of piano and guitar. 【2:20】
  • #5 【2:20】 This slide explains the overview of our proposed timbre conversion system based on the VAE. First, this system calculates the amplitude spectrogram from the original wave, and then extracts "pitch," "timbre," and "volume" through the encoder from the amplitude spectrogram. Next, the timbres are input to the VAE. Thus, the VAE trains only the timbres of various musical instruments as the latent space. When we generate a new instrumental sound, the output of the VAE, the generated timbre, is input to the decoder together with the pitch and volume information. Finally, the decoder predicts the amplitude spectrogram of the newly generated sound. When we train the VAE, we use this process flow: the VAE is updated so that the output wave becomes equivalent to the original wave. After the training, the VAE can generate a new timbre feature from random values. Therefore, by using the pitch, generated timbre, and volume, we can generate a new instrumental sound through the decoder. However, there is a problem with this system: there is no linear decoder that can predict the amplitude spectrogram from pitch, timbre, and volume, because the timbre is a dimensionality-reduced feature. To solve this problem, we propose to use a DNN-based decoder that predicts an amplitude spectrogram from these features. This is the scope of this presentation. 【4:30】
Q&A — How is the amplitude spectrogram obtained? We used the short-time Fourier transform (STFT) to calculate the amplitude spectrogram. The STFT is the most widely used method to map a signal from the time domain to the time-frequency domain.
Q&A — How is the wave obtained from the predicted amplitude spectrogram? We used the Griffin-Lim algorithm to recover the phase from the amplitude spectrogram. First, random phase information is assigned to the amplitude spectrogram; then the algorithm iterates the inverse STFT and STFT, updating only the phase information.
Q&A — Why not predict the waveform directly? Predicting the original wave from the three features with a DNN decoder is equivalent to predicting both the amplitude and phase spectrograms of the original wave, and in general the prediction of the phase spectrogram is very difficult. For this reason, we focus only on the prediction of the amplitude spectrogram in this research. Of course, direct prediction of the sound wave is an ideal approach, so we will examine it as future work.
  • #6 【4:30】 This slide shows the details of the DNN decoder. We train a DNN that predicts the original amplitude spectrogram of various instrumental sounds from only the pitch, timbre, and volume features. Therefore, the loss function between the original and predicted amplitude spectrograms is minimized. In particular, we use the mel-frequency cepstrum coefficients (MFCC) as the timbre feature and the loudness as the volume feature. 【5:10】
  • #7 【5:10】 Let me explain the three features we used. The loudness is a volume feature that represents the time-frame-wise energy of the sound. The MFCC is a legacy timbre feature that can be calculated by applying a mel-filter bank and the discrete cosine transform to the amplitude spectrogram. In this research, we normalize the amplitude spectrogram using the loudness before calculating the MFCC. This amplitude spectrogram Y is a piano sound spanning one octave of tones. We apply the loudness normalization and obtain the normalized spectrogram. Finally, we calculate its MFCC by applying the mel-filter bank and DCT. In this case, since the amplitude spectrogram Y consists of tones with the same timbre, the MFCC does not change, like this. 【6:25】
Q&A — Why is the loudness separated? The information in a musical instrument sound comprises pitch, timbre, and volume. We hope that the latent space of the VAE trains the pure timbre; therefore, we disentangle pitch, timbre, and volume from the amplitude spectrogram.
  • #8 【6:25】 As the pitch feature, we utilize sound note numbers such as C3, D♭3, or B5. Since the note number is discrete information, we prepare note-specific DNN decoders by training them on sound datasets of the same pitch. Then, the pitch information is used to select the DNN decoder. The input to the DNN decoder is defined as X, which consists of the MFCC and loudness. Finally, the predicted amplitude spectrogram is obtained like this equation. 【7:10】
Q&A — How is the pitch information obtained? We used the sound note number prepared as a label in NSynth. For example, in the file name "keyboard_acoustic_002_069_025", 002 is the instrument number among all keyboards, 069 is the sound note number, and 025 is the velocity.
  • #9 【7:10】 Let me explain the three types of DNNs that are used as the DNN decoder in this research. The first one is the multilayer perceptron (MLP), which is the most basic type of DNN. We use an MLP with three hidden layers. At the input layer, the input data X is vectorized, and at the final layer, the output vector is reshaped to the predicted amplitude spectrogram. 【7:50】
Q&A — We used ReLU as the activation function.
  • #10 【7:50】 The second one is the bidirectional recurrent neural network (BiRNN). A BiRNN consists of two RNNs with forward and backward directions along the time-frame axis, and trains each unit based on the output of the adjacent unit. For example, in the forward direction, the past unit is connected to the current unit; in the backward direction, the future unit is connected to the current unit. For this reason, a BiRNN can train the network considering the temporal dependency. 【8:35】
  • #11 【8:35】 At the input layer, we input the vectors of each time frame in X to each of the forward and backward units, as in this figure. At the final layer, we multiply the output vectors of the forward and backward directions, and then construct the amplitude spectrogram from the resulting vectors. 【9:05】
  • #12 【9:05】 In this research, we use two types of BiRNNs, namely the gated recurrent unit (BiGRU) and long short-term memory (BiLSTM). We use four RNN layers for both BiGRU and BiLSTM, as shown in this figure. 【9:30】
  • #13 【9:30】 Let's move on to the experiment. We used a dataset of musical instrument sounds provided as NSynth. NSynth is an audio dataset of four-second-long signals of various musical instrument sounds and consists of about 300,000 signals. These signals were split into 95% training, 4% validation, and 1% test data. The other conditions are shown in this table. We used the mean squared error as the loss function for all types of DNNs. 【10:20】
Q&A — Why NSynth? As a condition of the BiRNN, only fixed-length data can be used as input. In addition, the VAE needs a variety of musical instrument sounds for training. Therefore, we used NSynth, which satisfies both conditions.
  • #14 【10:20】 The first example is a flute sound. These figures show the original and predicted amplitude spectrograms for each DNN. Please note that this result is the best case of the prediction. In the case of the MLP, there are many spectral holes, which produce harmful artificial distortion. In contrast, BiGRU and BiLSTM accurately predicted the original amplitude spectrogram. In particular, we can confirm that the harmonic structures of the flute sound are precisely synthesized. Let's listen to the sounds of these spectrograms, in the order original, BiGRU, and BiLSTM. As you can hear, the BiLSTM achieves the best performance in this case. 【11:50】
Q&A — Why could the flute be predicted? The flute sound has simple harmonic structures and volume attenuations, so it was presumably easy for the DNNs to predict the amplitude spectrogram.
Q&A — Why did the MLP fail? We consider the reason to be the use of ReLU as the activation function of the MLP: its outputs contain many zero elements, which increase the spectral holes.
  • #15 【11:50】 The next example is a keyboard sound. This is a typical case among the test data. The keyboard sound has a more complex time-frequency structure compared with the flute sound. However, similar to the previous result, BiGRU and BiLSTM can predict the original amplitude spectrogram with high accuracy. Again, I will play these sounds. You can hear that BiGRU includes slight artificial distortion compared with BiLSTM. 【12:45】
Q&A — Why could the keyboard be predicted? We consider this is due to the loudness used as an input, because the loudness is a feature with volume information.
  • #16 【12:45】 To evaluate the performance of the prediction, we used two objective scores: the amplitude relative squared error (ARSE) and the MFCC relative squared error (MRSE). ARSE and MRSE are defined by these equations, where ARSE is the squared error between the original and predicted amplitude spectrograms, and MRSE is the squared error between the original MFCC and the MFCC of the predicted amplitude spectrogram. 【13:30】
  • #17 【13:30】 These graphs show the ARSE and MRSE averaged over each instrument. The horizontal axis shows the musical instruments, and the vertical axis shows the scores. Each bar corresponds to MLP, BiGRU, or BiLSTM. As we already heard, the flute sounds achieved the best ARSE score. On the other hand, the mallet sounds had the lowest score in this experiment. From all the results, we can see that BiLSTM always outperforms the other DNNs in both scores. 【14:25】
  • #18 【14:25】 This is the conclusion. That's all. Thank you for your attention. 【14:30】
  • #19 【8:05】 Let us now describe the experimental conditions. First, we explain the structure of the DNNs. The MLP consists of three hidden layers.
  • #20 【12:20】 The final example is a mallet sound, and this is the worst case among the test data. In these results, none of the DNNs could predict the original amplitude spectrogram accurately. Please listen to these sounds. The reason for this prediction failure is likely that the mallet is often characterized as a percussion instrument, which has complex inharmonic structures and time transitions, whereas the other instruments, for example the flute and keyboard, have simple harmonic structures and time transitions.
Q&A — The mallet sound has complex inharmonic structures, so it was presumably difficult for the DNNs to predict the amplitude spectrogram.