Amplitude Spectrogram Prediction
from Mel-Frequency Cepstrum Coefficients and
Loudness Using Deep Neural Networks
Shoya Kawaguchi* and Daichi Kitamura*
*National Institute of Technology, Kagawa College, Japan
2023 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing
Session: 1AM2-2: Sound and Speech Processing
Time: Wed., 1st Mar., 12:40-13:00 (UTC -10)
2
Background
• Timbre conversion of musical instrument sounds
– Differentiable Digital Signal Processing (DDSP) [Engel+, 2020]
– Generation of musical instrument sounds using variational auto-encoder
(VAE) [Luo+, 2019]
• Conversion of timbre and generation of musical instrument
sound using VAE [Kingma+, 2013]
– Timbre intermediate between piano and guitar
– New musical instrument sound
[Figure: existing music is converted into new music through the proposed timbre conversion.]
3
VAE
• A type of unsupervised learning
• Computes probability distributions from latent variables and provides an interpretable latent space
[Figure: example latent space.]
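The sampling step that makes VAE training possible can be sketched in a few lines. This is a minimal NumPy illustration of the reparameterization trick and the Gaussian KL term, not the authors' implementation; the 8-dimensional latent size and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I); this keeps sampling differentiable
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, summed over latent dimensions
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

mu = np.zeros(8)       # encoder mean for one input (toy values)
log_var = np.zeros(8)  # encoder log-variance
z = reparameterize(mu, log_var)
print(z.shape)                     # (8,)
print(kl_divergence(mu, log_var))  # 0.0 when q matches the prior exactly
```

Decoding such a sampled z is what yields the "intermediate" timbres between clusters in the latent space.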
4
• Proposed timbre conversion system
– extracts “Pitch”, “Timbre”, and “Volume” from the musical instrument sound
– trains “Timbre” using a VAE
• Problem
– “Timbre” is a dimensionality-reduced feature
– Decoding from “Timbre” back to a spectrogram cannot be realized through linear operations or in an analytical manner
[Figure: overview of the proposed timbre conversion system. The original wave is transformed by the STFT into the original amplitude spectrogram (frequency vs. time); the encoder extracts Pitch, Timbre (coefficient vs. time), and Volume (vs. time); the VAE trains the extracted Timbre and outputs a generated Timbre; the DNN decoder (the scope of this presentation) predicts the generated amplitude spectrogram, which is converted to the generated wave via phase recovery and the inverse STFT.]
5
• We employ a DNN as the decoder
– The DNN decoder predicts the amplitude spectrogram of the synthesized sound from the input “Pitch”, “Timbre”, and “Volume”
Subject of this presentation
Experimentally investigate a suitable DNN architecture for the DNN decoder
[Figure: training of the DNN decoder. The encoder extracts Pitch, Timbre (MFCC, coefficient vs. time), and Volume (loudness vs. time); the pitch (C3 … D♭3 … B5) selects the corresponding pitch-specific DNN decoder, and the loss between the original and predicted amplitude spectrograms (frequency vs. time) is minimized. Timbre: mel-frequency cepstrum coefficient (MFCC).]
6
Input features
• Volume: loudness
– A time-frame-wise sound-volume feature
• Timbre: mel-frequency cepstrum coefficients (MFCC)
– A legacy timbre feature
[Figure: the amplitude spectrogram (frequency in kHz vs. time in s, with frequency bin and time frame indices) yields the loudness (volume vs. time); the loudness-normalized amplitude spectrogram is passed through a mel-filter bank and the DCT to obtain the MFCC (coefficient vs. time).]
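The feature pipeline above can be sketched with NumPy/SciPy. This is one plausible realization, not the authors' code: the helper names (`mel_filter_bank`, `loudness_and_mfcc`) are hypothetical, and the frame-wise RMS used for loudness is an assumption; the mel-filter parameters (64 filters, 0–8 kHz) follow the experimental conditions later in the deck.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=64, n_fft=1024, sr=16000, fmin=0.0, fmax=8000.0):
    # Triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def loudness_and_mfcc(Y, n_mfcc=64):
    # Y: amplitude spectrogram, shape (freq_bins, time_frames)
    loud = np.sqrt(np.mean(Y**2, axis=0))   # frame-wise RMS as loudness (assumed)
    Y_norm = Y / np.maximum(loud, 1e-8)     # loudness-normalized spectrogram
    mel = np.log(mel_filter_bank() @ Y_norm + 1e-8)  # mel-band log energies
    mfcc = dct(mel, type=2, axis=0, norm='ortho')[:n_mfcc]
    return loud, mfcc

Y = np.abs(np.random.default_rng(1).standard_normal((513, 126)))
loud, mfcc = loudness_and_mfcc(Y)
print(loud.shape, mfcc.shape)  # (126,) (64, 126)
```

Normalizing by loudness before the DCT, as on the slide, keeps the MFCC a volume-independent timbre descriptor.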
7
Input features
• Pitch: sound note number
– Multiple DNN decoders are trained, one for each pitch
• Note-specific DNN decoders are prepared
• The pitch estimated by the encoder is used to select the corresponding DNN decoder
[Figure: the encoder estimates the pitch (here D♭3) from the original amplitude spectrogram; the corresponding note-specific DNN decoder is selected from the bank (C3 … B5), and it predicts the amplitude spectrogram from the MFCC (coefficient vs. time) and loudness (volume vs. time).]
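The note-indexed selection can be sketched as a simple lookup. This is a hypothetical illustration: the decoder objects are placeholders, and the mapping of C3…B5 to note numbers 48…83 assumes the MIDI convention C4 = 60, which the slides do not state.

```python
# Hypothetical bank of note-specific decoders, one per note from C3 to B5.
decoders = {note: f"decoder_{note}" for note in range(48, 84)}  # placeholder "models"

def select_decoder(estimated_note):
    # The pitch estimated by the encoder indexes the matching DNN decoder
    if estimated_note not in decoders:
        raise KeyError(f"no decoder trained for note {estimated_note}")
    return decoders[estimated_note]

print(select_decoder(49))  # D♭3 under this convention
```

Because the note number is discrete, selection replaces any attempt to condition a single decoder on a continuous pitch input.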
8
Multilayer perceptron
• Multilayer perceptron (MLP)
[Figure: the input data are vectorized into 8190 dimensions, passed through three hidden layers of 512, 512, and 1024 units, and the 64638-dimensional output is reshaped into the predicted amplitude spectrogram (frequency vs. time).]
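The MLP decoder's forward pass might look like the following sketch. The slide sizes (8190 → 512 → 512 → 1024 → 64638, presumably (64 MFCCs + 1 loudness) × 126 frames in and 513 bins × 126 frames out) are replaced by toy sizes here to keep the example lightweight; ReLU is mentioned only in the speaker notes, and the non-negative output layer is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Slide sizes: 8190 -> 512 -> 512 -> 1024 -> 64638. Toy sizes used below.
sizes = [130, 32, 32, 64, 60]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_decode(x):
    # x: vectorized MFCC + loudness features for one sound
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)          # three hidden layers
    y = relu(h @ weights[-1] + biases[-1])  # non-negative amplitudes (assumed)
    return y.reshape(6, 10)          # reshape vector to a (toy) spectrogram

x = rng.standard_normal(130)
print(mlp_decode(x).shape)  # (6, 10)
```

The vectorize/reshape steps at the two ends are what let a plain fully connected network map a feature matrix to a spectrogram.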
9
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: two chains of recurrent units process the input data forward and backward along the time axis; the outputs of the two directions are combined by entry-wise products to form the predicted amplitude spectrogram.]
10
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: each time frame of the input data is fed to one forward and one backward recurrent unit; at the final layer, the output vectors of the two directions are multiplied entry-wise to construct the predicted amplitude spectrogram.]
11
Bidirectional recurrent neural network
• Bidirectional recurrent neural network (BiRNN)
– using gated recurrent units (BiGRU)
– using long short-term memory units (BiLSTM)
[Figure: four BiRNN layers, each unit being a GRU or an LSTM in both directions; the forward and backward outputs are combined by entry-wise products into the predicted amplitude spectrogram.]
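The bidirectional pass and the entry-wise combination can be sketched with plain tanh cells standing in for GRU/LSTM (a deliberate simplification; the real decoder uses gated units and four layers). All function names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(X, Wx, Wh, b, reverse=False):
    # X: (T, d_in); a plain tanh RNN cell stands in for GRU/LSTM here
    T, d_hidden = X.shape[0], Wh.shape[0]
    H = np.zeros((T, d_hidden))
    h = np.zeros(d_hidden)
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(X[t] @ Wx + h @ Wh + b)
        H[t] = h
    return H

def birnn_layer(X, d_hidden=16):
    # One bidirectional layer: run the sequence forward and backward,
    # then combine the two directions by an entry-wise product (as in the figure)
    d_in = X.shape[1]
    Wx_f = rng.standard_normal((d_in, d_hidden)) * 0.1
    Wh_f = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    Wx_b = rng.standard_normal((d_in, d_hidden)) * 0.1
    Wh_b = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    b = np.zeros(d_hidden)
    Hf = rnn_pass(X, Wx_f, Wh_f, b)                # forward in time
    Hb = rnn_pass(X, Wx_b, Wh_b, b, reverse=True)  # backward in time
    return Hf * Hb

X = rng.standard_normal((126, 65))  # 126 time frames, 64 MFCCs + loudness
print(birnn_layer(X).shape)  # (126, 16)
```

Because each output frame sees both the past (forward chain) and the future (backward chain), the network can exploit temporal dependency in both directions.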
12
• Dataset of musical instruments: NSynth [Engel+, 2017]
– 305,979 four-second signals of various musical instrument sounds with a 16 kHz sampling frequency
– split into 289,205 (95%) training, 12,678 (4%) validation, and 4,096 (1%) test data
• Other conditions
Condition                              Value
Window and shift lengths in STFT       64 ms / 32 ms
Window function in STFT                Hann window
Number of epochs                       10000
Maximum frequency of mel filter bank   8.00 kHz
Minimum frequency of mel filter bank   0.00 kHz
Number of mel filters                  64
Loss function                          Mean squared error
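The STFT settings above imply the following frame geometry. The frame count assumes centered framing, which is one common convention and is not stated on the slide; the variable names are illustrative.

```python
sr = 16000                       # NSynth sampling frequency
n_samples = 4 * sr               # four-second clips
win = int(0.064 * sr)            # 64 ms window -> 1024 samples
hop = int(0.032 * sr)            # 32 ms shift -> 512 samples
n_bins = win // 2 + 1            # 513 frequency bins per frame
n_frames = 1 + n_samples // hop  # 126 frames, assuming centered framing
print(win, hop, n_bins, n_frames)  # 1024 512 513 126
```

These numbers are consistent with the MLP slide's dimensions: 513 × 126 = 64638 output values and (64 + 1) × 126 = 8190 input values.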
13
Results
• The best case (flute)
[Figure: original and predicted amplitude spectrograms (Original, MLP, BiGRU, BiLSTM).]
14
Results
• Typical case (keyboard)
[Figure: original and predicted amplitude spectrograms (Original, MLP, BiGRU, BiLSTM).]
15
Evaluation criteria
• Amplitude relative squared error (ARSE)
– Relative error between the original and predicted amplitude spectrograms
• MFCC relative squared error (MRSE)
– Relative error between the MFCCs of the original and predicted amplitude spectrograms
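A direct reading of these two criteria is a squared error normalized by the reference's energy. The exact normalization in the paper's equations is not reproduced on the slide, so treat this NumPy sketch as an assumption-labeled illustration.

```python
import numpy as np

def arse(Y, Y_hat):
    # relative squared error between original and predicted amplitude spectrograms
    return np.sum((Y - Y_hat)**2) / np.sum(Y**2)

def mrse(M, M_hat):
    # same relative squared error, computed on the two spectrograms' MFCCs
    return np.sum((M - M_hat)**2) / np.sum(M**2)

Y = np.ones((513, 126))
print(arse(Y, Y))        # 0.0 for a perfect prediction
print(arse(Y, 0.9 * Y))  # 1% relative squared error
```

Lower is better for both scores, matching the "Poor/Good" axes on the next slide.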
16
Average of ARSE and MRSE
[Figure: bar charts of the average ARSE (top) and MRSE (bottom) for each instrument (Bass, Brass, Flute, Guitar, Keyboard, Mallet, Organ, Reed, String, Vocal), with the poor/good directions marked; the best case (flute) and the worst case (mallet) are highlighted.]
17
Conclusion
• Motivation
– Solve the problem in the proposed timbre conversion system, which enables timbre conversion and musical instrument sound generation
– Construct a DNN decoder to predict the amplitude spectrogram from the MFCC
• Results
– Using BiLSTM as the DNN decoder, the amplitude spectrogram could be predicted with high accuracy
• Future work
– Train the MFCC with the VAE
– Construct the whole proposed timbre conversion system
Thank you for your attention.


Editor's Notes

  • #2 【0:00】 (Thank you for your introduction.) Hi everyone, I'm Shoya Kawaguchi from the National Institute of Technology, Japan. I am going to talk about this title.
  • #3 【0:10】 In recent years, deep neural networks have been developed for timbre conversion and sound generation. As an example, differentiable digital signal processing (DDSP) utilizes DNNs to synthesize acoustic signals. Variational autoencoders (VAEs) are also utilized to analyze and generate musical instrument sounds. We believe that we can contribute to the development of music culture through these kinds of techniques, namely timbre conversion and instrumental sound generation. Therefore, we aim to propose an accurate timbre conversion system. We expect that the proposed system can generate new instrument sounds, for example, an intermediate sound of piano and guitar. 【1:10】
  • #4 【1:10】 Let me explain the VAE. A VAE is a DNN that has an interpretable low-dimensional space called the latent space. I explain the latent space using this figure. When we train the VAE using images of hand-written numbers, the latent space is obtained like this figure. We can see some clusters, like "7" or "9", in this latent space. Furthermore, we can see new images that look like intermediate numbers between "7" and "9". We are planning to apply this VAE not to hand-written numbers but to the timbres of musical instruments. Therefore, we expect that clusters of piano and guitar timbres are obtained in the latent space. Also, a new timbre that unifies multiple musical instruments can be generated, like an intermediate timbre of piano and guitar. 【2:20】
  • #5 【2:20】 This slide explains the overview of our proposed timbre conversion system based on the VAE. First, this system calculates the amplitude spectrogram from the original wave, and then extracts "pitch," "timbre," and "volume" through the encoder from the amplitude spectrogram. Next, the timbres are input to the VAE. Thus, the VAE trains only the timbres of various musical instruments as the latent space. When we generate a new instrumental sound, the output of the VAE, the generated timbre, is input to the decoder together with the pitch and volume information. Finally, the decoder predicts the amplitude spectrogram of the newly generated sound. When we train the VAE, we use this process flow: the VAE is updated so that the output wave becomes equivalent to the original wave. After the training, the VAE can generate a new timbre feature from random values. Therefore, by using the pitch, generated timbre, and volume, we can generate a new instrumental sound through the decoder. However, there is a problem with this system: there is no linear decoder that can predict the amplitude spectrogram from pitch, timbre, and volume, because the timbre is a dimensionality-reduced feature. To solve this problem, we propose to use a DNN-based decoder that predicts an amplitude spectrogram from these features. This is the scope of this presentation. 【4:30】
Q&A — How is the amplitude spectrogram obtained? We used the short-time Fourier transform (STFT) to calculate the amplitude spectrogram. The STFT is the most widely used method to map a signal from the time domain to the time-frequency domain.
Q&A — How is the wave obtained from the predicted amplitude spectrogram? We used the Griffin-Lim algorithm to recover the phase from the amplitude spectrogram. First, random phase information is assigned to the amplitude spectrogram; then the algorithm iterates the inverse STFT and STFT, updating only the phase information.
Q&A — Why not predict the waveform directly? Predicting the original wave from the three features with a DNN decoder is equivalent to predicting both the amplitude and phase spectrograms of the original wave, and in general the prediction of the phase spectrogram is very difficult. For this reason, we focus only on the prediction of the amplitude spectrogram in this research. Of course, direct prediction of the sound wave is an ideal approach, so we will examine it as future work.
  • #6 【4:30】 This slide shows the details of the DNN decoder. We train a DNN that predicts the original amplitude spectrogram of various instrumental sounds from only the pitch, timbre, and volume features. Therefore, the loss function between the original and predicted amplitude spectrograms is minimized. In particular, we use the mel-frequency cepstrum coefficients (MFCC) as the timbre feature and the loudness as the volume feature. 【5:10】
  • #7 【5:10】 Let me explain the three features we used. The loudness is a volume feature that represents the time-frame-wise energy of the sound. The MFCC is a legacy timbre feature that can be calculated by applying a mel-filter bank and the discrete cosine transform to the amplitude spectrogram. In this research, we normalize the amplitude spectrogram using the loudness before calculating the MFCC. This amplitude spectrogram Y is a piano sound spanning one octave of tones. We apply the loudness normalization and obtain the normalized spectrogram. Finally, we calculate its MFCC by applying the mel-filter bank and DCT. In this case, since the amplitude spectrogram Y consists of tones with the same timbre, the MFCC does not change, like this. 【6:25】
Q&A — Why is the loudness separated? The information in a musical instrument sound comprises pitch, timbre, and volume. We hope that the latent space of the VAE trains the pure timbre; therefore, we disentangle pitch, timbre, and volume from the amplitude spectrogram.
  • #8 【6:25】 As the pitch feature, we utilize sound note numbers such as C3, D♭3, or B5. Since the note number is discrete information, we prepare note-specific DNN decoders by training them on sound datasets of the same pitch. Then, the pitch information is used to select the DNN decoder. The input to the DNN decoder is defined as X, which consists of the MFCC and loudness. Finally, the predicted amplitude spectrogram is obtained like this equation. 【7:10】
Q&A — How is the pitch information obtained? We used the sound note number prepared as a label in NSynth. For example, in the file name "keyboard_acoustic_002_069_025", 002 is the instrument number among all keyboards, 069 is the sound note number, and 025 is the velocity.
  • #9 【7:10】 Let me explain the three types of DNNs that are used as the DNN decoder in this research. The first one is the multilayer perceptron (MLP), which is the most basic type of DNN. We use an MLP with three hidden layers. At the input layer, the input data X is vectorized, and at the final layer, the output vector is reshaped to the predicted amplitude spectrogram. 【7:50】
Q&A — We used ReLU as the activation function.
  • #10 【7:50】 The second one is the bidirectional recurrent neural network (BiRNN). A BiRNN consists of two RNNs with forward and backward directions along the time-frame axis, and trains each unit based on the output of the adjacent unit. For example, in the forward direction, the past unit is connected to the current unit; in the backward direction, the future unit is connected to the current unit. For this reason, a BiRNN can train the network considering the temporal dependency. 【8:35】
  • #11 【8:35】 At the input layer, we input the vectors of each time frame in X to each of the forward and backward units, as in this figure. At the final layer, we multiply the output vectors of the forward and backward directions, and then construct the amplitude spectrogram from the resulting vectors. 【9:05】
  • #12 【9:05】 In this research, we use two types of BiRNNs, namely the gated recurrent unit (BiGRU) and long short-term memory (BiLSTM). We use four RNN layers for both BiGRU and BiLSTM, as shown in this figure. 【9:30】
  • #13 【9:30】 Let's move on to the experiment. We used a dataset of musical instrument sounds provided as NSynth. NSynth is an audio dataset of four-second-long signals of various musical instrument sounds and consists of about 300,000 signals. These signals were split into 95% training, 4% validation, and 1% test data. The other conditions are shown in this table. We used the mean squared error as the loss function for all types of DNNs. 【10:20】
Q&A — Why NSynth? As a condition of the BiRNN, only fixed-length data can be used as input. In addition, the VAE needs a variety of musical instrument sounds for training. Therefore, we used NSynth, which satisfies both conditions.
  • #14 【10:20】 The first example is a flute sound. These figures show the original and predicted amplitude spectrograms for each DNN. Please note that this result is the best case of the prediction. In the case of the MLP, there are many spectral holes, which produce harmful artificial distortion. In contrast, BiGRU and BiLSTM accurately predicted the original amplitude spectrogram. In particular, we can confirm that the harmonic structures of the flute sound are precisely synthesized. Let's listen to the sounds of these spectrograms, in the order original, BiGRU, and BiLSTM. As you can hear, the BiLSTM achieves the best performance in this case. 【11:50】
Q&A — Why could the flute be predicted? The flute sound has simple harmonic structures and volume attenuations, so it was presumably easy for the DNNs to predict the amplitude spectrogram.
Q&A — Why did the MLP fail? We consider the reason to be the use of ReLU as the activation function of the MLP: its outputs contain many zero elements, which increase the spectral holes.
  • #15 【11:50】 The next example is a keyboard sound. This is a typical case among the test data. The keyboard sound has a more complex time-frequency structure compared with the flute sound. However, similar to the previous result, BiGRU and BiLSTM can predict the original amplitude spectrogram with high accuracy. Again, I will play these sounds. You can hear that BiGRU includes slight artificial distortion compared with BiLSTM. 【12:45】
Q&A — Why could the keyboard be predicted? We consider this is due to the loudness used as an input, because the loudness is a feature with volume information.
  • #16 【12:45】 To evaluate the performance of the prediction, we used two objective scores: the amplitude relative squared error (ARSE) and the MFCC relative squared error (MRSE). ARSE and MRSE are defined by these equations, where ARSE is the squared error between the original and predicted amplitude spectrograms, and MRSE is the squared error between the original MFCC and the MFCC of the predicted amplitude spectrogram. 【13:30】
  • #17 【13:30】 These graphs show the ARSE and MRSE averaged over each instrument. The horizontal axis shows the musical instruments, and the vertical axis shows the scores. Each bar corresponds to MLP, BiGRU, or BiLSTM. As we already heard, the flute sounds achieved the best ARSE score. On the other hand, the mallet sounds had the lowest score in this experiment. From all the results, we can see that BiLSTM always outperforms the other DNNs in both scores. 【14:25】
  • #18 【14:25】 This is the conclusion. That's all. Thank you for your attention. 【14:30】
  • #19 【8:05】 Let us now describe the experimental conditions. First, we explain the structure of the DNNs. The MLP consists of three hidden layers.
  • #20 【12:20】 The final example is a mallet sound, and this is the worst case among the test data. In these results, none of the DNNs could predict the original amplitude spectrogram accurately. Please listen to these sounds. The reason for this prediction failure is likely that the mallet is often characterized as a percussion instrument, which has complex inharmonic structures and time transitions, whereas the other instruments, for example the flute and keyboard, have simple harmonic structures and time transitions.
Q&A — The mallet sound has complex inharmonic structures, so it was presumably difficult for the DNNs to predict the amplitude spectrogram.