5. Presenter!
Musical Word Embedding: Bridging the Gap between Listening Contexts and Music
Seungheon Doh, Jongpil Lee, Tae Hong Park, and Juhan Nam
Machine Learning for Media Discovery Workshop, International Conference on Machine Learning (ICML), 2020
8. Focus of this talk!
• We survey audio deep learning tasks in general.
• We look at the Torchaudio module, which helps solve these tasks.
• We do not cover detailed reviews of individual papers.
9. Audio Deep Learning Task?
Sound, Speech, Music
• Sound
  • Sound Classification & Auto-tagging (Acoustic Scene / Event Identification)
• Speech
  • Speech Recognition (ASR, STT)
  • Speech Synthesis (TTS)
  • Speech Style Transfer (STS)
• Music
  • Music Generation
  • Source Separation
  • Singing Voice Synthesis
  • Instrument Transcription
  • Music Classification & Auto-tagging
  • Music Recommendation…
10. Audio Deep Learning Task?
With Smart Speaker!
• Sound
  • Sound Classification - Activation
• Speech
  • Speech Recognition (STT) - Input User Interface
  • Speech Synthesis (TTS) - Output of Interaction
11. Audio Deep Learning Task?
Speech Application?
https://www.youtube.com/watch?v=klnfWhPGPRs&ab_channel=naverd2
https://www.youtube.com/watch?v=aqoXFCCTfm4&ab_channel=Apple
12. Audio Deep Learning Task?
Music Application?
Dhariwal, Prafulla, et al. "Jukebox: A Generative Model for Music." 2020
Juheon Lee, Hyeong-Seok Choi, et al. "Adversarially Trained End-to-End Korean Singing Voice Synthesis System." Proceedings of Interspeech, 2019 (Best Student Paper Award)
14. Audio Deep Learning Task?
Music Auto-tagging
Nam, Juhan, et al. "Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach." IEEE Signal Processing Magazine, 2018
[Pipeline diagram: Data Loader → (Wav, Label) → Input Feature → Model → Label Score → Cross Entropy Loss against the Label]
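The pipeline above ends with a label score compared to the label via cross entropy. A minimal pure-Python sketch of that final step (the scores and target index are made-up illustrative values; real code would use a PyTorch loss on tensors):

```python
import math

def cross_entropy(label_scores, target_index):
    """Softmax the raw label scores, then return the negative
    log-probability of the correct label (cross-entropy loss)."""
    m = max(label_scores)                       # for numerical stability
    exps = [math.exp(s - m) for s in label_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_index])

# Hypothetical model output for 3 tags; the correct tag is index 0,
# so a high score there yields a small loss.
loss = cross_entropy([2.0, 0.5, -1.0], 0)
```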
15. Audio Deep Learning Task?
Speech Recognition
[Pipeline diagram: Data Loader → (Wav, Text) → Input Feature → attention-based model (Attention!) → Label Score → Cross Entropy Loss against the Label; at inference, a Greedy Decoder emits text, e.g. "오늘 날씨가 어때?" (How's the weather today?), evaluated by Edit Distance (CER)]
Listen, Attend and Spell
https://arxiv.org/abs/1508.01211
https://github.com/clovaai/ClovaCall/tree/master/las.pytorch
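Recognition quality in the diagram is scored by edit distance, normalized into a character error rate (CER). A self-contained sketch of that metric:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings via a one-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```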
16. Audio Deep Learning Task?
Speech Synthesis
[Pipeline diagram: Data Loader → (Wav, Text) → Input Feature → Encoder & Attention → Decoder → Mel Output (Mel Loss against target Mel) → Post Net → Linear Output (Linear Loss)]
Tacotron: Towards End-to-End Speech Synthesis:
https://arxiv.org/abs/1703.10135
https://github.com/r9y9/tacotron_pytorch
17. Audio Deep Learning Task?
Source Separation
Music Source Separation in the Waveform Domain
https://github.com/facebookresearch/demucs
[Pipeline diagram: Data Loader → Wav (Mixed Audio) → Model → Separated Audio (one Wav per source) → Reconstruction Loss against the Label Wavs]
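The separation model is trained to reconstruct each source waveform. A minimal sketch of the two usual reconstruction losses, with plain lists of samples standing in for waveform tensors:

```python
def l1_loss(pred, target):
    """Average absolute error between two equal-length waveforms."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def l2_loss(pred, target):
    """Average squared error between two equal-length waveforms."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```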
18. Audio Input?
[Recap: the Auto-tagging, Speech Recognition, and Speech Synthesis pipeline diagrams shown side by side — each task begins by turning a Wav into an Input Feature]
19. Audio Input?
[The same three pipeline diagrams, with the Input Feature stage highlighted]
Waveform or Spectrogram!
Sampling Rate, STFT
24. torchaudio.load
I/O Functionality
sox: Default; can raise errors for formats other than 16-bit signed integer.
sox_io: Recommended, but not supported on Windows…
soundfile: Requires installing PySoundFile separately.
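A minimal loading sketch, assuming the torchaudio 0.x-era API (set_audio_backend was deprecated in later releases in favor of per-call backend selection); the file path in the comment is hypothetical:

```python
def load_audio(path, backend="sox_io"):
    """Load an audio file as a (waveform tensor, sample_rate) pair
    using the chosen torchaudio I/O backend."""
    import torchaudio  # imported lazily so the backend choice stays local
    torchaudio.set_audio_backend(backend)  # "sox", "sox_io", or "soundfile"
    waveform, sample_rate = torchaudio.load(path)
    return waveform, sample_rate

# waveform, sr = load_audio("some_clip.flac")
```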
25. torchaudio.load, torchaudio.save
I/O Functionality
When loading and saving data (test file: train-clean-100/19/198/19-198-0001.flac, 225 kB),
performance is similar across backends…
sox_io seems to perform a bit better at saving!
[Bar chart: loader_mean and saver_mean times, roughly 0–0.007 s, for sox, sox_io, and soundfile]
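A comparison like the one charted above can be reproduced with a small timing harness; `loader_fn` is any zero-argument callable, e.g. a wrapper around torchaudio.load with a fixed backend and path:

```python
import time

def mean_time(loader_fn, repeats=100):
    """Run loader_fn `repeats` times and return the mean wall-clock
    seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        loader_fn()
    return (time.perf_counter() - start) / repeats
```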
27. torchaudio.transform
[Recap: the three pipeline diagrams again — torchaudio.transforms is what produces the Input Feature from the Wav]
35. torchaudio.transform
Transform
Resampling
Used for computational efficiency when the exact sample rate is not that important. For speech, a sample rate of just 8000 Hz is already perceptually intelligible.
mu-law encoding
The human ear responds logarithmically to sound amplitude. Mu-law gives small values high resolution and large values low resolution.
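The mu-law companding curve described above can be written in a few lines. This sketch shows only the continuous compression step; torchaudio's MuLawEncoding transform additionally quantizes the result to integer levels:

```python
import math

def mu_law_encode(x, mu=255):
    """Mu-law companding for x in [-1, 1]: log-compresses amplitude,
    so small values keep high resolution while large values share
    low resolution -- matching the ear's logarithmic response."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
```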
36. torchaudio.Dataset
Dataset
All datasets are subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers.
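DataLoader only requires the two methods named above. A minimal sketch of that map-style protocol, with an in-memory list of hypothetical (waveform, label) pairs standing in for real audio files (no torch dependency — the duck-typed interface is the point):

```python
class ToyAudioDataset:
    """Minimal Dataset-style class: __len__ and __getitem__ are all a
    map-style torch.utils.data.DataLoader requires."""

    def __init__(self, items):
        self.items = items  # e.g. a list of (waveform, label) pairs

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

ds = ToyAudioDataset([("wav0", 0), ("wav1", 1)])
```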
52. CPU vs GPU
https://github.com/KinWaiCheuk/nnAudio
K. W. Cheuk, H. Anderson, K. Agres and D. Herremans,
"nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks," in IEEE Access, doi: 10.1109/ACCESS.2020.3019084.
55. Take Home Message
• Torchaudio is great because it operates on tensors — you can send data to the GPU at any time with .to(device)!
• The best practice depends on your CPU and GPU resources, but with a 1080 Ti or better, it seems preferable to run the STFT and similar computations on the GPU after loading the waveform.
(With a little more digging, you can even load data directly onto the GPU.)
• Of course, if you have plenty of disk space, keeping features as .npy files makes experiments convenient.
But will you really have the resources to design experiments that re-extract features for every hyperparameter setting?
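The take-home in code, assuming torchaudio's transforms API: build the spectrogram transform once, move it to the GPU, and feed it waveforms already on the device. Transforms are nn.Modules, so .to(device) works on them just like on models:

```python
def make_gpu_melspec(device="cuda", sample_rate=16000):
    """Return a MelSpectrogram transform placed on the given device,
    so the STFT and mel projection run on the GPU."""
    import torchaudio  # imported lazily; requires torchaudio installed
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
    return melspec.to(device)

# melspec = make_gpu_melspec()
# features = melspec(waveform.to("cuda"))  # feature extraction on the GPU
```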
As the encoder, the Listener maps the speech signal to high-level features.
The Listener stacks three BLSTMs in a pyramidal structure.
The Speller is the decoder, serving as an attention-based LSTM transducer.
It maximizes the log probability of the correct output sequence given the input speech.
The speech input is a Mel-spectrogram with the number of mels set to 40, with features extracted every 10 ms (40-dimensional log-mel filter bank).
CBHG = Convolution Bank + Highway + GRU.
First, compute the L1 distance between the decoder's Mel-spectrogram output and the target Mel-spectrogram (the Mel Loss).
Likewise, compute the L1 distance between the linear-spectrogram output and the target linear-spectrogram; this is defined as the Linear Loss.
The input text and target audio must be preprocessed ahead of time so that the Mel-spectrograms and linear-spectrograms are ready.
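The two losses described above are plain L1 distances summed into one objective. A sketch with flat lists of values standing in for spectrogram tensors:

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def tacotron_loss(mel_out, mel_target, linear_out, linear_target):
    """Total loss = Mel Loss + Linear Loss (both L1 distances)."""
    return l1(mel_out, mel_target) + l1(linear_out, linear_target)
```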
For the reconstruction loss L(g_s(x; θ), x_s) in equation 1, we use either the average mean squared error or the average absolute error between waveforms. For a waveform x_s containing T samples and corresponding to source s, a predicted waveform x̂_s, and a subscript t denoting the t-th sample of a waveform, we use one of L1 or L2:

L1(x̂_s, x_s) = (1/T) Σ_t |x̂_{s,t} − x_{s,t}|
L2(x̂_s, x_s) = (1/T) Σ_t (x̂_{s,t} − x_{s,t})²
Create a Dataset for LibriSpeech. Each item is a tuple of the form:
(waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id)
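Assuming torchaudio's built-in dataset class, construction looks like this (the root path is a hypothetical choice; download=True fetches the split on first use):

```python
def load_librispeech(root="./data", url="train-clean-100"):
    """Build the LibriSpeech dataset; each item unpacks to
    (waveform, sample_rate, utterance, speaker_id, chapter_id,
    utterance_id)."""
    import torchaudio  # imported lazily; requires torchaudio installed
    return torchaudio.datasets.LIBRISPEECH(root, url=url, download=True)

# ds = load_librispeech()
# waveform, sr, text, spk, chap, utt = ds[0]
```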