5. Presenter!
Musical Word Embedding: Bridging the Gap between Listening Contexts and Music
Seungheon Doh, Jongpil Lee, Tae Hong Park, and Juhan Nam
Machine Learning for Media Discovery Workshop, International Conference on Machine Learning (ICML), 2020
8. Focus of this talk!
• We survey audio deep learning tasks in general.
• We look at the Torchaudio module, which helps solve these tasks.
• We do not cover detailed reviews of individual papers.
9. Audio Deep Learning Task?
Sound, Speech, Music
• Sound
  • Sound Classification & Auto-tagging (Acoustic Scene / Event Identification)
• Speech
  • Speech Recognition (ASR, STT)
  • Speech Synthesis (TTS)
  • Speech Style Transfer (STS)
• Music
  • Music Generation
  • Source Separation
  • Singing Voice Synthesis
  • Instrument Transcription
  • Music Classification & Auto-tagging
  • Music Recommendation…
10. Audio Deep Learning Task?
With Smart Speaker!
• Sound
  • Sound Classification - Activation
• Speech
  • Speech Recognition (STT) - Input User Interface
  • Speech Synthesis (TTS) - Output of Interaction
11. Audio Deep Learning Task?
Speech Application?
https://www.youtube.com/watch?v=klnfWhPGPRs&ab_channel=naverd2
https://www.youtube.com/watch?v=aqoXFCCTfm4&ab_channel=Apple
12. Audio Deep Learning Task?
Music Application?
Dhariwal, Prafulla, et al. "Jukebox: A Generative Model for Music." 2020
Juheon Lee, Hyeong-Seok Choi, et al. "Adversarially Trained End-to-End Korean Singing Voice Synthesis System." Proceedings of Interspeech, 2019 (Best Student Paper Award)
14. Audio Deep Learning Task?
Music Auto-tagging
Nam, Juhan, et al. "Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach." IEEE Signal Processing Magazine, 2018
[Pipeline diagram: Data Loader → (Wav, Label) → Input Feature → Model → Label Score → Cross Entropy Loss against the Label]
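The pipeline above ends with a label score compared to the label via cross entropy. A minimal pure-Python sketch of that final step (the scores and target index are made-up illustrative values; real code would use a PyTorch loss on tensors):

```python
import math

def cross_entropy(label_scores, target_index):
    """Softmax the raw label scores, then return the negative
    log-probability of the correct label (cross-entropy loss)."""
    m = max(label_scores)                       # for numerical stability
    exps = [math.exp(s - m) for s in label_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_index])

# Hypothetical model output for 3 tags; the correct tag is index 0,
# so a high score there yields a small loss.
loss = cross_entropy([2.0, 0.5, -1.0], 0)
```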
15. Audio Deep Learning Task?
Speech Recognition
[Pipeline diagram: Data Loader → (Wav, Text) → Input Feature → attention-based model (Attention!) → Label Score → Cross Entropy Loss against the Label; at inference, a Greedy Decoder emits text, e.g. "오늘 날씨가 어때?" (How's the weather today?), evaluated by Edit Distance (CER)]
Listen, Attend and Spell
https://arxiv.org/abs/1508.01211
https://github.com/clovaai/ClovaCall/tree/master/las.pytorch
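Recognition quality in the diagram is scored by edit distance, normalized into a character error rate (CER). A self-contained sketch of that metric:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings via a one-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```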
16. Audio Deep Learning Task?
Speech Synthesis
[Pipeline diagram: Data Loader → (Wav, Text) → Input Feature → Encoder & Attention → Decoder → Mel Output (Mel Loss against target Mel) → Post Net → Linear Output (Linear Loss)]
Tacotron: Towards End-to-End Speech Synthesis:
https://arxiv.org/abs/1703.10135
https://github.com/r9y9/tacotron_pytorch
17. Audio Deep Learning Task?
Source Separation
Music Source Separation in the Waveform Domain
https://github.com/facebookresearch/demucs
[Pipeline diagram: Data Loader → Wav (Mixed Audio) → Model → Separated Audio (one Wav per source) → Reconstruction Loss against the Label Wavs]
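The separation model is trained to reconstruct each source waveform. A minimal sketch of the two usual reconstruction losses, with plain lists of samples standing in for waveform tensors:

```python
def l1_loss(pred, target):
    """Average absolute error between two equal-length waveforms."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def l2_loss(pred, target):
    """Average squared error between two equal-length waveforms."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```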
18. Audio Input?
[Recap: the Auto-tagging, Speech Recognition, and Speech Synthesis pipeline diagrams shown side by side — each task begins by turning a Wav into an Input Feature]
19. Audio Input?
[The same three pipeline diagrams, with the Input Feature stage highlighted]
Waveform or Spectrogram!
Sampling Rate, STFT
24. torchaudio.load
I/O Functionality
sox: Default; can raise errors for formats other than 16-bit signed integer.
sox_io: Recommended, but not supported on Windows…
soundfile: Requires installing PySoundFile separately.
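A minimal loading sketch, assuming the torchaudio 0.x-era API (set_audio_backend was deprecated in later releases in favor of per-call backend selection); the file path in the comment is hypothetical:

```python
def load_audio(path, backend="sox_io"):
    """Load an audio file as a (waveform tensor, sample_rate) pair
    using the chosen torchaudio I/O backend."""
    import torchaudio  # imported lazily so the backend choice stays local
    torchaudio.set_audio_backend(backend)  # "sox", "sox_io", or "soundfile"
    waveform, sample_rate = torchaudio.load(path)
    return waveform, sample_rate

# waveform, sr = load_audio("some_clip.flac")
```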
25. torchaudio.load, torchaudio.save
I/O Functionality
When loading and saving data (test file: train-clean-100/19/198/19-198-0001.flac, 225 kB),
performance is similar across backends…
sox_io seems to perform a bit better at saving!
[Bar chart: loader_mean and saver_mean times, roughly 0–0.007 s, for sox, sox_io, and soundfile]
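A comparison like the one charted above can be reproduced with a small timing harness; `loader_fn` is any zero-argument callable, e.g. a wrapper around torchaudio.load with a fixed backend and path:

```python
import time

def mean_time(loader_fn, repeats=100):
    """Run loader_fn `repeats` times and return the mean wall-clock
    seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        loader_fn()
    return (time.perf_counter() - start) / repeats
```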
27. torchaudio.transform
[Recap: the three pipeline diagrams again — torchaudio.transforms is what produces the Input Feature from the Wav]
35. torchaudio.transform
Transform
Resampling
Used for computational efficiency when the exact sample rate is not that important. For speech, a sample rate of just 8000 Hz is already perceptually intelligible.
mu-law encoding
The human ear responds logarithmically to sound amplitude. Mu-law gives small values high resolution and large values low resolution.
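The mu-law companding curve described above can be written in a few lines. This sketch shows only the continuous compression step; torchaudio's MuLawEncoding transform additionally quantizes the result to integer levels:

```python
import math

def mu_law_encode(x, mu=255):
    """Mu-law companding for x in [-1, 1]: log-compresses amplitude,
    so small values keep high resolution while large values share
    low resolution -- matching the ear's logarithmic response."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
```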
36. torchaudio.Dataset
Dataset
All datasets are subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers.
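DataLoader only requires the two methods named above. A minimal sketch of that map-style protocol, with an in-memory list of hypothetical (waveform, label) pairs standing in for real audio files (no torch dependency — the duck-typed interface is the point):

```python
class ToyAudioDataset:
    """Minimal Dataset-style class: __len__ and __getitem__ are all a
    map-style torch.utils.data.DataLoader requires."""

    def __init__(self, items):
        self.items = items  # e.g. a list of (waveform, label) pairs

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

ds = ToyAudioDataset([("wav0", 0), ("wav1", 1)])
```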
52. CPU vs GPU
https://github.com/KinWaiCheuk/nnAudio
K. W. Cheuk, H. Anderson, K. Agres and D. Herremans,
"nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks," in IEEE Access, doi: 10.1109/ACCESS.2020.3019084.
55. Take Home Message
• Torchaudio is great because it operates on tensors — you can send data to the GPU at any time with .to(device)!
• The best practice depends on your CPU and GPU resources, but with a 1080 Ti or better, it seems preferable to run the STFT and similar computations on the GPU after loading the waveform.
(With a little more digging, you can even load data directly onto the GPU.)
• Of course, if you have plenty of disk space, keeping features as .npy files makes experiments convenient.
But will you really have the resources to design experiments that re-extract features for every hyperparameter setting?
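The take-home in code, assuming torchaudio's transforms API: build the spectrogram transform once, move it to the GPU, and feed it waveforms already on the device. Transforms are nn.Modules, so .to(device) works on them just like on models:

```python
def make_gpu_melspec(device="cuda", sample_rate=16000):
    """Return a MelSpectrogram transform placed on the given device,
    so the STFT and mel projection run on the GPU."""
    import torchaudio  # imported lazily; requires torchaudio installed
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
    return melspec.to(device)

# melspec = make_gpu_melspec()
# features = melspec(waveform.to("cuda"))  # feature extraction on the GPU
```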
As the encoder, the Listener maps the speech signal to high-level features.
The Listener stacks three BLSTMs in a pyramidal structure.
The Speller is the decoder, serving as an attention-based LSTM transducer.
It maximizes the log probability of the correct output sequence given the input speech.
The speech input is a Mel-spectrogram with the number of mels set to 40, with features extracted every 10 ms (40-dimensional log-mel filter bank).
CBHG = Convolution Bank + Highway + GRU.
First, compute the L1 distance between the decoder's Mel-spectrogram output and the target Mel-spectrogram (the Mel Loss).
Likewise, compute the L1 distance between the linear-spectrogram output and the target linear-spectrogram; this is defined as the Linear Loss.
The input text and target audio must be preprocessed ahead of time so that the Mel-spectrograms and linear-spectrograms are ready.
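The two losses described above are plain L1 distances summed into one objective. A sketch with flat lists of values standing in for spectrogram tensors:

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def tacotron_loss(mel_out, mel_target, linear_out, linear_target):
    """Total loss = Mel Loss + Linear Loss (both L1 distances)."""
    return l1(mel_out, mel_target) + l1(linear_out, linear_target)
```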
For the reconstruction loss L(g_s(x; θ), x_s) in equation 1, we use either the average mean squared error or the average absolute error between waveforms. For a waveform x_s containing T samples and corresponding to source s, a predicted waveform x̂_s, and a subscript t denoting the t-th sample of a waveform, we use one of L1 or L2:

L1(x̂_s, x_s) = (1/T) Σ_t |x̂_{s,t} − x_{s,t}|
L2(x̂_s, x_s) = (1/T) Σ_t (x̂_{s,t} − x_{s,t})²
Create a Dataset for LibriSpeech. Each item is a tuple of the form:
(waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id)
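Assuming torchaudio's built-in dataset class, construction looks like this (the root path is a hypothetical choice; download=True fetches the split on first use):

```python
def load_librispeech(root="./data", url="train-clean-100"):
    """Build the LibriSpeech dataset; each item unpacks to
    (waveform, sample_rate, utterance, speaker_id, chapter_id,
    utterance_id)."""
    import torchaudio  # imported lazily; requires torchaudio installed
    return torchaudio.datasets.LIBRISPEECH(root, url=url, download=True)

# ds = load_librispeech()
# waveform, sr, text, spk, chap, utt = ds[0]
```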