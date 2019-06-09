
  1. Deep Learning with Audio Signals: Prepare, Process, Design, Expect • Keunwoo Choi
  2. Keunwoo Choi • QMUL, UK • ETRI, S. Korea • SNU, S. Korea • @keunwoochoi (Twitter, GitHub) • Research Scientist
  3. WARNING: THIS MATERIAL IS WRITTEN FOR ATTENDEES AT QCON.AI, NAMELY, SOFTWARE ENGINEERS AND DEEP LEARNING PRACTITIONERS, TO PROVIDE AN OFF-THE-SHELF GUIDE. MY ADVICE MIGHT NOT BE THE FINAL SOLUTION FOR YOUR PROBLEM, BUT IT SHOULD BE A GOOD STARTING POINT. ..ALSO, THERE'S NO SPOTIFY SECRET HERE :P
  4. Content • Prepare the dataset • Pre-process the signal • Design your network • Expect the result
  5. Prepare the datasets (or, know your data) • Q. How do you start an audio task?
  6. LMGTFY • Google them, of course • But....
  7. Audio dataset • Lucky → exactly the same class(es), many of them, yay! • Meh → same or similar classes, sounds alright.. • Ugh.. → there are 2 on freesound.org and 3 on YouTube
  8. Audio (or, sound) dataset • Our algorithm lives in the digital space • So do the .wav files • But the sound is in the real world (figure label: "Our lovely cyberspace")
  9. Audio dataset: Source, Noise, Reverberation, Microphone • Room reverberation image from https://johnlsayers.com/Recmanual/Pages/Reverb.htm
  10. Audio dataset • Dear everyone, YOU ARE ALWAYS IN THE "UGH..." SITUATION → HOW TO BUILD A CORRECT AUDIO DATASET?
  11. What we can do • Know your real situation (noisy environment, cheap mic) • You can mimic noise/reverberation/mic if you have clean/dry/high-quality source signals • DL models are robust only within the variance they've seen. → Good at interpolation.. only. E.g., a model trained with clean signals probably can't deal with noisy signals
  12. Simulate the real world • clean signal + noise signal → noisy signal • dry signal convolved with room impulse response → wet signal • original signal → band-pass filter → recorded signal
  13. What to Google • Noise: babble noise recording, home noise recording, cafe noise recording, street noise recording, white noise, brown noise; x_noise = x + alpha * noise • Reverberation (maybe skip it): room impulse responses (RIR), reverberation simulators; x_wet = np.convolve(x, rir) • Microphone: band-pass filter, scipy.signal filtering, microphone specification, speaker specification, microphone frequency response, scipy.signal.convolve, scipy.signal.fftconvolve; or trimming off your spectrograms
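The three recipes on this slide can be strung together in a few lines. What follows is only a minimal sketch, assuming NumPy and SciPy: the function names, the toy sine source, the synthetic noise, the decaying-noise "RIR" and the 100 to 8000 Hz pass band are all placeholders invented for illustration, and `alpha_for_snr` is a hypothetical helper (the slide only says "alpha"). In a real project the noise clip and the impulse response would come from recordings.

```python
import numpy as np
from scipy import signal


def add_noise(x, noise, alpha):
    """x_noise = x + alpha * noise, with the noise looped or cut to len(x)."""
    reps = int(np.ceil(len(x) / len(noise)))
    noise = np.tile(noise, reps)[: len(x)]
    return x + alpha * noise


def alpha_for_snr(x, noise, snr_db):
    """Hypothetical helper: gain that yields a target signal-to-noise ratio in dB."""
    p_x = np.mean(x ** 2)
    p_n = np.mean(noise ** 2)
    return np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))


def add_reverb(x, rir):
    """x_wet = x convolved with a room impulse response, cut back to len(x)."""
    return signal.fftconvolve(x, rir, mode="full")[: len(x)]


def simulate_mic(x, sr, low_hz, high_hz, order=4):
    """Band-pass filter as a crude stand-in for a microphone's frequency response."""
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return signal.sosfilt(sos, x)


sr = 44100
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # placeholder for a clean, dry source
noise = rng.standard_normal(sr // 2)          # placeholder for a recorded noise clip
# Toy impulse response: exponentially decaying noise, NOT a measured room response.
rir = np.exp(-np.arange(sr // 4) / (0.05 * sr)) * rng.standard_normal(sr // 4)
rir /= np.abs(rir).max()

alpha = alpha_for_snr(clean, noise, snr_db=10.0)
wet = add_reverb(clean, rir)                      # dry -> wet
noisy = add_noise(wet, noise, alpha)              # + background noise
recorded = simulate_mic(noisy, sr, 100.0, 8000.0) # what a cheap mic might keep
```

Picking `alpha` from a target SNR is one design choice among many; randomising it per training example is a common way to widen the variance the model sees.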
  14. Pre-process the signals (or, log(melgram)) • Q. What to do after loading the signals?
  15. Digital Audio 101 • 1 second of digital audio: size=(44100,), dtype=int16 • MNIST: (28, 28, 1), uint8; CIFAR10: (32, 32, 3), uint8; ImageNet: (256, 256, 3), uint8 • Audio: Lots of data points in one item!
  16. Audio representations (data shape and size for, e.g., 1 second at sampling rate = 44100)
      Type | Description | Shape x [dtype]
      Waveform | x | 44100 x [int16]
      Spectrograms | STFT(x) | 513 x 87 x [float32]
      Spectrograms | Melspectrogram(x) | 128 x 87 x [float32]
      Spectrograms | CQT(x) | 72 x 87 x [float32]
      Features | MFCC(x) = some process on STFT(x) | 20 x 87 x [float32]
      Spoiler: log10(Melspectrograms) for the win, but let's see some details
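The shapes in this table can be reproduced with plain arithmetic. The sketch below assumes an FFT size of 1024 and a hop of 512 samples with centre-padded frames; the slide does not state these values, but they are consistent with the 513 x 87 figure. The band counts (128, 72, 20) are taken from the table as given.

```python
import math

sr = 44100           # samples per second
n_samples = sr * 1   # 1 second of audio
n_fft = 1024         # assumed FFT size (not stated on the slide)
hop = 512            # assumed hop size (not stated on the slide)

n_freq_bins = n_fft // 2 + 1      # non-redundant bins of a real-valued FFT
n_frames = 1 + n_samples // hop   # frame count when the signal is centre-padded

shapes = {
    "waveform": (n_samples,),
    "stft": (n_freq_bins, n_frames),
    "melspectrogram": (128, n_frames),
    "cqt": (72, n_frames),
    "mfcc": (20, n_frames),
}

# Number of values per representation, to compare how much each one shrinks the input.
sizes = {name: math.prod(shape) for name, shape in shapes.items()}
for name, shape in shapes.items():
    print(f"{name:15s} shape={shape} values={sizes[name]}")
```

Under these assumptions the STFT holds about as many values as the raw waveform, while the melspectrogram holds roughly a quarter of that.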
  17. Spectrograms • 2-dimensional (time × frequency) representation of an audio signal [TODO: IMAGE]
  18. Practitioner's choice • Rule of thumb: DISCARD ALL THE REDUNDANCY • Sample rate, or bandwidth • Goal: To optimize the input audio data for your model • by resampling - can be computationally heavier • by discarding some freq bands - can be storage heavy • https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
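Both ways of discarding bandwidth fit in a few lines. This is a sketch under assumptions, not a recommendation: the 22050 Hz target rate, the 8000 Hz cutoff, the STFT settings and the 1 kHz test tone are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy import signal

sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000.0 * t)   # placeholder signal: 1 second of a 1 kHz tone

# Option 1: resample. Extra computation up front, but less data to store afterwards.
x_low = signal.resample_poly(x, up=1, down=2)   # 44100 Hz -> 22050 Hz

# Option 2: keep the sample rate and drop the frequency bins you don't need.
# The stored audio stays full-size; only the spectrogram fed to the model shrinks.
n_fft, hop = 1024, 512
freqs, _, Z = signal.stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
cutoff_hz = 8000.0
keep = freqs <= cutoff_hz
spec_cropped = np.abs(Z)[keep]

print(x.shape, "->", x_low.shape)
print(Z.shape, "->", spec_cropped.shape)
```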
  19. Practitioner's choice • Melspectrogram - in decibel scale - which only covers the frequency range you're interested in • Why? - smaller, therefore easier and faster training - perceptual: weighing more on the freq region where humans are more interested - faster than CQT to compute - decibel scale: another perceptually motivated choice • Q. Ok, how can I compute them?
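To make the recipe concrete, here is a rough sketch of what a decibel-scaled melspectrogram over a restricted frequency range involves, written with NumPy and SciPy only. It is an illustration under assumptions, not how any particular library implements it: the mel formula is the common 2595 * log10(1 + f / 700) variant, the filters are plain unnormalised triangles, and the 50 to 8000 Hz range, 64 bands and 440 Hz test tone are arbitrary. Because SciPy pads the signal differently, the frame count can differ by one from the 87 in the table above.

```python
import numpy as np
from scipy import signal


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular filters on a mel-spaced grid, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_melspectrogram(x, sr, n_fft=1024, hop=512, n_mels=128, fmin=0.0, fmax=None):
    """Power STFT -> mel bands between fmin and fmax -> decibels."""
    fmax = sr / 2.0 if fmax is None else fmax
    _, _, Z = signal.stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Z) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels, fmin, fmax) @ power
    return 10.0 * np.log10(np.maximum(mel, 1e-10))   # floor avoids log(0)


sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)   # placeholder input: 1 second of a 440 Hz tone
S = log_melspectrogram(x, sr, n_mels=64, fmin=50.0, fmax=8000.0)
print(S.shape)   # (mel bands, frames)
```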
  20. import librosa / import madmom • Python libraries - librosa/madmom/scipy/.. • Computations on CPU • Best when all the processing will be done before the training
  21. import kapre (Disclaimer: I'm the maintainer) • Keras Audio Preprocessing layers • CPU and GPU • Best when you want to do things on the fly/GPU = Best to optimize audio-related parameters • pip install kapre • There's also pytorch-audio (torchaudio)!
  22. Design your network (or, know the assumptions) • Q. What kind of network structure do I need?
  23. A dumb-but-strong-therefore-good-while-annoying-since-it's-from-computer-vision baseline approach • Trim the signals properly (e.g. 1-sec) • Do the classification with a 2D convnet, 3x3 kernel (aka VGGNet) • Raise $1B • Retire • Post "why i retired.." on Medium • Happy life!
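The "trim the signals properly" step of this baseline might look like the sketch below. The helper names are made up for illustration, and zero-padding the last segment is just one possible choice (dropping it or overlapping segments are equally reasonable).

```python
import numpy as np


def fix_length(x, n_samples):
    """Cut or zero-pad a 1-D signal to exactly n_samples."""
    if len(x) >= n_samples:
        return x[:n_samples]
    return np.pad(x, (0, n_samples - len(x)))


def split_into_segments(x, n_samples):
    """Split into non-overlapping segments, zero-padding the last one."""
    n_segments = int(np.ceil(len(x) / n_samples))
    padded = fix_length(x, n_segments * n_samples)
    return padded.reshape(n_segments, n_samples)


sr = 44100
x = np.random.default_rng(0).standard_normal(int(2.5 * sr))   # placeholder: 2.5 s of audio
segments = split_into_segments(x, sr)                          # one row per 1-second item
print(segments.shape)
```

Each row can then be turned into a spectrogram of identical shape, which is what a fixed-input 2D convnet expects.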
  24. Go even dumber • Just download some pre-trained networks for.. - music - audio - image (?) • Re-use it for your task (aka transfer learning) • $1B - retire - Medium - happy - repeat
  25. Better and stronger, by understanding assumptions • assert "Receptive field" size == size of the target pattern • How sparse is the target pattern? - Bird singing: sparse? - Voice-in-music: sparse? - Distortion-guitar-in-Metallica: sparse?
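To check the receptive-field assertion for a given design, the size can be computed layer by layer along one axis. The sketch below assumes a hypothetical VGG-style stack (three blocks of two 3x3 convolutions followed by 2x2 pooling) and a hop of 512 samples at 44100 Hz; none of these numbers come from the slides.

```python
def receptive_field(layers):
    """Receptive field along one axis for a list of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) input steps
        jump *= stride              # strides make later layers step further in the input
    return rf


conv, pool = (3, 1), (2, 2)
layers = [conv, conv, pool] * 3      # hypothetical VGG-style stack

rf_frames = receptive_field(layers)  # in spectrogram frames (or mel bins on the other axis)
sr, hop = 44100, 512                 # assumed STFT settings
rf_seconds = rf_frames * hop / sr    # approximate duration one output unit can "see"
print(rf_frames, round(rf_seconds, 3))
```

If the target pattern (say, a bird call) lasts much longer than `rf_seconds`, the stack would need more depth, more pooling, or larger kernels along the time axis.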
  26. Have no idea? • Go see how computer vision people are doing • Clone it • It's ok, it's a good baseline at least
  27. "My spectrogram is 28x28 bc the model I downloaded is trained on MNIST." / "Don't use spectrograms as if they are images." / "It all boils down to pattern recognition, they're actually similar tasks." / "The time and frequency axes have totally different meanings." / "I don't know how to incorporate them into my model.." / "BUT IT WORKS!"
  28. Expecting the result (or, know the problem) • Q. How would it work?
  29. YOU • You are responsible for the feasibility • Is it a task you can do? • Is the information in the input (mel-spectrogram)? • Are similar tasks being solved?
  30. Think about it! • Is it possible? To what extent? E.g., • Baby crying detection • Baby crying recognition and classification • Dog barking translation • Hit song detection
  31. Conclusion Conclusion.. Conclusion!
  32. Conclusion • Sound is analog, you might need to think about some analog process, too. • Pre-process: Follow others when you're lost • Audio is big in data size, but sparse in information. Reduce the size. Don't start with end-to-end. • Design: Follow others when you're lost • Expect: Make sure it's doable
  33. Deep Learning with Audio Signals: Prepare, Process, Design, Expect • Keunwoo Choi • Q&A • PS. See you soon at the panel talk!

