Audio Inpainting
with
Generative Adversarial Network
Authors: Pirmin Philipp Ebner, Amr Eltelt
Published on arXiv [eess.AS], March 13, 2020
Presenter : Kuan Hsun Ho
Date : 2021/10/08
Outline
1. Introduction
-Background, Objective
2. Architecture
-GAN, WGAN, Long/Short Borders, Dual Discriminators WGAN, Loss Function
3. Methodology
-Dataset, Signal Preprocessing, Training, Evaluation Metrics
4. Results & Discussion
-Different Network, Different Dataset, Different Training Steps
5. Conclusion & Outlook
6. References
Introduction
01
● Corrupted audio files, information lost in audio transmission (e.g. VoIP), and
audio signals locally contaminated by noise are important problems in
many audio processing tasks, such as music enhancement and restoration.
● Loss of the connection in audio transmission -> data loss beyond a hundred
milliseconds. This has highly unpleasant consequences for the listener, and it is
hardly feasible to reconstruct the lost content from local information alone.
● Audio features are high-dimensional, complex, and non-correlated ->
applying state-of-the-art models from image or video inpainting tends not to
work well [2].
● Approaches to restoring lost information in audio: waveform substitution [5], audio
inter-/extrapolation [6,7], or audio inpainting [8].
Introduction
Background
● Reconstruction is usually aimed at providing a coherent and meaningful information
while preventing audible artifacts so that the listener remains unaware of any occurred
problem.
● Gaps < 50 ms: small gaps need no synthesis and can be handled from local statistics alone
=> sparsity-based techniques are used.
For longer gaps, autoregressive modeling [7], sinusoidal modeling [10,11], or GANs [13] are
used.
● Audio signals are composed of a fundamental frequency and overtones.
● With multiple different instruments in the training set, a wider frequency
spectrum has to be covered and the generated audio signal is more strongly influenced by
noise.
Introduction
Background
● Audio inpainting of long-gap (500–550 ms) audio content.
● Focus on waveform strategies rather than spectrogram or multimodal strategies.
● Use a WGAN to generate audio with
1. global coherence,
2. good audio quality,
3. different types of instruments from different datasets
=> the ability to adapt to different audio signals.
● Find an optimal model setting that generalizes rather than overfits to a specific dataset.
● There is no specific metric for evaluating the quality of a generated audio signal,
so we rely on human judgement using the objective difference grade (ODG)
technique for evaluation.
● Clearly, in this unsupervised setting, the goal is not to generate audio that perfectly
matches the ground truth.
Introduction
Objective
Architecture
02
● GANs rely on two competing neural networks trained simultaneously in a two-player min-max
game: the generator produces new data from samples of a random variable; the
discriminator attempts to distinguish between this generated data and real data.
● Generator: fool the discriminator.
Discriminator: learn to better classify real and generated (fake) data.
● An advantage of GAN-based approaches is the rapid and straightforward sampling of large
amounts of audio.
● Several GAN approaches have been investigated for filling gaps in audio, such as WaveGAN [4] or
TiFGAN [14].
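The min-max game above can be sketched with the standard (non-Wasserstein) GAN losses. This is a generic plain-Python illustration, not the authors' implementation; in practice the scores come from neural networks and the two losses are minimized alternately by gradient descent.

```python
import math

def bce(scores, label):
    # Binary cross-entropy over a batch of discriminator outputs in (0, 1).
    eps = 1e-8
    return -sum(label * math.log(s + eps) + (1 - label) * math.log(1 - s + eps)
                for s in scores) / len(scores)

def discriminator_loss(d_real, d_fake):
    # D is rewarded for scoring real samples near 1 and generated ones near 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # G is rewarded when D mistakes its outputs for real data (label 1).
    return bce(d_fake, 1.0)

# A confident discriminator has a lower loss than a maximally unsure one.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]) < discriminator_loss([0.5], [0.5]))
# prints True
```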
Architecture
GAN
● Improve the stability of learning.
○ Both generator and discriminator have to learn in correct directions.
● Get rid of problems like mode collapse.
○ Generator and discriminator have to be equally robust.
● Provide meaningful learning curves useful for debugging and hyperparameter
searches.
● Guarantee for data diversity.
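The WGAN these properties refer to replaces the discriminator with a critic and uses a Wasserstein loss. Below is a minimal sketch of the original weight-clipping formulation; the clip value 0.01 is the WGAN paper's default, everything else is illustrative rather than the authors' code.

```python
def critic_loss(c_real, c_fake):
    # The critic maximizes E[C(real)] - E[C(fake)]; minimizing the negation
    # gives a loss that directly tracks the Wasserstein distance estimate.
    return sum(c_fake) / len(c_fake) - sum(c_real) / len(c_real)

def generator_loss(c_fake):
    # The generator tries to raise the critic's score on generated audio.
    return -sum(c_fake) / len(c_fake)

def clip_weights(weights, c=0.01):
    # The original WGAN enforces the Lipschitz constraint by clipping every
    # critic weight to [-c, c] after each update.
    return [max(-c, min(c, w)) for w in weights]

print(critic_loss([1.0, 1.0], [-1.0, -1.0]))  # -2.0
print(clip_weights([0.5, -0.3, 0.005]))       # [0.01, -0.01, 0.005]
```

Because the critic loss is an estimate of a real distance, it decreases monotonically as the generator improves, which is what gives WGANs their useful learning curves.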
Architecture
WGAN
Architecture
WGAN
G : Z -> X    D : X -> [0,1]    (both trained via backpropagation)
● The approach applied here extracts both the short-range borders and the long-range borders of
the missing audio content.
● Close neighboring audio has proven successful for inpainting short missing segments, but it fails
for long ones.
● A larger range of neighboring audio can tell us something about the long missing audio content.
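A sketch of the border extraction, assuming a 0.5 s gap with 1 s (short) and 3 s (long) of context on each side, matching the border lengths given below; the helper name and sample rate are illustrative.

```python
def extract_borders(signal, gap_start, gap_len, ctx):
    # Keep `ctx` samples on each side of the gap and zero the gap itself.
    # All arguments are in samples.
    gap_end = gap_start + gap_len
    return (signal[gap_start - ctx:gap_start]
            + [0.0] * gap_len
            + signal[gap_end:gap_end + ctx])

fs = 16000                             # illustrative sample rate
x = [0.0] * (10 * fs)                  # 10 s of (placeholder) audio
gap_start, gap_len = 4 * fs, fs // 2   # 0.5 s gap starting at 4 s

short = extract_borders(x, gap_start, gap_len, ctx=1 * fs)  # 1 s each side
long = extract_borders(x, gap_start, gap_len, ctx=3 * fs)   # 3 s each side
print(len(short) / fs, len(long) / fs)  # 2.5 6.5
```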
Architecture
Long/Short Borders
Borders 1 length = 1 + 0.5 + 1 = 2.5 s.
Borders 2 length = 3 + 0.5 + 3 = 6.5 s.
(Architecture of the proposed model, implemented in TensorFlow)
Architecture
Dual Discriminators WGAN
● To make the GAN perform well, we minimize the total loss of both discriminators and the
generator, and expect convergence.
Architecture
Loss Function
This figure shows the total D-loss and total G-loss during training for the different datasets.
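The "total loss" can be made concrete. Assuming each discriminator contributes a WGAN-style critic loss and the totals are plain sums (the paper's exact weighting may differ), the quantities tracked in the training curves look like:

```python
def wgan_critic_loss(c_real, c_fake):
    # Mean critic score on fake samples minus mean score on real samples.
    return sum(c_fake) / len(c_fake) - sum(c_real) / len(c_real)

def total_d_loss(d1_real, d1_fake, d2_real, d2_fake):
    # D1 judges the short-range borders, D2 the long-range borders;
    # the total D-loss is their sum.
    return wgan_critic_loss(d1_real, d1_fake) + wgan_critic_loss(d2_real, d2_fake)

def total_g_loss(d1_fake, d2_fake):
    # The generator must fool both discriminators at once.
    return -(sum(d1_fake) / len(d1_fake) + sum(d2_fake) / len(d2_fake))

print(total_d_loss([1.0], [-1.0], [0.5], [-0.5]))  # -3.0
```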
Methodology
03
● We considered instrument sounds, including piano, acoustic guitar, and piano together with
a string orchestra, using three different datasets.
1. The PIANO dataset contains only piano samples.
2. The SOLO dataset contains recordings of different instruments, including accordion, acoustic
guitar, cello, flute, saxophone, trumpet, violin, and xylophone. In our work, only the
acoustic guitar recordings are used.
3. The MAESTRO dataset contains recordings from the International Piano-e-Competition,
a piano performance competition where virtuoso pianists perform on Yamaha
claviers together with a string orchestra. The data must be split into training / validation
/ testing sets so that the same composition, even if performed by multiple contestants, does
not appear in multiple subsets. (Important!)
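The composition-disjoint split can be sketched as follows; the function name, split ratios, and piece names are illustrative, not taken from the paper.

```python
import random

def split_by_composition(recordings, ratios=(0.8, 0.1, 0.1), seed=0):
    # `recordings` maps recording id -> composition name. Splitting is done at
    # the composition level, so a piece played by several contestants never
    # lands in more than one subset.
    compositions = sorted({c for c in recordings.values()})
    random.Random(seed).shuffle(compositions)
    n = len(compositions)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    groups = (set(compositions[:cut1]), set(compositions[cut1:cut2]),
              set(compositions[cut2:]))
    return [[r for r, c in recordings.items() if c in g] for g in groups]

recs = {"r1": "Ballade No.1", "r2": "Ballade No.1", "r3": "Etude Op.10",
        "r4": "Sonata K.331", "r5": "Etude Op.10"}
train, val, test = split_by_composition(recs, ratios=(0.4, 0.3, 0.3))
# r1 and r2 (same composition) always end up in the same subset.
```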
Methodology
Dataset
(Training / validation / testing segmentation)
Methodology
Dataset
Methodology
Signal Preprocessing
The filter cutoff frequency is set below the Nyquist frequency. The audio signals are then
downsampled by a factor of 3.
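`scipy.signal.decimate` performs exactly this filter-then-downsample step in one call; below is a dependency-light sketch using a windowed-sinc low-pass. The 0.8 safety factor on the cutoff and the 101-tap length are illustrative choices, not the paper's settings.

```python
import numpy as np

def lowpass_then_downsample(x, factor=3, numtaps=101):
    # Windowed-sinc low-pass with cutoff safely below the NEW Nyquist
    # frequency (fs / (2 * factor)), then keep every `factor`-th sample.
    cutoff = 0.8 / (2 * factor)                  # normalized, cycles/sample
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(numtaps)
    h /= h.sum()                                 # unity gain at DC
    y = np.convolve(x, h, mode="same")           # anti-aliasing filter
    return y[::factor]                           # decimation

fs = 48000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                  # 440 Hz tone, 1 s
y = lowpass_then_downsample(x, factor=3)         # now at 16 kHz
print(len(y))  # 16000
```

The low-pass step is what prevents frequencies above the new Nyquist limit from folding back (aliasing) into the downsampled signal.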
● Generator and discriminator share the same hyperparameters:
Learning rate: 1e-4
Optimizer: Adam
Batch size: 64
Iterations (steps): MAESTRO: WGAN 73 k, D2WGAN 56.3 k; PIANO, SOLO: WGAN 40 k, D2WGAN 53.6 k
● Parameters: 53,248
● 3 datasets x 2 models = 6 trainings
● The figure beside shows a comparison of generated audio and real audio.
Methodology
Training
● The signal-to-noise ratio (SNR) applied to time-domain waveforms and
magnitude spectrograms is not an appropriate indicator for evaluating the model.
● Therefore, we conducted a user study evaluating the inpainting quality by means of
objective difference grades (ODG [22]).
● 50 samples of 6.5 s length for each dataset and model, 150 samples per model in total,
were evaluated using ODG by two independent listeners who did not know the source of
the samples.
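The ODG grades follow the PEAQ (ITU-R BS.1387) convention, a 0 to -4 scale; aggregating the listeners' grades for one model/dataset pair is then a simple average.

```python
# ODG (objective difference grade) scale, per the PEAQ / ITU-R BS.1387
# grading convention.
ODG_SCALE = {
     0: "imperceptible",
    -1: "perceptible, but not annoying",
    -2: "slightly annoying",
    -3: "annoying",
    -4: "very annoying",
}

def mean_odg(grades):
    # Average the grades collected from the listeners.
    return sum(grades) / len(grades)

print(mean_odg([-1, -2, -1, 0]))  # -1.0
```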
Methodology
Evaluation Metrics
Results & Discussion
04
Results & Discussion
Different Network
For PIANO,
● D-loss and G-loss in both models
converge.
● No overfitting occurs.
=> WGAN slightly outperforms.
For SOLO,
● D-loss in WGAN hasn’t converged.
● G-loss in both models converge.
=> D2WGAN highly outperforms.
● ODG table is as follows.
Results & Discussion
Different Network
(Table of ODG scores with annotated differences: +0.02, +0.08, +0.2, +0.1; -0.33, -0.07)
● Inpainting quality: SOLO > PIANO > MAESTRO.
● The SOLO dataset offers many more playing techniques and sound variations than the
PIANO dataset.
Guitar: 144 tones, covering four octaves. Piano: 88 tones, covering seven octaves.
● A lower-frequency signal represents information at a lower rate (fewer samples are needed to
reconstruct it), whereas a high-frequency signal represents information at a faster rate (and hence
needs more samples for reconstruction).
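The rate argument can be made concrete: at a fixed sample rate, a high-frequency component is described by far fewer samples per cycle, so reconstruction errors there are harder to hide. (The 16 kHz rate is illustrative.)

```python
fs = 16000                  # illustrative sample rate
for f in (100, 1000, 4000):
    print(f, fs / f)        # samples available per cycle of that component
# 100 160.0
# 1000 16.0
# 4000 4.0
```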
Results & Discussion
Different Dataset
● Human ears are more sensitive to high frequencies, as reflected in the
K-weighting filter.
● The MAESTRO dataset, a combination of piano together with a
string orchestra, is much harder to train on. It is also a huge dataset of
201 hours of music, and more than one instrument is
playing at the same time.
Results & Discussion
Different Dataset
● ODG performance improves when the network is trained longer.
● The D2WGAN model was trained on the PIANO dataset for 140,000 steps and tested on both the
PIANO and MAESTRO datasets.
● The D2WGAN model showed a huge improvement on both datasets. The results indicate that we
did not overfit the D2WGAN model and achieved excellent performance (in both mean and std).
● One explanation may be that the model interpreted the orchestra mainly as background noise.
Small noise in the inpainted part is thus smoothed more strongly, and the impairment is therefore
perceptible but not annoying.
Results & Discussion
Different Training Steps
Conclusion & Outlook
05
● Audio inpainting of long-gap (500–550 ms) audio content.
● Focus on waveform strategies rather than spectrogram or multimodal strategies.
● Use a WGAN to generate audio with
1. global coherence,
2. good audio quality,
3. different types of instruments from different datasets
=> the ability to adapt to different audio signals.
● Find an optimal model setting that generalizes rather than overfits to a specific dataset.
● There is no specific metric for evaluating the quality of a generated audio signal,
so we rely on human judgement using the objective difference grade (ODG)
technique for evaluation.
● Clearly, in this unsupervised setting, the goal is not to generate audio that perfectly
matches the ground truth.
Introduction
Objective
● We analysed inpainting of long-gap (500–550 ms) audio content using WGAN and D2WGAN.
● The D2WGAN model is a newly proposed architectural improvement that inpaints the missing
audio content using short-range and long-range borders.
● The newly proposed D2WGAN model slightly outperforms the classic WGAN model on all
three datasets (piano, guitar, piano with orchestra).
● The long-range borders combined with the short-range borders can potentially provide
information correlated with the gap content.
● A larger improvement was observed for the SOLO and MAESTRO datasets.
● The poorer results for PIANO can be explained by its lack of audible variation in
sound.
● As we aim for a generalized solution, we need a dataset with generalized audio
content.
Conclusion & Outlook
● By increasing the training steps, the D2WGAN significantly improved its performance
without overfitting.
● Better results can be achieved for audio datasets where a particular instrument is
accompanied by other instruments if we train the network only on that particular instrument
and neglect the other instruments.
● As future work, several topics can be explored:
1. varying the border lengths
2. adding more layers to the convolution operation or changing filter sizes
3. exploring multimodal strategies that combine waveforms and spectrogram
images
4. finding datasets that help strengthen the reconstruction of high frequencies
5. training on a wide range of musical instruments
Conclusion & Outlook
References
06
[1] N. Perraudin, N. Holighaus, P. Majdak, and P. Balazs, “Inpainting of long audio segments with similarity graphs,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 2018.
[2] Y.-L. Chang, K.-Y. Lee, P.-Y. Wu, H.-y. Lee, and W. Hsu, “Deep long audio inpainting,” arXiv:1911.06476v1, 2019.
[3] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative
model for raw audio,” in SSW, p. 125, 2016.
[4] C. Donahue, J. McAuley, and M. Puckette, “Synthesizing audio with generative adversarial networks,” arXiv:1802.04208, 2018.
[5] D. Goodman, G. Lockhart, O. Wasem, and W.-C. Wong, “Waveform substitution techniques for recovering missing speech segments in packed voice
communications,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 6, pp. 1440–1448, 1986.
[6] I. Kauppinen, J. Kauppinen, and P. Saarinen, “A method for long extrapolation of audio signals,” Journal of the Audio Engineering Society, vol. 49, no.
12, pp. 1167–1180, 2001.
[7] W. Etter, “Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters,” IEEE
Transactions on Signal Processing, vol. 44, no. 5, pp. 1124–1135, 1996.
[8] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, and M. Plumbley, “Audio inpainting,” IEEE Transactions on Audio, Speech and Language
Processing, vol. 20, no. 3, pp. 922–932, 2012.
[9] K. Siedenburg, M. Dörfler, and M. Kowalski, “Audio inpainting with social sparsity,” SPARS Signal Processing with Adaptive Sparse Structured
Representations, 2013.
References
[10] M. Lagrange, S. Marchand, and J.-B. Rault, “Long interpolation of audio signals using linear prediction in sinusoidal modeling,” Audio Eng. Soc., vol.
53, no. 10, pp. 891–905, 2005.
[11] A. Lukin and J. Todd, “Parametric interpolation of gaps in audio signals,” Audio Engineering Society Convention, vol. 125, 2008.
[12] Y. Bahat, Y. Schechner, and M. Elad, “Self-content-based audio inpainting,” Signal Processing, vol. 111, pp. 61–72, 2015.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in
neural information processing systems, pp. 2672–2680, 2014.
[14] A. Marafioti, N. Holighaus, N. Perraudin, and P. Majdak, “Adversarial generation of time-frequency features with application in audio synthesis,”
arXiv preprint arXiv:1902.04072, 2019.
[15] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” International Conference on Learning
Representations, 2017.
[16] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, “VEEGAN: Reducing mode collapse in GANs using implicit variational
learning,” Advances in Neural Information Processing Systems, 2017.
[17] K. J. Liang, C. Li, G. Wang, and L. Carin, “Generative adversarial network training is a continual learning problem,” arXiv:1811.11083v1, 2018.
References
CREDITS: This presentation template was created
by Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Thanks

More Related Content

Similar to Audio Inpainting with D2WGAN.pdf

Audio Noise Removal – The State of the Art
Audio Noise Removal – The State of the ArtAudio Noise Removal – The State of the Art
Audio Noise Removal – The State of the Artijceronline
 
Review Paper on Noise Reduction Using Different Techniques
Review Paper on Noise Reduction Using Different TechniquesReview Paper on Noise Reduction Using Different Techniques
Review Paper on Noise Reduction Using Different TechniquesIRJET Journal
 
A review of analog audio scrambling methods for residual intelligibility
A review of analog audio scrambling methods for residual intelligibilityA review of analog audio scrambling methods for residual intelligibility
A review of analog audio scrambling methods for residual intelligibilityAlexander Decker
 
Audio Steganography Coding Using the Discreet Wavelet Transforms
Audio Steganography Coding Using the Discreet Wavelet TransformsAudio Steganography Coding Using the Discreet Wavelet Transforms
Audio Steganography Coding Using the Discreet Wavelet TransformsCSCJournals
 
IRJET- Wavelet Transform based Steganography
IRJET- Wavelet Transform based SteganographyIRJET- Wavelet Transform based Steganography
IRJET- Wavelet Transform based SteganographyIRJET Journal
 
A novel speech enhancement technique
A novel speech enhancement techniqueA novel speech enhancement technique
A novel speech enhancement techniqueeSAT Publishing House
 
IRJET- Survey on Efficient Signal Processing Techniques for Speech Enhancement
IRJET- Survey on Efficient Signal Processing Techniques for Speech EnhancementIRJET- Survey on Efficient Signal Processing Techniques for Speech Enhancement
IRJET- Survey on Efficient Signal Processing Techniques for Speech EnhancementIRJET Journal
 
Snorm–A Prototype for Increasing Audio File Stepwise Normalization
Snorm–A Prototype for Increasing Audio File Stepwise NormalizationSnorm–A Prototype for Increasing Audio File Stepwise Normalization
Snorm–A Prototype for Increasing Audio File Stepwise NormalizationIJERA Editor
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...karthik annam
 
IRJET- Musical Instrument Recognition using CNN and SVM
IRJET-  	  Musical Instrument Recognition using CNN and SVMIRJET-  	  Musical Instrument Recognition using CNN and SVM
IRJET- Musical Instrument Recognition using CNN and SVMIRJET Journal
 
Quality and Distortion Evaluation of Audio Signal by Spectrum
Quality and Distortion Evaluation of Audio Signal by SpectrumQuality and Distortion Evaluation of Audio Signal by Spectrum
Quality and Distortion Evaluation of Audio Signal by SpectrumCSCJournals
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...IRJET Journal
 
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET Journal
 
Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...eSAT Journals
 

Similar to Audio Inpainting with D2WGAN.pdf (20)

Audio Noise Removal – The State of the Art
Audio Noise Removal – The State of the ArtAudio Noise Removal – The State of the Art
Audio Noise Removal – The State of the Art
 
Review Paper on Noise Reduction Using Different Techniques
Review Paper on Noise Reduction Using Different TechniquesReview Paper on Noise Reduction Using Different Techniques
Review Paper on Noise Reduction Using Different Techniques
 
A review of analog audio scrambling methods for residual intelligibility
A review of analog audio scrambling methods for residual intelligibilityA review of analog audio scrambling methods for residual intelligibility
A review of analog audio scrambling methods for residual intelligibility
 
Audio Steganography Coding Using the Discreet Wavelet Transforms
Audio Steganography Coding Using the Discreet Wavelet TransformsAudio Steganography Coding Using the Discreet Wavelet Transforms
Audio Steganography Coding Using the Discreet Wavelet Transforms
 
IRJET- Wavelet Transform based Steganography
IRJET- Wavelet Transform based SteganographyIRJET- Wavelet Transform based Steganography
IRJET- Wavelet Transform based Steganography
 
A novel speech enhancement technique
A novel speech enhancement techniqueA novel speech enhancement technique
A novel speech enhancement technique
 
IRJET- Survey on Efficient Signal Processing Techniques for Speech Enhancement
IRJET- Survey on Efficient Signal Processing Techniques for Speech EnhancementIRJET- Survey on Efficient Signal Processing Techniques for Speech Enhancement
IRJET- Survey on Efficient Signal Processing Techniques for Speech Enhancement
 
Snorm–A Prototype for Increasing Audio File Stepwise Normalization
Snorm–A Prototype for Increasing Audio File Stepwise NormalizationSnorm–A Prototype for Increasing Audio File Stepwise Normalization
Snorm–A Prototype for Increasing Audio File Stepwise Normalization
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
 
IRJET- Musical Instrument Recognition using CNN and SVM
IRJET-  	  Musical Instrument Recognition using CNN and SVMIRJET-  	  Musical Instrument Recognition using CNN and SVM
IRJET- Musical Instrument Recognition using CNN and SVM
 
SEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial NetworkSEGAN: Speech Enhancement Generative Adversarial Network
SEGAN: Speech Enhancement Generative Adversarial Network
 
Quality and Distortion Evaluation of Audio Signal by Spectrum
Quality and Distortion Evaluation of Audio Signal by SpectrumQuality and Distortion Evaluation of Audio Signal by Spectrum
Quality and Distortion Evaluation of Audio Signal by Spectrum
 
Sudormrf.pdf
Sudormrf.pdfSudormrf.pdf
Sudormrf.pdf
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...
IRJET- A Novel Hybrid Image Denoising Technique based on Trilateral Filtering...
 
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
 
Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...Design and implementation of different audio restoration techniques for audio...
Design and implementation of different audio restoration techniques for audio...
 
H010234144
H010234144H010234144
H010234144
 
Nd2421622165
Nd2421622165Nd2421622165
Nd2421622165
 

More from ssuser849b73

Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfssuser849b73
 
Frame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdfFrame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdfssuser849b73
 
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...ssuser849b73
 
Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdfssuser849b73
 

More from ssuser849b73 (7)

Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
 
Frame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdfFrame-Online DNN-WPE Dereverberation.pdf
Frame-Online DNN-WPE Dereverberation.pdf
 
WaveNet.pdf
WaveNet.pdfWaveNet.pdf
WaveNet.pdf
 
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
 
Wavesplit.pdf
Wavesplit.pdfWavesplit.pdf
Wavesplit.pdf
 
EEND-SS.pdf
EEND-SS.pdfEEND-SS.pdf
EEND-SS.pdf
 
Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdf
 

Recently uploaded

AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...drmkjayanthikannan
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...vershagrag
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilVinayVitekari
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...jabtakhaidam7
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 

Recently uploaded (20)

AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil

Audio Inpainting with D2WGAN.pdf

  • 1. Audio Inpainting with Generative Adversarial Network Authors: Pirmin Philipp Ebner, Amr Eltelt Published in arXiv [eess.AS], March 13, 2020 Presenter: Kuan Hsun Ho Date: 2021/10/08
  • 2. Outline 1. Introduction -Background, Objective 2. Architecture -GAN, WGAN, Long/Short Borders, Dual Discriminators WGAN, Loss Function 3. Methodology -Dataset, Signal Preprocessing, Training, Evaluation Metrics 4. Results & Discussion -Different Network, Different Dataset, Different Training Steps 5. Conclusion & Outlook 6. References
  • 4. ● Corrupted audio files, lost information in audio transmission (e.g. VoIP), and audio signals locally contaminated by noise are highly important problems in various audio processing tasks such as music enhancement and restoration. ● A connection loss in audio transmission can cause data loss beyond a hundred milliseconds. This has highly unpleasant consequences for a listener, and it is hardly feasible to reconstruct the lost content from local information only. ● Audio features are high dimensional, complex, and non-correlated, so applying state-of-the-art models from image or video inpainting tends not to work well [2]. ● Approaches to restoring lost information in audio: waveform substitution [5], audio inter/extrapolation [6,7], or audio inpainting [8]. Introduction Background
  • 5. ● Reconstruction usually aims at providing coherent and meaningful information while preventing audible artifacts, so that the listener remains unaware of any problem. ● Gaps < 50 ms (small gaps): no synthesis needed, only signal statistics are considered => sparsity-based techniques are used. Otherwise, for long gaps, autoregressive modeling [7], sinusoidal modeling [10,11], or GANs [13] are used. ● Audio signals are composed of a fundamental frequency and overtones. ● With multiple different instruments in the training set, a wider frequency spectrum has to be covered and the generated audio signal is more strongly influenced by noise. Introduction Background
  • 6. ● Audio inpainting of long-gap (500~550 ms) audio content. ● Focus on waveform strategies rather than spectrogram or multimodal strategies. ● Using WGAN to generate audio with 1. global coherence, 2. good audio quality, 3. different types of instruments from different datasets => ability to adapt to different audio signals. ● Finding an optimal model setting that generalizes rather than overfits to a specific dataset. ● There is no specific metric for evaluating the quality of generated audio, so we rely on human judgement using objective difference grading (ODG) for evaluation. ● In this unsupervised setting, the goal is not to generate audio perfectly matching the ground truth. Introduction Objective
  • 8. ● GANs rely on two competing neural networks trained simultaneously in a two-player min-max game: the generator produces new data from samples of a random variable; the discriminator attempts to distinguish between these generated and real data. ● Generator: tries to fool the discriminator. Discriminator: learns to better classify real and generated (fake) data. ● An advantage of GAN-based approaches is the rapid and straightforward sampling of large amounts of audio. ● Several GAN approaches have been investigated to fill gaps in audio, such as WaveGAN [4] or TiFGAN [14]. Architecture GAN
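The min-max game above can be made concrete with the standard GAN losses. This is a minimal numerical sketch, not the paper's code; it assumes the discriminator outputs probabilities in (0, 1), and uses the common non-saturating generator loss:

```python
import numpy as np

def d_loss(real_scores, fake_scores):
    # Discriminator wants to maximize log D(x) + log(1 - D(G(z))),
    # i.e. minimize the negative of it.
    return -(np.mean(np.log(real_scores)) + np.mean(np.log(1.0 - fake_scores)))

def g_loss(fake_scores):
    # Non-saturating generator loss: maximize log D(G(z)),
    # i.e. push the discriminator's score on fakes toward 1.
    return -np.mean(np.log(fake_scores))
```

A discriminator that confidently separates real from fake (real near 1, fake near 0) has a low `d_loss`, while the generator's loss drops as it fools the discriminator.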
  • 9. ● Improve the stability of learning. ○ Both generator and discriminator have to learn in the correct directions. ● Avoid problems like mode collapse. ○ Generator and discriminator have to be equally robust. ● Provide meaningful learning curves useful for debugging and hyperparameter searches. ● Guarantee data diversity. Architecture WGAN
  • 10. Architecture WGAN G: Z -> X (the generator maps latent noise to the data space); D: X -> [0,1] (the discriminator scores how real an input looks); gradients are backpropagated through both networks.
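A minimal sketch of what distinguishes the WGAN objective: the critic produces an unbounded score rather than a [0,1] probability, and the original WGAN enforces the Lipschitz constraint by weight clipping. The clipping threshold 0.01 below is the common default from the WGAN paper, not a value stated on the slides:

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    # The WGAN critic maximizes E[D(x_real)] - E[D(G(z))],
    # an estimate of the Wasserstein distance between real and fake.
    return -(np.mean(real_scores) - np.mean(fake_scores))

def generator_loss(fake_scores):
    # The generator maximizes E[D(G(z))].
    return -np.mean(fake_scores)

def clip_weights(w, c=0.01):
    # Weight clipping keeps the critic approximately 1-Lipschitz.
    return np.clip(w, -c, c)
```

These surrogate losses are what produce the "meaningful learning curves" claimed on the previous slide: the critic loss tracks an estimate of the distance between distributions.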
  • 11. ● The approach applied here extracts the short-range borders and the long-range borders of the missing audio content. ● Close-neighbor audio has proven successful for inpainting short missing segments, but it fails for long gaps. ● A larger range of neighboring audio can tell us something about the long missing audio content. Architecture Long/Short Borders Borders 1 length = 1 + 0.5 + 1 = 2.5 s. Borders 2 length = 3 + 0.5 + 3 = 6.5 s.
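The border extraction can be sketched as follows. `extract_borders` is a hypothetical helper (not from the paper) that cuts the context segment on each side of the gap; with a 0.5 s gap, border lengths of 1 s per side give the 2.5 s window (Borders 1) and 3 s per side give the 6.5 s window (Borders 2) from the slide:

```python
import numpy as np

def extract_borders(signal, fs, gap_start, gap_len, border_len):
    """Return (left, right) context segments around a gap.

    gap_start, gap_len, and border_len are in seconds; fs is the
    sampling rate in Hz.
    """
    s = int(gap_start * fs)          # first missing sample
    g = int(gap_len * fs)            # gap length in samples
    b = int(border_len * fs)         # border length in samples
    left = signal[max(0, s - b):s]   # context just before the gap
    right = signal[s + g:s + g + b]  # context just after the gap
    return left, right
```

Calling it twice, once with `border_len=1.0` and once with `border_len=3.0`, yields the short-range and long-range inputs for the two discriminators.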
  • 12. (Architecture of the proposed model, implemented in TensorFlow) Architecture Dual Discriminators WGAN
  • 13. ● To make the GAN perform well, we minimize the total loss of both discriminators and the generator, and expect convergence. Architecture Loss Function The figure shows the total D-loss and total G-loss during training for the different datasets.
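A sketch of how the total losses of a dual-discriminator WGAN might be combined. The slides do not state the exact weighting, so equal weighting of the short-range and long-range critics is assumed here, and the function names are ours:

```python
import numpy as np

def total_d_loss(d_short_real, d_short_fake, d_long_real, d_long_fake):
    # Sum of both critics' WGAN losses (equal weighting assumed).
    loss_short = -(np.mean(d_short_real) - np.mean(d_short_fake))
    loss_long = -(np.mean(d_long_real) - np.mean(d_long_fake))
    return loss_short + loss_long

def total_g_loss(d_short_fake, d_long_fake):
    # The generator tries to raise both critics' scores on generated audio.
    return -(np.mean(d_short_fake) + np.mean(d_long_fake))
```

Minimizing both totals jointly is what the convergence curves in the figure track.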
  • 15. ● We considered instrument sounds including piano, acoustic guitar, and piano together with a string orchestra, using three different datasets. 1. The PIANO dataset contains only piano samples. 2. The SOLO dataset contains recordings of different instruments, including accordion, acoustic guitar, cello, flute, saxophone, trumpet, violin, and xylophone; in this work only the acoustic guitar recordings are used. 3. The MAESTRO dataset contains recordings from the International Piano-e-Competition, a piano performance competition where virtuoso pianists perform on Yamaha claviers together with a string orchestra. The data must be split into training / validation / testing so that the same composition, even if performed by multiple contestants, does not appear in multiple subsets. (Important!) Methodology Dataset
  • 16. (Training / validation / testing segmentation) Methodology Dataset
  • 17. Methodology Signal Preprocessing A low-pass filter with cutoff frequency below the Nyquist frequency is applied; the audio signals are then downsampled by a factor of 3.
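This preprocessing step (anti-aliasing low-pass filter followed by downsampling by 3) can be done in one call with `scipy.signal.decimate`, which applies the filter with cutoff below the new Nyquist frequency before discarding samples. The 48 kHz input rate below is an assumption for illustration; the slides do not state the original sampling rate:

```python
import numpy as np
from scipy.signal import decimate

fs = 48000                          # assumed original sampling rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)     # 1 s of a 440 Hz tone

# decimate low-pass filters (anti-aliasing) and then downsamples by q=3
y = decimate(x, 3)

new_fs = fs // 3                    # 16000 Hz after downsampling
```

Downsampling by 3 cuts the data rate and the covered frequency range to a third, which shrinks the space the generator has to model.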
  • 18. ● Generator and discriminator share the same hyperparameters: learning rate 1e-4, optimizer Adam, batch size 64. ● Iterations (steps): MAESTRO: WGAN 73 k, D2WGAN 56.3 k; PIANO, SOLO: WGAN 40 k, D2WGAN 53.6 k. ● Parameters: 53,248. ● 3 datasets x 2 models = 6 training runs. ● The figure alongside compares generated and real audio. Methodology Training
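For reference, a single Adam update with the learning rate from the slide (1e-4), written out in NumPy. This is a textbook sketch of the optimizer, not the paper's TensorFlow training code:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameter w given gradient grad at step t >= 1."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the bias-corrected update is approximately `lr * sign(grad)`, so each parameter moves by about 1e-4 regardless of gradient scale.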
  • 19. ● Signal-to-noise ratios (SNRs) applied to the time-domain waveforms and magnitude spectrograms are not an appropriate indicator for evaluating the model. ● Therefore, we conducted a user study evaluating the inpainting quality by means of objective difference grades (ODG [22]). ● 50 samples of 6.5 s length for each dataset and model, in total 150 samples per model, were rated by two independent listeners, blind to the source of the samples, using ODG. Methodology Evaluation Metrics
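The time-domain SNR that the authors argue is not an appropriate indicator is the standard ratio of signal power to reconstruction-error power (the helper name here is ours):

```python
import numpy as np

def snr_db(reference, estimate):
    # SNR in dB between a reference waveform and its reconstruction:
    # 10 * log10( P_signal / P_noise ), with noise = reference - estimate.
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

A phase-shifted but perceptually identical reconstruction can score a very low SNR, which is why a perceptual measure like ODG was used instead.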
  • 21. Results & Discussion Different Network For PIANO: ● D-loss and G-loss converge in both models. ● No overfitting occurs. => WGAN slightly outperforms. For SOLO: ● D-loss in WGAN has not converged. ● G-loss converges in both models. => D2WGAN clearly outperforms.
  • 22. ● The ODG comparison table is as follows (score differences from the table: +0.02, +0.08, +0.2, +0.1, -0.33, -0.07). Results & Discussion Different Network
  • 23. ● Inpainting quality: SOLO > PIANO > MAESTRO. ● The SOLO dataset offers many more inpainting techniques and sound variations than the PIANO dataset. Guitar: 144 tones, covering four octaves. Piano: 88 tones, covering seven octaves. ● A lower-frequency signal carries information at a lower rate (fewer samples are needed to reconstruct it), whereas a high-frequency signal carries information at a faster rate (and hence needs more samples for reconstruction). Results & Discussion Different Dataset ● Human ears are more sensitive to high frequencies, as reflected by the K-weighting filter. ● The MAESTRO dataset, a combination of piano and string orchestra, is much harder to train on. It is also a huge dataset of 201 hours of music, and more than a single instrument plays at the same time.
  • 25. ● ODG performance improves when the network is trained longer. ● The D2WGAN model was trained on the PIANO dataset for 140,000 steps and tested on both the PIANO and MAESTRO datasets. ● The D2WGAN model showed a large improvement on both datasets. The results indicate that we did not overfit the D2WGAN model and obtained excellent performance (both mean & std). ● An explanation may be that the model interpreted the orchestra mainly as background noise. Thus, small noise in the inpainted part is more strongly smoothed, and the impairment is therefore more perceptible but not annoying. Results & Discussion Different Training Steps
  • 27. ● Audio inpainting of long-gap (500~550 ms) audio content. ● Focus on waveform strategies rather than spectrogram or multimodal strategies. ● Using WGAN to generate audio with 1. global coherence, 2. good audio quality, 3. different types of instruments from different datasets => ability to adapt to different audio signals. ● Finding an optimal model setting that generalizes rather than overfits to a specific dataset. ● There is no specific metric for evaluating the quality of generated audio, so we rely on human judgement using objective difference grading (ODG) for evaluation. ● In this unsupervised setting, the goal is not to generate audio perfectly matching the ground truth. Introduction Objective
  • 28. ● We analysed inpainting of long-gap (500 - 550 ms) audio content using WGAN and D2WGAN. ● The D2WGAN model is a newly proposed architectural improvement that inpaints the missing audio content using short-range and long-range borders. ● The proposed D2WGAN model slightly outperforms the classic WGAN model on all three datasets (piano, guitar, piano and orchestra). ● The long-range borders combined with short-range borders can potentially provide information correlated with the gap content. ● A larger improvement was observed for SOLO and MAESTRO. ● The poorer results for PIANO can be explained by its lack of audible variation in sound. ● As we aim for a generalized solution, we need a dataset with generalized audio content. Conclusion & Outlook
  • 29. ● By increasing the training steps, D2WGAN significantly improved performance without overfitting. ● Better results can be achieved for audio datasets where a particular instrument is accompanied by other instruments if we train the network only on that particular instrument and neglect the others. ● As future work, a few topics can be explored: 1. varying the border length, 2. adding more layers to the convolution operations or changing filter sizes, 3. exploring multimodal strategies combining waveform and spectrogram images, 4. finding datasets that help strengthen reconstruction of high frequencies, 5. training on a wide range of music instruments. Conclusion & Outlook
  • 31. [1] N. Perraudin, N. Holighaus, P. Majdak, and P. Balazs, “Inpainting of long audio segments with similarity graphs,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018. [2] Y.-L. Chang, K.-Y. Lee, P.-Y. Wu, H.-y. Lee, and W. Hsu, “Deep long audio inpainting,” arXiv:1911.06476v1, 2019. [3] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” in SSW, p. 125, 2016. [4] C. Donahue, J. McAuley, and M. Puckette, “Synthesizing audio with generative adversarial networks,” arXiv:1802.04208, 2018. [5] D. Goodman, G. Lockhart, O. Wasem, and W.-C. Wong, “Waveform substitution techniques for recovering missing speech segments in packet voice communications,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 6, pp. 1440–1448, 1986. [6] I. Kauppinen, J. Kauppinen, and P. Saarinen, “A method for long extrapolation of audio signals,” Journal of the Audio Engineering Society, vol. 49, no. 12, pp. 1167–1180, 2001. [7] W. Etter, “Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters,” IEEE Transactions on Signal Processing, vol. 44, no. 5, pp. 1124–1135, 1996. [8] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, and M. Plumbley, “Audio inpainting,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 3, pp. 922–932, 2012. [9] K. Siedenburg, M. Dörfler, and M. Kowalski, “Audio inpainting with social sparsity,” SPARS Signal Processing with Adaptive Sparse Structured Representations, 2013. References
  • 32. [10] M. Lagrange, S. Marchand, and J.-B. Rault, “Long interpolation of audio signals using linear prediction in sinusoidal modeling,” Journal of the Audio Engineering Society, vol. 53, no. 10, pp. 891–905, 2005. [11] A. Lukin and J. Todd, “Parametric interpolation of gaps in audio signals,” Audio Engineering Society Convention, vol. 125, 2008. [12] Y. Bahat, Y. Schechner, and M. Elad, “Self-content-based audio inpainting,” Signal Processing, vol. 111, pp. 61–72, 2015. [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [14] A. Marafioti, N. Holighaus, N. Perraudin, and P. Majdak, “Adversarial generation of time-frequency features with application in audio synthesis,” arXiv preprint arXiv:1902.04072, 2019. [15] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” International Conference on Learning Representations, 2017. [16] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, “VEEGAN: Reducing mode collapse in GANs using implicit variational learning,” Advances in Neural Information Processing Systems, 2017. [17] K. J. Liang, C. Li, G. Wang, and L. Carin, “Generative adversarial network training is a continual learning problem,” arXiv:1811.11083v1, 2018. References
  • 33. CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik. Thanks