This document summarizes a research paper on developing efficient neural network architectures for universal audio source separation. The proposed model, called SuDoRM-RF, uses successive downsampling and resampling of multi-resolution features to extract information from multiple temporal scales with fewer computations. Experimental results show that SuDoRM-RF achieves comparable or better source separation performance than state-of-the-art models while requiring significantly fewer computational resources in terms of FLOPs, memory usage, and training time. Improved and causal variants of SuDoRM-RF are also proposed to further reduce parameters and enable real-time processing.
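The core idea of extracting information at multiple temporal scales by successive downsampling and resampling can be sketched in a few lines of numpy. This is an illustrative stand-in, not the SuDoRM-RF architecture itself: the real model uses learned convolutional blocks, whereas here simple average pooling and nearest-neighbour upsampling play those roles.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a 1-D feature sequence by an integer factor."""
    n = len(x) // factor * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

def upsample(x, factor):
    """Nearest-neighbour upsampling back to the finer resolution."""
    return np.repeat(x, factor)

def multi_scale_features(x, depth=3):
    """Extract features at successively coarser time scales, then resample
    them back to the input resolution and fuse -- a toy version of the
    successive downsampling/resampling idea."""
    scales = [x]
    for d in range(1, depth + 1):
        scales.append(downsample(x, 2 ** d))
    fused = np.zeros_like(x, dtype=float)
    for d, s in enumerate(scales):
        fused += upsample(s, 2 ** d)[: len(x)]
    return fused / len(scales)

x = np.arange(16, dtype=float)
y = multi_scale_features(x)
```

Because each scale halves the sequence length, the coarser branches cost progressively fewer operations, which is where the FLOP savings come from.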
We present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noise, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly to the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, using both objective metrics and human judgments. The proposed model matches the state-of-the-art performance of both causal and non-causal methods while working directly on the raw waveform.
Index Terms: speech enhancement, speech denoising, neural networks, raw waveform
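The idea of optimizing in both the time and frequency domains can be illustrated with a toy combined objective. The function names and the weighting `alpha` below are assumptions for illustration, not the paper's actual loss definitions.

```python
import numpy as np

def stft_mag(x, n_fft=64, hop=16):
    """Magnitude spectrogram via a simple framed FFT (Hann window)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def enhancement_loss(estimate, clean, alpha=0.5):
    """Hypothetical combined objective: L1 on the raw waveform plus L1 on
    STFT magnitudes, in the spirit of optimizing in both domains."""
    time_loss = np.mean(np.abs(estimate - clean))
    freq_loss = np.mean(np.abs(stft_mag(estimate) - stft_mag(clean)))
    return alpha * time_loss + (1 - alpha) * freq_loss

t = np.linspace(0, 1, 512, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(512)
```

The time-domain term keeps the waveform aligned sample by sample, while the spectral term penalizes audible magnitude errors the waveform loss alone can under-weight.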
EEND-SS is a joint model that performs speaker diarization, speech separation, and speaker counting in an end-to-end manner. It combines the EEND-EDA diarization model and Conv-TasNet separation model. EEND-SS uses a multi-task learning approach and multiple 1x1 convolution layers to estimate separation masks for a flexible number of speakers. Experimental results on the LibriMix dataset show it outperforms baseline methods in metrics like SDRi and DER, handling both a fixed and variable number of speakers.
Conv-TasNet is a convolutional neural network architecture for speech separation that directly operates on the raw audio waveform in the time domain. It surpasses previous approaches that perform separation in the time-frequency domain using magnitude masking. Conv-TasNet uses a convolutional encoder to transform short audio segments into representations, estimates separation masks for each source, and uses a convolutional decoder to reconstruct the sources. Experiments show Conv-TasNet achieves substantially better speech separation than ideal time-frequency masking baselines as measured by metrics like SI-SNRi and SDRi. The best performing model uses a small filter length, increasing numbers of encoder/decoder filters, and deep convolutional layers.
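A numpy sketch of the encode-mask-decode pipeline may help make mask-based separation concrete. The random `encoder`/`decoder` matrices below are stand-ins for Conv-TasNet's learned 1-D convolutional bases, and the non-overlapping segmentation is a simplification of its overlapping framing.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 32                              # segment length, number of basis filters
encoder = rng.standard_normal((N, L))      # stand-in for the learned analysis basis
decoder = np.linalg.pinv(encoder)          # stand-in for the learned synthesis basis

def separate(mixture, masks):
    """Encode segments into an N-dimensional representation, apply one mask
    per source, and decode each masked representation back to a waveform."""
    segs = mixture.reshape(-1, L)          # non-overlapping segments
    rep = segs @ encoder.T                 # (num_segs, N) representation
    return [((rep * m) @ decoder.T).reshape(-1) for m in masks]

mixture = rng.standard_normal(8 * L)
masks = [np.ones(N), np.zeros(N)]          # trivial masks, for illustration only
s0, s1 = separate(mixture, masks)
```

With an all-ones mask the pipeline reconstructs the mixture (the pseudo-inverse undoes the encoder), which is a handy sanity check before learning real masks.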
Optimized implementation of an innovative digital audio equalizer - a3labdsp
Digital audio equalization is one of the most common operations in the acoustic field, but its performance depends on computational complexity and filter design techniques. Starting from a previous FIR implementation based on multirate systems and filter-bank theory, an optimized digital audio equalizer is derived. The proposed approach employs IIR filters to improve the filter-bank structure developed to avoid ripple between adjacent bands. The effectiveness of the optimized implementation is shown by comparing it with the previous approach. The solution presented here has several advantages, improving equalization performance in terms of low computational complexity, low delay, and uniform frequency response.
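Assuming the standard RBJ-cookbook form of an IIR peaking filter (not the paper's own filter-bank design), a minimal multi-band IIR equalizer can be sketched as a cascade of biquads:

```python
import math

def peaking_biquad(f0, gain_db, q, fs):
    """RBJ-cookbook peaking-EQ biquad coefficients for one band."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * A, -2 * math.cos(w0), 1 - alpha * A]
    a = [1 + alpha / A, -2 * math.cos(w0), 1 - alpha / A]
    return [bi / a[0] for bi in b], [1.0, a[1] / a[0], a[2] / a[0]]

def biquad_filter(b, a, x):
    """Direct-form II transposed implementation of a single biquad."""
    y, z1, z2 = [], 0.0, 0.0
    for xn in x:
        yn = b[0] * xn + z1
        z1 = b[1] * xn - a[1] * yn + z2
        z2 = b[2] * xn - a[2] * yn
        y.append(yn)
    return y

def equalize(x, bands, fs=48000):
    """Tiny equalizer: cascade one peaking band per (centre_freq, gain_db)."""
    for f0, gain_db in bands:
        b, a = peaking_biquad(f0, gain_db, q=1.0, fs=fs)
        x = biquad_filter(b, a, x)
    return x

impulse = [1.0] + [0.0] * 31
boosted = equalize(impulse, [(1000.0, 6.0), (8000.0, -3.0)])
```

Each biquad costs only five multiplies per sample and one sample of group-delay structure, which illustrates the low-complexity, low-delay appeal of IIR equalization.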
A robust DOA-based smart antenna processor for GSM base stations - marwaeng
This document summarizes a robust smart antenna processor for GSM base stations that uses direction-of-arrival (DOA) estimation. It estimates DOAs in the uplink using multiple algorithms, including unitary ESPRIT and Capon's beamformer. It then tracks DOAs separately for the uplink and downlink to form antenna patterns that suppress interference. By adapting weights within each GSM frame, it provides up to a 35 dB improvement in signal-to-noise-and-interference ratio and outperforms conventional beamformers that place sharp nulls.
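Capon's beamformer, one of the DOA estimators mentioned above, can be sketched for a uniform linear array: scan candidate angles and evaluate the minimum-variance spatial spectrum, whose peaks indicate source directions. The array geometry and SNR below are illustrative assumptions.

```python
import numpy as np

def capon_spectrum(X, n_sensors, angles_deg, d=0.5):
    """Capon (MVDR) spatial spectrum for a uniform linear array.
    X: (n_sensors, n_snapshots) complex snapshots; d: spacing in wavelengths."""
    R = X @ X.conj().T / X.shape[1]                  # sample covariance
    Rinv = np.linalg.inv(R + 1e-6 * np.eye(n_sensors))
    p = []
    for theta in np.deg2rad(angles_deg):
        a = np.exp(-2j * np.pi * d * np.arange(n_sensors) * np.sin(theta))
        p.append(1.0 / np.real(a.conj() @ Rinv @ a))  # MVDR output power
    return np.array(p)

# One source at 20 degrees impinging on an 8-element half-wavelength ULA.
rng = np.random.default_rng(1)
n, snaps = 8, 200
theta0 = np.deg2rad(20.0)
a0 = np.exp(-2j * np.pi * 0.5 * np.arange(n) * np.sin(theta0))
s = rng.standard_normal(snaps) + 1j * rng.standard_normal(snaps)
X = np.outer(a0, s) + 0.1 * (rng.standard_normal((n, snaps))
                             + 1j * rng.standard_normal((n, snaps)))
angles = np.arange(-90, 91)
spec = capon_spectrum(X, n, angles)
```

The argmax of `spec` over the angle grid gives the DOA estimate; in a tracker, this would be recomputed frame by frame.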
This document discusses using RNN-LSTM algorithms to classify music genres based on audio data. It begins with an abstract describing music genre classification and the use of RNN and LSTM algorithms. It then discusses the existing methods, proposed improvements, and experiments conducted. The proposed system uses MFCC features extracted from audio clips to train an LSTM model to classify songs into 10 genres. It achieves a train accuracy of 94% and test accuracy of 75% after training the model on audio clips split into 3-second segments. The document concludes that LSTM networks are well-suited for this task due to their ability to learn long-term dependencies from audio data.
Here, we implement a CNN on an FPGA by incorporating a novel convolution technique that combines pipelining with parallelism, optimizing the balance between the two.
Experimental simulation and real-world study on Wi-Fi ad-hoc mode for differe... - Nazmul Hossain Rakib
The ad-hoc mode for wireless communication is used infrequently, but demand for Wi-Fi communication is continuously increasing, as smartphones and laptops have become enormously popular in recent years. Through ad-hoc mode, users can communicate point to point with mobility, without relying on any central BSS. Wireless ad-hoc mode uses electromagnetic waves, so the technology suffers losses and limitations caused by the free-space propagation medium as well as attenuation due to interference.
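One of the free-space limitations mentioned can be quantified with the Friis free-space path loss formula; the link parameters below are illustrative.

```python
import math

def fspl_db(distance_m, freq_hz):
    """Free-space path loss (Friis) in dB, one source of the attenuation
    an ad-hoc Wi-Fi link experiences over the air."""
    c = 299_792_458.0
    return (20 * math.log10(distance_m)
            + 20 * math.log10(freq_hz)
            + 20 * math.log10(4 * math.pi / c))

# A 2.4 GHz link at 100 m loses roughly 80 dB in free space alone.
loss = fspl_db(100, 2.4e9)
```

Because the loss grows with 20*log10(d), every doubling of range costs about 6 dB, before counting interference or obstructions.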
Grant Reaber, “Wavenet and Wavenet 2: Generating high-quality audio with neura...” - Lviv Startup Club
WaveNet and WaveNet 2 are neural network models that can directly generate audio waveforms from text. WaveNet produces the highest quality text-to-speech but is slow, taking minutes to generate seconds of audio. WaveNet 2 speeds this up by 3000x through a "distillation" technique that trains a faster model using the original WaveNet. Both models are autoregressive, generating each audio sample conditioned on previous samples, and can be conditioned on text to enable text-to-speech synthesis.
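The autoregressive generation loop itself is simple to sketch; the `toy_model` below is a placeholder, since a real WaveNet replaces it with a deep dilated-convolution network that outputs a distribution over the next sample.

```python
import numpy as np

def generate(model, seed, n_samples, context=16):
    """Autoregressive generation: each new sample is predicted from the
    previous `context` samples, WaveNet-style (`model` is a stand-in)."""
    samples = list(seed)
    for _ in range(n_samples):
        ctx = np.array(samples[-context:])
        samples.append(model(ctx))
    return np.array(samples)

# Toy "model": predicts the mean of its context window.
toy_model = lambda ctx: float(ctx.mean())
out = generate(toy_model, seed=[0.0, 1.0], n_samples=10)
```

The sequential dependence in this loop is exactly why vanilla WaveNet is slow, and why the distilled model's parallel sampling is such a large speedup.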
Design and implementation of different audio restoration techniques for audio... - eSAT Journals
This document summarizes research on designing and implementing different audio restoration techniques for removing distortions like clipping, clicks, and broadband noise from audio signals. It presents methods for declipping audio using sparse representations and frame-based reconstruction. Clicks are addressed using an adaptive filtering method, and broadband noise is reduced via spectral subtraction. The performance of these techniques is evaluated using metrics like SNR and algorithms like OMP. Hardware implementation of click removal is done on a TMS320C6713 DSK board using tools like MATLAB and Code Composer Studio.
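Of the techniques listed, spectral subtraction is the simplest to sketch. The version below is a minimal illustration with non-overlapping rectangular frames and a fixed noise-magnitude estimate; practical implementations add windowing, overlap-add, and over-subtraction control.

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag_est, n_fft=256):
    """Frame-wise spectral subtraction: subtract an estimated noise magnitude
    spectrum from each frame and resynthesize using the noisy phase."""
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - n_fft + 1, n_fft):
        spec = np.fft.rfft(noisy[start:start + n_fft])
        mag = np.maximum(np.abs(spec) - noise_mag_est, 0.0)  # half-wave rectify
        out[start:start + n_fft] = np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)), n_fft)
    return out

rng = np.random.default_rng(0)
noise_frame = rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(noise_frame))
signal = np.tile(noise_frame, 4)          # pure noise, repeated frame by frame
denoised = spectral_subtraction(signal, noise_mag)
```

With an exact noise estimate the noise frames cancel completely, which makes a convenient correctness check for the frame loop.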
Methodology of Implementing the Pulse code techniques for Distributed Optical... - Editor IJCATR
In recent research, coding techniques have been used in the OTDR approach to improve the signal-to-noise ratio (SNR). For example, simplex coding (S-coding) in conjunction with OTDR can effectively enhance the SNR of the backscattered detected light without sacrificing spatial resolution; in particular, simplex codes have been demonstrated to be the most efficient among suitable coding techniques, allowing a good improvement in SNR even at short code lengths. Coding techniques based on simplex or Golay codes exploit a set of different sequences (i.e., codes) of short (about 10 ns) NRZ laser pulses to increase the launched energy without impairing the spatial resolution through a longer pulse width. However, the required high repetition rate of the laser pulses, hundreds of MHz for meter-scale spatial resolution, is not achievable by high-peak-power lasers, such as rare-earth-doped fibre or passively Q-switched ones, which feature a maximum repetition rate of a few hundred kHz. A new coding technique, cyclic simplex coding (a subclass of simplex coding), tailored to high-power pulsed lasers, has been proposed. The basic idea is to periodically sense the probing fibre with a multi-pulse pattern whose repetition period is equal to the fibre round-trip time. This way, the pattern results in a code spread along the whole fibre, with a bit time inversely proportional to the code length. The pulse width can be kept on the order of 10 ns to guarantee meter-scale spatial resolution, and the peak power can be set close to the nonlinear-effect threshold.
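The decoding step behind simplex coding is linear algebra: probe with pulse patterns given by the rows of an S-matrix, then solve for the single-pulse responses. A toy numpy sketch (illustrative sizes, no OTDR physics):

```python
import numpy as np

def simplex_matrix(order):
    """S-matrix of size (2^k - 1): build a Sylvester Hadamard matrix, drop
    its first row and column, and map +1 -> 0, -1 -> 1."""
    H = np.array([[1]])
    while H.shape[0] < order + 1:
        H = np.block([[H, H], [H, -H]])
    return (1 - H[1:, 1:]) // 2

# Each coded shot launches the multi-pulse pattern given by one row of S;
# the individual single-pulse traces are recovered by solving the system.
rng = np.random.default_rng(0)
S = simplex_matrix(7).astype(float)
responses = rng.standard_normal((7, 32))      # toy single-pulse traces
measurements = S @ responses                  # noiseless coded shots
decoded = np.linalg.solve(S, measurements)
```

The SNR benefit comes from each shot carrying the energy of many pulses while the decoding averages their noise; the pulse width, and hence the spatial resolution, is untouched.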
Digital Watermarking Applications and Techniques: A Brief Review - Editor IJCATR
The widespread availability of digital data such as audio, images, and videos became possible to the public through the expansion of the internet. Digital watermarking technology is being adopted to ensure and facilitate data authentication, security, and copyright protection of digital media. It is considered one of the most important technologies in today's world for preventing illegal copying of data. Digital watermarking can be applied to audio, video, text, or images. This paper includes a detailed study of the definition of watermarking and the various watermarking applications and techniques used to enhance data security.
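As a concrete (if deliberately naive) example of one watermarking technique, least-significant-bit embedding hides the watermark bits in the LSBs of 16-bit audio samples. This is for illustration only; LSB marks are fragile and not robust to compression or resampling.

```python
import numpy as np

def embed_lsb(samples, bits):
    """Embed watermark bits into the least-significant bits of 16-bit
    audio samples (distortion of at most one quantization step)."""
    out = samples.copy()
    out[:len(bits)] = (out[:len(bits)] & ~1) | np.array(bits, dtype=np.int16)
    return out

def extract_lsb(samples, n_bits):
    """Read the watermark back out of the sample LSBs."""
    return (samples[:n_bits] & 1).astype(int).tolist()

audio = np.random.default_rng(0).integers(-2000, 2000, 64).astype(np.int16)
mark = [1, 0, 1, 1, 0, 0, 1, 0]
watermarked = embed_lsb(audio, mark)
```

Robust schemes instead embed in perceptually significant transform coefficients, trading imperceptibility against survivability, which is exactly the design space the review surveys.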
Hardware Architecture of Complex K-best MIMO Decoder - CSCJournals
This paper presents a hardware architecture of a complex K-best Multiple Input Multiple Output (MIMO) decoder that reduces the complexity of the Maximum Likelihood (ML) detector. We develop a novel low-power VLSI design of a complex K-best decoder for MIMO with a 64-QAM modulation scheme. Use of Schnorr-Euchner (SE) enumeration and a new parameter, Rlimit, in the design reduces the complexity of calculating the K best nodes to a certain level with increased performance. A total word length of only 16 bits has been adopted for the hardware design, limiting the bit error rate (BER) degradation to 0.3 dB with the list size K and Rlimit equal to 4. The proposed VLSI architecture is modeled in Verilog HDL using Xilinx and synthesized using Synopsys Design Vision in 45 nm CMOS technology. According to the synthesis results, it achieves 1090.8 Mbps throughput with a power consumption of 782 mW and a latency of 0.33 us. The maximum frequency of the proposed design is 181.8 MHz.
This document describes a simulator designed to analyze bit error rates using orthogonal frequency division multiplexing (OFDM) under different modulation schemes and channel conditions. The simulator was implemented in MATLAB and allows users to choose modulation types, channel types (AWGN, Rayleigh, Rician), and other parameters. It then generates plots of bit error rate versus signal-to-noise ratio for performance analysis. Screenshots of the user interface are provided along with sample output plots and discussion of the simulator design and capabilities.
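The core of such a simulator is a Monte-Carlo BER loop. A minimal numpy version for BPSK over AWGN, checked against the closed-form error rate (the MATLAB tool described above covers more modulations and fading channels):

```python
import math
import numpy as np

def simulate_ber(snr_db, n_bits=200_000, seed=0):
    """Monte-Carlo bit error rate for BPSK over AWGN, the kind of point the
    simulator plots for each SNR on its BER curves."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_bits)
    symbols = 2 * bits - 1                       # map 0/1 -> -1/+1
    noise_std = math.sqrt(1 / (2 * 10 ** (snr_db / 10)))  # N0/2 per dimension
    received = symbols + noise_std * rng.standard_normal(n_bits)
    return float(np.mean((received > 0).astype(int) != bits))

def theoretical_ber(snr_db):
    """Closed form for BPSK/AWGN: Q(sqrt(2*Eb/N0)) = 0.5*erfc(sqrt(Eb/N0))."""
    return 0.5 * math.erfc(math.sqrt(10 ** (snr_db / 10)))
```

Sweeping `snr_db` and plotting `simulate_ber` against `theoretical_ber` reproduces the waterfall curves such a simulator displays; fading channels would scale the symbols by random channel gains before adding noise.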
IRJET - A Review on Audible Sound Analysis based on State Clustering throu... - IRJET Journal
This document reviews progress in acoustic modeling for statistical parametric speech synthesis, from early hidden Markov models to recent neural network approaches. It discusses how hidden Markov models were previously dominant but artificial neural networks are now replacing them due to improvements in naturalness. The document also examines developing accurate audio classifiers using machine learning techniques on public datasets and improving classification accuracy beyond current levels of 50-79% by employing strategies like cross-validation.
Mobile Networking and Mobile Ad Hoc Routing Protocol Modeling - IOSR Journals
This document summarizes a research paper about modeling mobile ad hoc networks and routing protocols. It discusses modeling the network topology and connectivity between nodes. It also describes modeling the routing protocol instances running on each node. The document outlines different approaches to modeling the network, including explicitly modeling topology and transitions, parameterizing based on network properties, and developing a universal quantification model. It discusses techniques for coping with state explosion when modeling mobile ad hoc networks, such as symbolic representation, partial order reduction, and abstraction.
Diversity combining is a technique in wireless networks that uses a multiple-antenna system to improve the quality of the radio signal. Mobile radio systems suffer multipath propagation due to signal obstruction in the channel. A new hybridized diversity combining scheme consisting of Equal Gain Combining (EGC) and Maximal Ratio Combining (MRC) is proposed in this paper. The performance of the hybrid model was evaluated using Outage Probability (Pout) and Processing time (Pt) at different Signal-to-Noise Ratios (SNR) and Signal Paths (L=2, 3) for 4-QAM and 8-QAM modulation schemes. A mathematical expression for the hybrid EGC-MRC was realized using the Probability Density Function (PDF) of the Nakagami fading channel. MATLAB R2015b software was used for the model simulation. The results show that hybrid EGC-MRC outperforms the standalone EGC and MRC schemes by having lower Pout and Pt values. Hence, hybrid EGC-MRC exhibits enhanced potential to mitigate multipath propagation at reduced system complexity.
A NEW HYBRID DIVERSITY COMBINING SCHEME FOR MOBILE RADIO COMMUNICATION SYSTEM... - ijcsit
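The two constituent combiners can be sketched directly in numpy: MRC weights each branch by its conjugate channel gain, while EGC only co-phases the branches before summing. The Rayleigh channel and SNR below are illustrative assumptions, not the paper's Nakagami setup.

```python
import numpy as np

def mrc(received, channel):
    """Maximal ratio combining: weight each branch by its conjugate gain."""
    return np.sum(np.conj(channel) * received, axis=0)

def egc(received, channel):
    """Equal gain combining: co-phase the branches, then sum with unit gain."""
    return np.sum(np.exp(-1j * np.angle(channel)) * received, axis=0)

rng = np.random.default_rng(0)
L, n = 3, 10_000                                   # diversity branches, symbols
symbols = rng.choice([-1.0, 1.0], n)               # BPSK for simplicity
h = (rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n))) / np.sqrt(2)
noise = (rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n))) / np.sqrt(2)
r = h * symbols + noise

ber_mrc = np.mean((np.real(mrc(r, h)) > 0) != (symbols > 0))
ber_egc = np.mean((np.real(egc(r, h)) > 0) != (symbols > 0))
ber_single = np.mean((np.real(np.conj(h[0]) * r[0]) > 0) != (symbols > 0))
```

Both combiners clearly beat a single branch, with MRC the optimum and EGC close behind at lower complexity; a hybrid scheme tries to sit between those two trade-off points.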
This document describes a space-time block coding (STBC) orthogonal frequency-division multiplexing (OFDM) system for text message transmission over fading channels using multiple transmit antennas. It evaluates the bit error rate (BER) performance of the system using different digital modulation schemes (BPSK, QPSK, QAM-8) over additive white Gaussian noise (AWGN) channels and fading channels. Low-density parity-check (LDPC) coding is concatenated with convolutional coding in the system to improve error performance. Simulation results show that the system is effective in retrieving the transmitted text message under noise and fading conditions, and that BER performance degrades with increasing noise power as expected.
This document summarizes a research paper that proposes a space-time block coded (STBC) orthogonal frequency-division multiplexing (OFDM) system for text message transmission over fading channels using multiple transmit antennas. The system utilizes low-density parity-check channel coding concatenated with convolutional coding. Simulation results show that the proposed system achieves good error rate performance, especially when using BPSK modulation with 2x4 transmit antennas in AWGN, Rayleigh, and Rician fading channels. The system is effective in properly identifying and retrieving transmitted text messages in noisy and fading environments.
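The standard two-antenna STBC behind such systems is the Alamouti scheme, which can be sketched in numpy; the decoder's linear combining recovers each symbol scaled by the total channel energy. The channel values below are illustrative and the sketch is noiseless.

```python
import numpy as np

def alamouti_encode(s):
    """Map symbol pairs (s0, s1) onto two antennas over two slots:
    slot 1 sends (s0, s1), slot 2 sends (-s1*, s0*)."""
    s0, s1 = s[0::2], s[1::2]
    tx1 = np.empty(len(s), complex)
    tx2 = np.empty(len(s), complex)
    tx1[0::2], tx1[1::2] = s0, -np.conj(s1)
    tx2[0::2], tx2[1::2] = s1, np.conj(s0)
    return tx1, tx2

def alamouti_decode(r, h1, h2):
    """Linear combining at one receive antenna, assuming the channel is
    constant over each pair of slots; output = (|h1|^2 + |h2|^2) * symbol."""
    r0, r1 = r[0::2], r[1::2]
    s0 = np.conj(h1) * r0 + h2 * np.conj(r1)
    s1 = np.conj(h2) * r0 - h1 * np.conj(r1)
    return np.ravel(np.column_stack([s0, s1]))

rng = np.random.default_rng(0)
sym = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], 8)   # QPSK symbols
h1, h2 = 0.8 - 0.3j, 0.2 + 0.9j
tx1, tx2 = alamouti_encode(sym)
r = h1 * tx1 + h2 * tx2                                   # noiseless channel
est = alamouti_decode(r, h1, h2) / (abs(h1) ** 2 + abs(h2) ** 2)
```

In the full system each OFDM subcarrier would carry one such pair, with LDPC and convolutional decoding applied after the combining step.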
Open Source SDR Frontend and Measurements for 60-GHz Wireless Experimentation - AndreaDriutti
Fast-track thesis for the bachelor's degree in Electronic and Computer Engineering.
ENERGY EFFICIENCY OF MIMO COOPERATIVE NETWORKS WITH ENERGY HARVESTING SENSOR ... - ijasuc
This paper addresses the problem of maximizing network lifetime in wireless sensor networks (WSNs), taking into account the total symbol error rate (SER) at the destination. Efficient power management is therefore needed to extend network lifetime. Our approach consists of providing the optimal transmission power using the orthogonal multiple-access channels between sensors. In order to study the properties of our approach in depth, the simple case is considered first: the information sensed by the source node passes through a single relay before reaching the destination node. Secondly, the global case is studied, in which the information passes through several relays. In both of these cases we consider the batteries to be non-rechargeable. Thirdly, we extend our work to the case where the batteries are rechargeable with unlimited storage capacity. In all three cases, we assume that Maximum Ratio Combining (MRC) is used as the detector and Amplify-and-Forward (AF) as the relaying strategy. Simulation results show the viability of our approach, with network lifetime extended by more than 70.72% when the batteries are non-rechargeable and 100.51% when the batteries are rechargeable, in comparison with a traditional method.
IRJET - Compressed Sensing based Modified Orthogonal Matching Pursuit in DTTV ... - IRJET Journal
This document discusses a modified orthogonal matching pursuit algorithm used for channel estimation in digital terrestrial television systems. It proposes using compressed sensing based channel estimation at the receiver to eliminate sparse information. Thresholding is used to remove noise from the channel estimation and improve signal quality. Simulation results show that bit error rate decreases when the received signal power from different transmitters is almost equal.
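The orthogonal matching pursuit core (before the document's thresholding modification) is short enough to sketch: greedily select the dictionary column most correlated with the residual, then least-squares re-fit over the chosen support. Problem sizes below are illustrative.

```python
import numpy as np

def omp(A, y, sparsity):
    """Orthogonal matching pursuit for y = A x with a sparse x: pick the
    most-correlated column, re-fit by least squares, repeat."""
    residual, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
A /= np.linalg.norm(A, axis=0)               # unit-norm dictionary columns
x_true = np.zeros(100)
x_true[[7, 23, 61]] = [1.5, -2.0, 0.8]       # sparse channel taps (toy)
x_hat = omp(A, A @ x_true, sparsity=3)
```

In channel estimation, `x_true` plays the role of the sparse delay-domain impulse response; the modified algorithm additionally thresholds small coefficients to suppress noise-only taps.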
Research Inventy: International Journal of Engineering and Science - researchinventy
Research Inventy: International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open-access journal, available both online and in print, that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic, and computer engineering, as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers are published by a rapid process within 20 days of acceptance, and the peer-review process takes only 7 days. All articles published in Research Inventy are peer-reviewed.
The document describes the design of a low-power preamplifier integrated circuit for cochlear implants using a split folded-cascode technique. This technique splits the input transistors into two branches with equal aspect ratios, increasing the overall transconductance by a factor of 1.414 compared to a normal folded cascode. Simulations of the proposed preamplifier design in Cadence Virtuoso using a 180 nm process show a mid-band gain of 43.7 dB, a bandwidth of 18-20 kHz, and an input-referred noise of 473.47 nV/√Hz at 4 kHz, while consuming 4.47 μW from a 1.8 V supply. The split folded-cascode technique thus enhances performance over the normal cascode design.
Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, such as visual and acoustic classification, extraction of multimedia descriptors or image and video coding. The trained neural networks for these applications contain a large number of parameters (weights), resulting in a considerable size. Thus, transferring them to a number of clients using them in applications (e.g., mobile phones, smart cameras) benefits from a compressed representation of neural networks.
MPEG Neural Network Coding and Representation (NNR) is the first international standard for efficient compression of neural networks (NNs). The standard is designed as a toolbox of compression methods, which can be used to create coding pipelines. It can be used either as an independent coding framework (with its own bitstream format) or together with external neural network formats and frameworks. To provide the highest degree of flexibility, the network compression methods operate per parameter tensor in order to always ensure proper decoding, even if no structure information is provided. The standard contains compression-efficient quantization and an arithmetic coding scheme (DeepCABAC) as core encoding and decoding technologies, as well as neural network parameter pre-processing methods like sparsification, pruning, low-rank decomposition, unification, local scaling, and batch-norm folding. NNR achieves a compression efficiency of more than 97% for transparent coding cases, i.e. without degrading classification quality, such as top-1 or top-5 accuracies.
This talk presents an overview of the context, technical features, and characteristics of the NN coding standard, and discusses ongoing topics such as incremental neural network representation.
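Two of the listed tools, uniform quantization and magnitude-based sparsification, can be sketched in numpy. These are much-simplified stand-ins: the standard's actual pipeline applies DeepCABAC entropy coding on top and offers several further pre-processing methods.

```python
import numpy as np

def quantize(weights, bits=4):
    """Uniform scalar quantization of a weight tensor to 2^bits levels."""
    levels = 2 ** bits
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (levels - 1)
    idx = np.round((weights - lo) / step).astype(int)   # integer code per weight
    return idx, lo, step

def dequantize(idx, lo, step):
    """Reconstruct approximate weights from the integer codes."""
    return lo + idx * step

def sparsify(weights, keep_ratio=0.1):
    """Magnitude pruning: zero all but the largest-magnitude weights."""
    k = max(1, int(weights.size * keep_ratio))
    thresh = np.sort(np.abs(weights).ravel())[-k]
    return np.where(np.abs(weights) >= thresh, weights, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # stand-in weight tensor
idx, lo, step = quantize(W, bits=4)
W_hat = dequantize(idx, lo, step)
```

After quantization, the integer codes and sparse patterns are highly compressible, which is what the entropy-coding stage exploits; "transparent" operating points choose `bits` and `keep_ratio` so task accuracy is unchanged.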
PERFORMANCE ANALYSIS OF BARKER CODE BASED ON THEIR CORRELATION PROPERTY IN MU... - ijistjournal
This document discusses performance analysis of Barker codes based on their correlation properties in a multi-user environment. It analyzes the auto-correlation and cross-correlation properties of long Barker codes. Barker codes have good auto-correlation properties and certain code pairs were found to have low cross-correlation, making them suitable for multi-user environments. The document also describes direct sequence spread spectrum modulation using pseudo-noise codes and the despreading process at the receiver.
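The correlation property itself is easy to verify numerically: the length-13 Barker code has an aperiodic autocorrelation peak of 13 with all sidelobe magnitudes at most 1.

```python
import numpy as np

# The length-13 Barker sequence in +1/-1 form.
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1])

def autocorr(code):
    """Aperiodic autocorrelation at every lag."""
    return np.correlate(code, code, mode='full')

acf = autocorr(barker13)
peak = acf[len(barker13) - 1]                 # zero-lag value equals the length
sidelobes = np.delete(acf, len(barker13) - 1)
```

The same `np.correlate` call with two different codes gives the cross-correlation used to judge whether a code pair is suitable for a multi-user spread-spectrum environment.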
This document summarizes a research paper on efficient Transformer-based speech enhancement using long frames and STFT magnitudes. It introduces the task of speech enhancement and issues with learned-domain approaches. The proposed method uses STFT magnitudes as input to an encoder-decoder architecture with a Transformer masker. Experiments show the method achieves equivalent quality and intelligibility scores compared to learned features, while reducing computations by 8x for 10-second utterances. The conclusions are that using STFT magnitudes enables efficient, high-quality speech enhancement on embedded devices.
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Lviv Startup Club
WaveNet and WaveNet 2 are neural network models that can directly generate audio waveforms from text. WaveNet produces the highest quality text-to-speech but is slow, taking minutes to generate seconds of audio. WaveNet 2 speeds this up by 3000x through a "distillation" technique that trains a faster model using the original WaveNet. Both models are autoregressive, generating each audio sample conditioned on previous samples, and can be conditioned on text to enable text-to-speech synthesis.
Design and implementation of different audio restoration techniques for audio...eSAT Journals
This document summarizes research on designing and implementing different audio restoration techniques for removing distortions like clipping, clicks, and broadband noise from audio signals. It presents methods for declipping audio using sparse representations and frame-based reconstruction. Clicks are addressed using an adaptive filtering method, and broadband noise is reduced via spectral subtraction. The performance of these techniques is evaluated using metrics like SNR and algorithms like OMP. Hardware implementation of click removal is done on a TMS320C6713 DSK board using tools like MATLAB and Code Composer Studio.
Methodology of Implementing the Pulse code techniques for Distributed Optical...Editor IJCATR
In recent researches Coding techniques are used in OTDR approach improve Signal-to-Noise Ratio (SNR). For
example, the use of simplex coding (S-coding) in conjunction with OTDR can be effectively used to enhance the Signal-to-
Noise Ratio (SNR) of the backscattered detected light without sacrificing the spatial resolution; In particular, simplex codes
have been demonstrated to be the most efficient among other suitable coding techniques, allowing for a good improvement in
SNR even at short code lengths. Coding techniques based on Simplex or Golay codes exploit a set of different sequences (i.e.
codes) of short (about 10 ns) NRZ laser pulses to increase the launched energy without impairing the spatial resolution using
longer pulse width. However, the required high repetition rate of the laser pulses, hundreds of MHz for meter-scale spatial
resolution, is not achievable by high peak power lasers, such as rare-earth doped fibre or passive Q-switched ones, which
feature a maximum repetition rate of few hundred kHz. New coding technique, cyclic simplex coding (a subclass of simplex
coding), tailored to high-power pulsed lasers has been proposed. The basic idea is to periodically sense the probing fibre with
a multi-pulse pattern, the repetition period of which is equal to the fibre round-trip time. This way, the pattern results as a code
spread along the whole fibre, with a bit time inversely proportional to the code length. The pulse width can be kept in the order
of 10 ns to guarantee a meter-scale spatial resolution and the peak power can be set close to the nonlinear effect threshold.
Digital Watermarking Applications and Techniques: A Brief ReviewEditor IJCATR
The expansion of the internet has made digital data such as audio, images and videos widely available to the public. Digital
watermarking technology is being adopted to ensure and facilitate data authentication, security and copyright protection of
digital media. It is considered one of the most important technologies in today's world for preventing illegal copying of data.
Digital watermarking can be applied to audio, video, text or images. This paper includes a detailed study of the definition of
watermarking and of the various watermarking applications and techniques used to enhance data security.
Hardware Architecture of Complex K-best MIMO Decoder (CSCJournals)
This paper presents a hardware architecture of a complex K-best Multiple Input Multiple Output (MIMO) decoder that reduces the complexity of the Maximum Likelihood (ML) detector. We develop a novel low-power VLSI design of a complex K-best decoder for MIMO and the 64-QAM modulation scheme. The use of Schnorr-Euchner (SE) enumeration and a new parameter, Rlimit, reduces the complexity of calculating the K-best nodes while increasing performance. A total word length of only 16 bits has been adopted for the hardware design, limiting the bit error rate (BER) degradation to 0.3 dB with list size K and Rlimit equal to 4. The proposed VLSI architecture is modeled in Verilog HDL using Xilinx and synthesized with Synopsys Design Vision in 45 nm CMOS technology. According to the synthesis results, it achieves 1090.8 Mbps throughput with a power consumption of 782 mW and a latency of 0.33 us. The maximum frequency of the proposed design is 181.8 MHz.
This document describes a simulator designed to analyze bit error rates using orthogonal frequency division multiplexing (OFDM) under different modulation schemes and channel conditions. The simulator was implemented in MATLAB and allows users to choose modulation types, channel types (AWGN, Rayleigh, Rician), and other parameters. It then generates plots of bit error rate versus signal-to-noise ratio for performance analysis. Screenshots of the user interface are provided along with sample output plots and discussion of the simulator design and capabilities.
IRJET- A Review on Audible Sound Analysis based on State Clustering throu... (IRJET Journal)
This document reviews progress in acoustic modeling for statistical parametric speech synthesis, from early hidden Markov models to recent neural network approaches. It discusses how hidden Markov models were previously dominant but artificial neural networks are now replacing them due to improvements in naturalness. The document also examines developing accurate audio classifiers using machine learning techniques on public datasets and improving classification accuracy beyond current levels of 50-79% by employing strategies like cross-validation.
Mobile Networking and Mobile Ad Hoc Routing Protocol Modeling (IOSR Journals)
This document summarizes a research paper about modeling mobile ad hoc networks and routing protocols. It discusses modeling the network topology and connectivity between nodes. It also describes modeling the routing protocol instances running on each node. The document outlines different approaches to modeling the network, including explicitly modeling topology and transitions, parameterizing based on network properties, and developing a universal quantification model. It discusses techniques for coping with state explosion when modeling mobile ad hoc networks, such as symbolic representation, partial order reduction, and abstraction.
Diversity combining is a technique in wireless networks that uses multiple antennas to improve the quality of the radio signal. Mobile radio systems suffer multipath propagation due to signal obstruction in the channel. A new hybridized diversity combining scheme consisting of Equal Gain Combining (EGC) and Maximal Ratio Combining (MRC) was proposed in this paper. The performance of the hybrid model was evaluated using Outage Probability (Pout) and Processing time (Pt) at different Signal-to-Noise Ratios (SNR) and Signal Paths (L=2,3) for 4-QAM and 8-QAM modulation schemes. A mathematical expression for the hybrid EGC-MRC was realized using the Probability Density Function (PDF) of the Nakagami fading channel. MATLAB R2015b software was used for the model simulation. The results show that hybrid EGC-MRC outperforms the standalone EGC and MRC schemes by having lower Pout and Pt values. Hence, hybrid EGC-MRC exhibits enhanced potential to mitigate multipath propagation at reduced system complexity.
A NEW HYBRID DIVERSITY COMBINING SCHEME FOR MOBILE RADIO COMMUNICATION SYSTEM... (ijcsit)
This document describes a space-time block coding (STBC) orthogonal frequency-division multiplexing (OFDM) system for text message transmission over fading channels using multiple transmit antennas. It evaluates the bit error rate (BER) performance of the system using different digital modulation schemes (BPSK, QPSK, QAM-8) over additive white Gaussian noise (AWGN) channels and fading channels. Low-density parity-check (LDPC) coding is concatenated with convolutional coding in the system to improve error performance. Simulation results show that the system is effective in retrieving the transmitted text message under noise and fading conditions, and that BER performance degrades with increasing noise power as expected.
This document summarizes a research paper that proposes a space-time block coded (STBC) orthogonal frequency-division multiplexing (OFDM) system for text message transmission over fading channels using multiple transmit antennas. The system utilizes low-density parity-check channel coding concatenated with convolutional coding. Simulation results show that the proposed system achieves good error rate performance, especially when using BPSK modulation with 2x4 transmit antennas in AWGN, Rayleigh, and Rician fading channels. The system is effective in properly identifying and retrieving transmitted text messages in noisy and fading environments.
Open Source SDR Frontend and Measurements for 60-GHz Wireless Experimentation (AndreaDriutti)
Summary of Open Source SDR Frontend and Measurements for 60-GHz Wireless Experimentation
Tesi fast track, laurea triennale Ingegneria Elettronica e Informatica.
ENERGY EFFICIENCY OF MIMO COOPERATIVE NETWORKS WITH ENERGY HARVESTING SENSOR ... (ijasuc)
This paper addresses the problem of maximizing network lifetime in wireless sensor networks (WSNs), taking into account the
total Symbol Error Rate (SER) at the destination. Efficient power management is therefore needed to extend network lifetime.
Our approach consists of providing the optimal transmission power using the orthogonal multiple-access channels between
each sensor. In order to study the properties of our approach in depth, a simple case is considered first: the information
sensed by the source node passes through a single relay before reaching the destination node. Second, the global case is
studied, in which the information passes through several relays. In both of these cases, we consider the batteries to be
non-rechargeable. Third, we extend our work to the case where the batteries are rechargeable with unlimited storage capacity.
In all three cases, we assume that Maximum Ratio Combining (MRC) is used as the detector and Amplify-and-Forward (AF) as
the relaying strategy. Simulation results show the viability of our approach, with network lifetime extended by more than
70.72% when the batteries are non-rechargeable and by 100.51% when they are rechargeable, in comparison with the
traditional method.
IRJET- Compressed Sensing based Modified Orthogonal Matching Pursuit in DTTV ... (IRJET Journal)
This document discusses a modified orthogonal matching pursuit algorithm used for channel estimation in digital terrestrial television systems. It proposes using compressed sensing based channel estimation at the receiver to eliminate sparse information. Thresholding is used to remove noise from the channel estimation and improve signal quality. Simulation results show that bit error rate decreases when the received signal power from different transmitters is almost equal.
Research Inventy: International Journal of Engineering and Science (researchinventy)
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
The document describes the design of a low power preamplifier integrated circuit for cochlear implants using a split folded cascode technique. This technique splits the input transistors into two branches with equal aspect ratios, increasing the overall transconductance by 1.414 times compared to a normal folded cascode. Simulations of the proposed preamplifier design in Cadence Virtuoso using a 180nm process show a mid-band gain of 43.7 dB, bandwidth of 18-20 kHz, and input-referred noise of 473.47 nV/√Hz at 4 kHz, while consuming 4.47 μW from a 1.8V supply. The split folded cascode technique thus enhances performance over the normal folded cascode design.
Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, such as visual and acoustic classification, extraction of multimedia descriptors or image and video coding. The trained neural networks for these applications contain a large number of parameters (weights), resulting in a considerable size. Thus, transferring them to a number of clients using them in applications (e.g., mobile phones, smart cameras) benefits from a compressed representation of neural networks.
MPEG Neural Network Coding and Representation is the first international standard for efficient compression of neural networks (NNs). The standard is designed as a toolbox of compression methods, which can be used to create coding pipelines. It can be either used as an independent coding framework (with its own bitstream format) or together with external neural network formats and frameworks. For providing the highest degree of flexibility, the network compression methods operate per parameter tensor in order to always ensure proper decoding, even if no structure information is provided. The standard contains compression-efficient quantization and an arithmetic coding scheme (DeepCABAC) as core encoding and decoding technologies, as well as neural network parameter pre-processing methods like sparsification, pruning, low-rank decomposition, unification, local scaling, and batch norm folding. NNR achieves a compression efficiency of more than 97% for transparent coding cases, i.e. without degrading classification quality, such as top-1 or top-5 accuracies.
This talk presents an overview of the context, technical features, and characteristics of the NN coding standard, and discusses ongoing topics such as incremental neural network representation.
PERFORMANCE ANALYSIS OF BARKER CODE BASED ON THEIR CORRELATION PROPERTY IN MU... (ijistjournal)
This document discusses performance analysis of Barker codes based on their correlation properties in a multi-user environment. It analyzes the auto-correlation and cross-correlation properties of long Barker codes. Barker codes have good auto-correlation properties and certain code pairs were found to have low cross-correlation, making them suitable for multi-user environments. The document also describes direct sequence spread spectrum modulation using pseudo-noise codes and the despreading process at the receiver.
This document summarizes a research paper on efficient Transformer-based speech enhancement using long frames and STFT magnitudes. It introduces the task of speech enhancement and issues with learned-domain approaches. The proposed method uses STFT magnitudes as input to an encoder-decoder architecture with a Transformer masker. Experiments show the method achieves equivalent quality and intelligibility scores compared to learned features, while reducing computations by 8x for 10-second utterances. The conclusions are that using STFT magnitudes enables efficient, high-quality speech enhancement on embedded devices.
The document summarizes an academic paper on audio inpainting using generative adversarial networks (GANs). It discusses using a dual discriminator Wasserstein GAN (D2WGAN) model to fill in missing audio content of 500-550ms by extracting short and long-range borders. The model was tested on piano, guitar, and piano+orchestra datasets, with the guitar dataset achieving the best results. Increasing the training steps significantly improved performance without overfitting. Future work could explore different border lengths, additional network layers, and combining waveforms and spectrograms.
1) The document presents a frame-online DNN-WPE dereverberation method that uses a neural network to directly estimate the power spectral density (PSD) of target speech from observations for use in weighted prediction error (WPE) dereverberation.
2) Experiments on the REVERB challenge and WSJ+VoiceHome datasets show the DNN-based method improves word error rates over unprocessed baselines, particularly for online processing with two microphones.
3) While performance degrades with increased latency constraints and complex background noise, the DNN-WPE still provides an improvement over no enhancement and enables real-time dereverberation.
WaveNet is a generative model for raw audio developed by Google DeepMind that uses dilated causal convolutions to capture long-term dependencies in audio. It can generate speech that is subjectively natural and can be conditioned on speaker identity. The model was also shown to generate music from specific instruments and genres with varying levels of success depending on the conditioning information provided. Key aspects of the model include causal dilated convolutions to exponentially increase the receptive field, gated activation units, and residual and skip connections.
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E... (ssuser849b73)
A single model is presented that can perform acoustic echo cancellation, speech enhancement, and speech separation jointly using a conformer architecture. The model takes as input a reference signal, noise context, and target speaker embedding. Evaluation shows the joint model achieves performance close to task-specific models while significantly improving the noise robustness of a large-scale ASR system.
Wavesplit is an end-to-end speech separation model that leverages speaker identities during training. It uses a speaker clustering network to produce speaker representations and a separation network conditioned on the representations. During training, speaker labels are used to learn discriminative representations and optimize reconstruction quality. At test time, the model relies on clustering the representations to assign speakers without labels. The model achieves state-of-the-art results on speech separation benchmarks and can also be applied to other source separation tasks like separating heart rates from abdominal recordings.
SepFormer and DPTNet are Transformer-based models for monaural speech separation that achieve state-of-the-art performance. SepFormer uses dual-path Transformers to model short and long-term dependencies without RNNs, allowing parallel processing. DPTNet introduces an improved Transformer with a recurrent layer to directly model contextual information in speech sequences. Experiments on standard datasets show SepFormer achieves SOTA results and is faster to train and infer than RNN baselines like DPRNN. Both models obtain competitive separation but SepFormer has advantages in parallelization and efficiency due to its RNN-free design.
Literature Review Basics and Understanding Reference Management.pptx (Dr Ramhari Poudyal)
A three-day training on academic research, focusing on analytical tools, at United Technical College, supported by the University Grants Commission, Nepal. 24-26 May 2024
Low power architecture of logic gates using adiabatic techniques (nooriasukmaningtyas)
The growing significance of portable systems and the need to limit power consumption in ultra-large-scale-integration chips of very high density have recently led to rapid and inventive progress in low-power design. The most effective technique for energy-efficient hardware is adiabatic logic circuit design. This paper presents two adiabatic approaches for the design of low-power circuits: modified positive feedback adiabatic logic (modified PFAL) and direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design; by improving the performance of the basic gates, one can improve the performance of the whole system. In this paper, proposed low-power circuit designs of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches, and their results are analyzed for power dissipation, delay, power-delay product, and rise time, and compared with other adiabatic techniques as well as conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that designs using the DC-DB PFAL technique outperform the modified PFAL technique at 10 MHz, with percentage improvements of 65% for the NOR gate, 7% for the NAND gate, and 34% for the XNOR gate.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Understanding Inductive Bias in Machine Learning (SUTEJAS)
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
ACEP Magazine edition 4th launched on 05.06.2024 (Rahul)
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... (IJECEIAES)
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines (Christina Lin)
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
1. Compute and memory efficient universal sound source separation
Authors: Efthymios Tzinis · Zhepei Wang · Xilin Jiang · Paris Smaragdis
Published in the 2020 IEEE 30th International Workshop on MLSP, and as arXiv:2103.02644v2 [cs.SD], 14 Jul 2021
Presenter: 何冠勳
4. Introduction
– There are three fields in audio separation: speech separation, universal sound
separation and music source separation.
– Notable previous works:
– Conv-TasNet
– DPRNN
– DPTNet
– Demucs
– Two step sound source separation
– Despite the dramatic advances in source separation performance, the
computational complexity of the aforementioned methods might hinder their
extensive usage across multiple devices.
– Additionally, training such systems is also an expensive computational undertaking
which can amount to significant costs.
5. Introduction
– Several studies, mainly in the image domain, have introduced more efficient
architectures in order to overcome the growing concern of large models with high
computational requirements.
– E.g. depth-wise separable convolutions, dilated convolutions, meta-learning
– Despite the recent success of low-resource architectures in the image domain, little
progress has been made towards proposing efficient architectures for audio tasks
and especially source separation.
– Modern approaches, mainly in speech enhancement and music source separation,
have been focusing on developing models which are capable of real-time inference.
6. Introduction
– In this study, we propose a novel neural network architecture for audio source
separation while following a more holistic approach in terms of computational
resources that we take into consideration (FLOPs, latency and total memory
requirements).
– We entitle the model SuDoRM-RF: SUccessive DOwnsampling and Resampling of Multi-Resolution Features.
– We also propose improved versions and causal variations.
– The separation performance is comparable to or even better than that of several recent SOTA models on speech, environmental and universal sound separation tasks, with significantly lower computational requirements.
8. Architecture
– On par with many SOTA approaches, SuDoRM-RF performs end-to-end audio
source separation using a mask-based architecture with adaptive encoder and
decoder basis.
– We have extended our basic model in order to also remove the mask estimation
process by introducing SuDoRM-RF++ that directly estimates the latent
representations of the sources.
– Besides the original model, we also propose:
– an improved version, SuDoRM-RF++
– SuDoRM-RF++ with group communication
– a causal variation, C-SuDoRM-RF++
9.
10. Architecture – Encoder
– The encoder E architecture consists of a one-dimensional convolution with kernel
size KE and stride equal to KE/2 similar to Conv-TasNet.
– We force the output of the encoder to be strictly non-negative by applying a rectified linear unit (ReLU) activation on top of the output of the 1D-convolution.
(channel, kernel, stride)
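As a sketch of this encoder (a plain-numpy illustration, not the authors' implementation; the filter count, kernel size and input below are placeholder values), the strided 1D convolution followed by ReLU can be written as:

```python
import numpy as np

def encoder(x, W, stride):
    """Strided 1D convolution followed by ReLU, as in the SuDoRM-RF encoder.
    x: (T,) mono waveform; W: (C, K) filterbank of C basis filters, length K.
    Returns a non-negative latent representation of shape (C, L)."""
    C, K = W.shape
    L = (len(x) - K) // stride + 1
    out = np.empty((C, L))
    for l in range(L):
        seg = x[l * stride : l * stride + K]
        out[:, l] = W @ seg            # one frame: C dot products
    return np.maximum(out, 0.0)        # ReLU keeps the encoding non-negative

# toy example with kernel size K_E = 16 and stride K_E / 2 = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(8000)          # 1 second of audio at 8 kHz
W = rng.standard_normal((128, 16))     # 128 illustrative basis filters
v = encoder(x, W, stride=8)
print(v.shape)                         # prints (128, 999)
```

The stride of half the kernel length gives 50% overlap between adjacent analysis frames, matching the Conv-TasNet-style adaptive front-end described above.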
11. Architecture – Separator
– In essence, the separator S module performs the following five transformations (steps 1–5 below) on the encoded mixture representation.
12. Architecture – Separator
1. Projects the encoded mixture representation to a new channel space through a layer normalization followed by a point-wise convolution.
2. Performs repetitive non-linear transformations provided by B U-convolutional blocks on this representation. The output of the i-th U-ConvBlock is used as the input of the (i + 1)-th block.
3. Aggregates the information over multiple channels by applying a regular one-dimensional convolution for each source on the transposed feature representation.
13. Architecture – Separator
3. (cont.) This step has been introduced in Two step sound source separation and
empirically shown to make the training process more stable.
4. Combines the aforementioned latent codes for all sources by performing a
softmax operation in order to get mask estimates, where mask coefficients all lie
in [0,1].
5. Estimates a latent representation for each source by multiplying element-wise
the encoded mixture representation with the corresponding mask.
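Steps 4 and 5 can be sketched in a few lines of numpy (the shapes are illustrative and the latent codes here are random; in the model they come from the separator):

```python
import numpy as np

def apply_source_masks(v, z):
    """Steps 4-5 of the separator: a softmax over the source axis turns the
    per-source latent codes into masks in [0, 1] that sum to one per latent
    bin, and each mask multiplies the encoded mixture element-wise.
    v: (C, L) encoded mixture; z: (N, C, L) per-source latent codes."""
    e = np.exp(z - z.max(axis=0, keepdims=True))   # numerically stable softmax
    masks = e / e.sum(axis=0, keepdims=True)       # (N, C, L), sums to 1 over N
    return masks * v                               # broadcasts to (N, C, L)

rng = np.random.default_rng(0)
v = rng.random((128, 999))                         # encoded mixture
z = rng.standard_normal((2, 128, 999))             # N = 2 estimated sources
latents = apply_source_masks(v, z)
```

Because the masks sum to one across sources, the masked latents always add back up to the encoded mixture, which is one reason the softmax formulation keeps training stable.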
14. Architecture – U-ConvBlock
– U-ConvBlock uses a block structure which resembles a depth-wise separable convolution but, unlike Conv-TasNet, without a skip connection and residual path.
– It extracts information from multiple resolutions using Q successive temporal downsampling and Q upsampling operations, resembling a U-Net architecture.
– We postulate that this whole resampling procedure extracts features at multiple scales while efficiently increasing the effective receptive field.
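The resampling path can be illustrated with a simplified numpy sketch (mean-pooling stands in for the learned depth-wise convolutions of the real block, and the learned upsampling is replaced by nearest-neighbour repetition):

```python
import numpy as np

def u_conv_resample(y, Q=4):
    """Simplified sketch of the U-ConvBlock resampling path: Q successive 2x
    temporal downsampling steps, then every scale is upsampled back to the
    input length and summed, so the output aggregates information from
    Q + 1 temporal resolutions. y: (C, T) with T divisible by 2**Q."""
    C, T = y.shape
    scales = [y]
    for _ in range(Q):                              # successive downsampling
        d = scales[-1]
        d = d.reshape(C, d.shape[1] // 2, 2).mean(axis=2)
        scales.append(d)
    out = np.zeros_like(y)
    for d in scales:                                # resample back and merge
        out += np.repeat(d, T // d.shape[1], axis=1)
    return out

y = np.ones((8, 64))
out = u_conv_resample(y)    # constant input: each of the 5 scales contributes 1
```

Each halving of the temporal resolution doubles the effective receptive field while the per-scale cost shrinks, which is where the FLOP savings of the block come from.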
17. Architecture – Decoder
– Our decoder module is the final step in order to transform the latent space
representation for each source back to the time domain.
– Each latent source representation is fed through a different transposed convolution
decoder.
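A transposed convolution can be sketched as overlap-add synthesis (a numpy illustration under assumed shapes, not the authors' code):

```python
import numpy as np

def decoder(s, W, stride):
    """Transposed 1D convolution as overlap-add: each latent frame s[:, l]
    weights the C synthesis filters, and the results are accumulated every
    `stride` samples. s: (C, L) latent source; W: (C, K) synthesis
    filterbank; returns a (T,) waveform."""
    C, K = W.shape
    L = s.shape[1]
    x = np.zeros((L - 1) * stride + K)
    for l in range(L):
        x[l * stride : l * stride + K] += s[:, l] @ W   # overlap-add one frame
    return x

rng = np.random.default_rng(0)
s = rng.standard_normal((128, 999))    # latent representation of one source
W = rng.standard_normal((128, 16))     # illustrative synthesis filterbank
x = decoder(s, W, stride=8)            # (999 - 1) * 8 + 16 = 8000 samples
```

With kernel size 16 and stride 8 this exactly inverts the frame layout of the encoder sketch above, recovering a waveform of the original length.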
18. Architecture – Variations 1
– In the improved version of the proposed architecture, namely SuDoRM-RF++, the model directly estimates the latent representation for each target signal and uses only one decoder module.
19. Architecture – Variations 1
– Differences from the original version:
– Replace the mask estimation and element-wise multiplication process with a direct
estimation of the latent target signals.
– Use only one trainable decoder.
– Replace the layer normalization layers with global layer normalization.
– Simplify the activation layers: PReLU activations with only one shared learnable parameter are used.
– We would like to underline that original SuDoRM-RF models could potentially outperform
the improved SuDoRM-RF++ variation in cases where the direct estimation of the latent
targets would be more difficult.
– Moreover, the alternative of having two decoders, as proposed in the initial version, might be more useful when solving an audio source separation problem with two distinct classes of sounds (e.g. speech enhancement), where each decoder could be fine-tuned separately.
20. Architecture – Variations 2
– We also propose a new variation of our model, namely SuDoRM-RF++ GC, in which we incorporate group communication.
– In the proposed architecture, the intermediate representations are being
processed in groups of sub-bands of channels.
– We divide the channels of each 1 × 1 convolutional block into 16 groups and we
process them first independently by sharing the parameters across all groups of
sub-bands.
– After, we apply a self-attention module to combine them.
– The resulting architecture leads to a significant reduction in the number of trainable parameters.
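The parameter saving of group communication can be sketched as follows (a numpy illustration; the self-attention module that recombines the groups is omitted, and the channel count C = 512 is an assumed example, with G = 16 groups as in the slide):

```python
import numpy as np

def grouped_pointwise(h, W, G=16):
    """Sketch of group communication: the C channels are split into G sub-band
    groups and every group is transformed by the SAME small weight matrix, so
    a 1x1 convolution needs (C/G)^2 parameters instead of C^2.
    h: (C, L) features; W: (C//G, C//G) weights shared across all groups."""
    C, L = h.shape
    g = h.reshape(G, C // G, L)                 # split channels into G groups
    out = np.einsum('ij,gjl->gil', W, g)        # same W applied to every group
    return out.reshape(C, L)

h = np.ones((512, 4))
out = grouped_pointwise(h, np.eye(32))          # identity weights: out == h

# parameter count: a full 1x1 conv on 512 channels holds 512 * 512 = 262144
# weights, while 16 shared groups need only 32 * 32 = 1024.
```

Sharing one weight matrix across all groups is what shrinks the parameter count; the attention step then restores the cross-group information flow that independent processing loses.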
22. Architecture – Variations 3
– Our latest extension of the proposed model, named C-SuDoRM-RF++, is able to run online and enables streamable processing for real-time applications.
– Differences from SuDoRM-RF++:
– Replace all non-causal convolutions
with causal counterparts.
– Remove all the normalization layers.
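Replacing non-causal convolutions with causal counterparts amounts to padding only on the left, so each output sample depends on current and past inputs, never future ones. A generic sketch of this, not the model's actual layers:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution sketch: left-pad with k-1 zeros so that
    output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    x_padded = np.concatenate([np.zeros(k - 1), x])
    # np.convolve flips the kernel, so this is a true convolution;
    # 'valid' on the padded signal returns exactly len(x) samples
    return np.convolve(x_padded, kernel, mode='valid')
```

Feeding an impulse at t = 0 spreads the kernel strictly forward in time (`[1, 2, 3, 0]` for kernel `[1, 2, 3]` on a length-4 input), confirming no future samples are used.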
Experiment – Datasets
– Speech: WSJ0-2mix (2 active speakers)
– Speaker mixtures are generated by randomly mixing speech utterances from WSJ0 with two active speakers at a random SNR between −5 and 5 dB.
– Non-speech: ESC50 (2 active sources)
– ESC50 consists of a wide variety of sounds (non-speech human sounds, animal sounds, natural soundscapes, interior sounds and urban noises). For each data sample, two audio sources are mixed at a random SNR between −2.5 and 2.5 dB, where each source belongs to a distinct sound category out of a total of 50.
– Universal sound separation: FUSS (1~4 active sources)
– We also evaluate our models under a purely universal sound separation setup
where we do not know how many sources are active in each input mixture.
Experiment – Preprocessing
– For fixed number of sources:
1. randomly choosing two sound classes or speakers
2. randomly cropping 4 sec segments from the two source audio files
3. mixing the source segments at a random SNR
4. generating 20,000 new training mixtures for each epoch
5. 3,000 mixtures each for the validation and test sets
6. down-sampling each audio clip to 8 kHz
– For variable number of sources:
1. using the same dataset splits as the ones provided (10 sec, 16 kHz)
2. generating 20,000 new training mixtures for each epoch
3. 5,000 and 3,000 mixtures for the validation and test sets, respectively
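The random-SNR mixing step can be sketched with a hypothetical helper (`mix_at_snr` is our name and a simplified version of such a pipeline, not the authors' code):

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Mix two sources so that s1 sits at snr_db relative to s2.
    Illustrative preprocessing sketch."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    # scale s2 so that 10*log10(p1 / p2_scaled) == snr_db
    target_p2 = p1 / (10 ** (snr_db / 10))
    s2_scaled = s2 * np.sqrt(target_p2 / (p2 + 1e-9))
    return s1 + s2_scaled, s1, s2_scaled

rng = np.random.default_rng(0)
a = rng.standard_normal(8000 * 4)   # 4 sec segment at 8 kHz
b = rng.standard_normal(8000 * 4)
snr = rng.uniform(-5.0, 5.0)        # random SNR, speech-mixture range
mix, t1, t2 = mix_at_snr(a, b, snr)
```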
Experiment – Loss
– For fixed number of sources:
– For variable number of sources:
– N denotes the maximum number of sources; N′ denotes the number of active sources.
– The loss function follows the FUSS baseline.
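For the fixed-source case, the objective is the negative permutation-invariant SI-SDR; a sketch of its standard form (the paper's exact stabilization terms may differ):

```latex
\operatorname{SI\text{-}SDR}(\hat{s}, s) = 10 \log_{10}
  \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2},
\qquad \alpha = \frac{\hat{s}^{\top} s}{\lVert s \rVert^2}

\mathcal{L} = - \max_{\pi \in \mathcal{P}} \frac{1}{N}
  \sum_{i=1}^{N} \operatorname{SI\text{-}SDR}\big(\hat{s}_{\pi(i)}, s_i\big)
```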
Experiment – Evaluation
– To evaluate the performance of our models, we use a stable version of the permutation-invariant SI-SDRi.
– For cases where non-active sources exist (N′ < N), we omit computing the metric on those sources, as it yields infinite values.
– The * sign denotes the best permutation and P denotes the set of all possible permutations; ε = 1e−9 and τ = 1e−3 resolve numerical stability issues caused by zero target signals.
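A minimal numpy sketch of such a stabilized SI-SDR, reusing the ε and τ constants quoted on this slide (the exact placement of the stabilizers inside the formula is our assumption):

```python
import numpy as np

def stabilized_si_sdr(est, target, eps=1e-9, tau=1e-3):
    """SI-SDR with stabilizers: eps and tau keep the metric finite even
    when the target signal is all zeros (a non-active source)."""
    alpha = np.dot(est, target) / (np.dot(target, target) + eps)
    s_t = alpha * target                    # scaled target projection
    num = np.dot(s_t, s_t) + eps
    den = np.dot(s_t - est, s_t - est) + tau * np.dot(target, target) + eps
    return 10 * np.log10(num / den)
```

Note that τ caps the best attainable score (perfect reconstruction gives ≈ 30 dB for τ = 1e−3), while ε keeps the logarithm defined for zero targets.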
Experiment – Model Conf.
– For the encoder/decoder we use a kernel size of 21 for input mixtures sampled at 8 kHz and a kernel size of 41 for 16 kHz. The number of bases (channels) is 512.
– For the configuration of each U-ConvBlock we set the number of input channels to 128, the number of successive resampling operations to 4, and the expanded number of channels to 512. Each subsampling operation reduces the temporal dimension by a factor of 2, and all depth-wise separable convolutions have a kernel length of 5 and a stride of 2.
– For C-SuDoRM-RF++, we increase the number of input channels to 256 and the
default kernel length to 11 in order to increase the receptive field.
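The hyperparameters above can be summarized in a small configuration sketch, together with a helper showing how the temporal dimension shrinks across the successive resampling operations (the names and the padding-free length arithmetic are illustrative, not taken from the released code):

```python
# Hypothetical configuration summary mirroring the slide's hyperparameters.
SUDORM_RF_PP = {
    "encoder_kernel": 21,       # 8 kHz input (41 for 16 kHz)
    "encoder_stride": 21 // 2,  # stride defined as kernel // 2
    "n_basis": 512,
    "block_in_channels": 128,
    "block_expanded_channels": 512,
    "n_resampling": 4,          # successive down/up-sampling stages
    "resample_kernel": 5,
    "resample_stride": 2,       # halves the temporal dimension per stage
}

def resampled_lengths(T, n_stages=4, factor=2):
    """Temporal lengths after each successive downsampling inside a
    U-ConvBlock (illustrative; ignores padding details)."""
    lengths = [T]
    for _ in range(n_stages):
        lengths.append(lengths[-1] // factor)
    return lengths
```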
Experiment – Model Conf.
– For brevity, SuDoRM-RF 2.0x, SuDoRM-RF 1.0x, SuDoRM-RF 0.5x, and SuDoRM-RF 0.25x denote configurations with 32, 16, 8, and 4 U-ConvBlocks, respectively.
– The same applies to the improved version SuDoRM-RF++ and its causal variation C-
SuDoRM-RF++.
Experiment – Comparison
– One of the main goals of this study is to propose models for audio source
separation which could be trained using limited computational resources and
deployed easily on a mobile or edge-computing device.
– We consider the following aspects:
1. Number of executed floating point operations (FLOPs).
2. Number of trainable parameters.
3. Memory allocation required on the device for a single pass.
4. Time for completing each process.
Result & Discussion
– Two active sources
– FLOPs
– Training cost
– Memory
– Variable active sources
– Causal capability
Result – Two sources
– In Table 1, we focus on the computational resources required during a forward pass of those models on a CPU.
– In Table 2, the same computational resource requirements are shown for a
backward pass on GPU as well as the number of trainable parameters.
Result – FLOPs
– One computational resource of particular interest is the number of FLOPs required during inference.
– In Figure 4, SuDoRM-RF models scale well as we increase the number of U-
ConvBlocks B from 4 → 8 → 16.
– SuDoRM-RF++ achieves similar or even better results than the original version of
the proposed model SuDoRM-RF with a lower number of FLOPs both in forward
and backward for a similar number of parameters and execution time.
– A significant drop in the absolute number of FLOPs is also obtained by combining the group communication mechanism.
Result – Training cost
– Usually, one of the most detrimental factors in training deep learning models is the need to allocate multiple GPU devices for several days or weeks until adequate performance is reached on the validation set.
– SuDoRM-RF 1.0x converges faster.
Result – Memory
– Group communication combined with SuDoRM-RF++ is one of the most effective
ways to reduce the number of trainable parameters caused by the bottleneck
dense layers between the U-ConvBlocks.
– However, the trainable parameters make up only a small portion of the total memory required for a single forward or backward pass. The space complexity can easily be dominated by the storage of intermediate representations rather than by the parameters' footprint.
– It can become even worse when multiple skip connections are present, gradients from multiple layers have to be stored, or implementations require augmented matrices (dilated or transposed convolutions, etc.).
Result – WSJ0-2mix
– We perform an ablation study of SuDoRM-RF++ on the WSJ0-2mix dataset.
– The stride of the encoder and decoder is always defined as kernel // 2. Decreasing the stride forces the model to perform more computations and estimate the signal at a finer scale, closer to the time-domain resolution, leading to better results.
– GLN significantly helps our model to reach a better
solution.
– Using more U-ConvBlocks yields better results.
Result – FUSS
– We see that by increasing our SuDoRM-RF and SuDoRM-RF++ model sizes to match the size of TDCN++, we can match its performance.
– For single-source mixtures, our models perform worse than TDCN++; however, above 25 dB it is difficult even for a human listener to perceive the subtle artifacts, which are barely audible.
Result – Causal setup
– We are able to obtain competitive separation performance for all configurations while remaining ≈ 10 to 20 times faster than real time.
– However, the performance still leaves room for improvement.
Conclusions
– Our experiments suggest that SuDoRM-RF models
a) can be deployed on devices with limited resources,
b) can be trained significantly faster while achieving good separation performance, and
c) scale well when the number of parameters is increased.
– We show that these models can perform similarly to or even better than recent state-of-the-art models while requiring significantly fewer computational resources in terms of FLOPs, memory, and time, for experiments with both a fixed and a variable number of sources.